11 Sep, 2016

1 commit


24 Apr, 2016

1 commit


12 Apr, 2016

1 commit

  • Vivek reported a kernel exception deleting a VRF with an active
    connection through it. The root cause is that the socket has a cached
    reference to a dst that is destroyed. Converting the dst_destroy to
    dst_release and letting proper reference counting kick in does not
    work as the dst has a reference to the device which needs to be released
    as well.

    I talked to Hannes about this at netdev and he pointed out the ipv4 and
    ipv6 dst handling has dst_ifdown for just this scenario. Rather than
    continuing with the reinvented dst wheel in VRF just remove it and
    leverage the ipv4 and ipv6 versions.

    Fixes: 193125dbd8eb2 ("net: Introduce VRF device driver")
    Fixes: 35402e3136634 ("net: Add IPv6 support to VRF device")

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

08 Apr, 2016

1 commit

  • In inet_iif check if skb_rtable is NULL for the skb and return
    skb->skb_iif if it is.

    This change allows inet_iif to be called before the dst
    information has been set in the skb (e.g. when doing socket based
    UDP GRO).

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
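
A minimal userspace sketch of the fallback this commit describes. The struct and function names (`model_rtable`, `model_skb`, `model_inet_iif`) are hypothetical stand-ins for the kernel types, and the "zero means unset" handling of `rt_iif` is an assumption taken from the 24 Jul, 2012 entry in this log, not from this commit.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct rtable. */
struct model_rtable {
	int rt_iif;                     /* input interface on the route, 0 if unset */
};

/* Hypothetical stand-in for the kernel's struct sk_buff. */
struct model_skb {
	const struct model_rtable *rt;  /* NULL until dst information is attached */
	int skb_iif;                    /* interface the packet arrived on */
};

/* Return the input interface index, tolerating a missing route. */
static int model_inet_iif(const struct model_skb *skb)
{
	const struct model_rtable *rt = skb->rt;

	/* Only trust the route when it exists and carries an interface. */
	if (rt && rt->rt_iif)
		return rt->rt_iif;

	return skb->skb_iif;
}
```

With this shape, a caller that runs before the route is attached (the UDP GRO case mentioned above) gets `skb_iif` instead of dereferencing a NULL pointer.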
     

17 Feb, 2016

1 commit


05 Jan, 2016

1 commit

  • Commands run in a vrf context are not failing as expected on a route lookup:
    root@kenny:~# ip ro ls table vrf-red
    unreachable default

    root@kenny:~# ping -I vrf-red -c1 -w1 10.100.1.254
    ping: Warning: source address might be selected on device other than vrf-red.
    PING 10.100.1.254 (10.100.1.254) from 0.0.0.0 vrf-red: 56(84) bytes of data.

    --- 10.100.1.254 ping statistics ---
    2 packets transmitted, 0 received, 100% packet loss, time 999ms

    Since the vrf table does not have a route for 10.100.1.254 the ping
    should have failed. The saddr lookup causes a full VRF table lookup.
    Propagating a lookup failure to the user allows the command to fail as
    expected:

    root@kenny:~# ping -I vrf-red -c1 -w1 10.100.1.254
    connect: No route to host

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

07 Oct, 2015

2 commits


05 Oct, 2015

1 commit

  • ICMP packets are inspected to let them route together with the flow they
    belong to, minimizing the chance that a problematic path will affect flows
    on other paths, and so that anycast environments can work with ECMP.

    Signed-off-by: Peter Nørlund
    Signed-off-by: David S. Miller

    Peter Nørlund
     

30 Sep, 2015

2 commits


27 Sep, 2015

1 commit


26 Sep, 2015

1 commit


18 Sep, 2015

1 commit

  • Steffen reported that the recent change to add oif to dst lookups breaks
    the VTI use case. The problem is that with the oif set in the flow struct
    the comparison to the nh_oif is triggered. Fix by splitting the
    FLOWI_FLAG_VRFSRC into 2 flags -- one that triggers the vrf device cache
    bypass (FLOWI_FLAG_VRFSRC) and another telling the lookup to not compare
    nh oif (FLOWI_FLAG_SKIP_NH_OIF).

    Fixes: 42a7b32b73d6 ("xfrm: Add oif to dst lookups")

    Signed-off-by: David Ahern
    Acked-by: Steffen Klassert
    Signed-off-by: David S. Miller

    David Ahern
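
The flag split can be pictured with a small userspace model. Everything here is illustrative: the flag values, `struct model_flow`, and `model_nh_matches` are hypothetical, and the sketch only mirrors the behaviour described in the message (the nexthop oif comparison is skipped when the second flag is set).

```c
#include <stdbool.h>

/* Hypothetical flag values; the kernel's actual bit assignments may differ. */
#define MODEL_FLAG_VRFSRC      0x1u  /* bypass the vrf device cache */
#define MODEL_FLAG_SKIP_NH_OIF 0x2u  /* do not compare the nexthop oif */

struct model_flow {
	int oif;             /* requested output interface, 0 if none */
	unsigned int flags;
};

/* Decide whether a nexthop with interface nh_oif is acceptable for the flow. */
static bool model_nh_matches(const struct model_flow *fl, int nh_oif)
{
	if (fl->flags & MODEL_FLAG_SKIP_NH_OIF)
		return true;

	/* Without the flag, a set oif must match the nexthop's interface. */
	return fl->oif == 0 || fl->oif == nh_oif;
}
```

Keeping the two behaviours on separate bits lets a caller set an oif and still get the nexthop comparison, which is presumably what the VTI case needs.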
     

16 Sep, 2015

1 commit


02 Sep, 2015

1 commit

  • A number of VRF patches used 'int' for table id. It should be u32 to be
    consistent with the rest of the stack.

    Fixes:
    4e3c89920cd3a ("net: Introduce VRF related flags and helpers")
    15be405eb2ea9 ("net: Add inet_addr lookup by table")
    30bbaa1950055 ("net: Fix up inet_addr_type checks")
    021dd3b8a142d ("net: Add routes to the table associated with the device")
    dc028da54ed35 ("inet: Move VRF table lookup to inlined function")
    f6d3c19274c74 ("net: FIB tracepoints")

    Signed-off-by: David Ahern
    Reviewed-by: Nikolay Aleksandrov
    Signed-off-by: David S. Miller

    David Ahern
     

21 Aug, 2015

1 commit

  • Currently, the lwtunnel state resides in per-protocol data. This is
    a problem if we encapsulate ipv6 traffic in an ipv4 tunnel (or vice versa).
    The xmit function of the tunnel does not know whether the packet has been
    routed to it by ipv4 or ipv6, yet it needs the lwtstate data. Moving the
    lwtstate data to dst_entry makes such inter-protocol tunneling possible.

    As a bonus, this brings a nice diffstat.

    Signed-off-by: Jiri Benc
    Acked-by: Roopa Prabhu
    Acked-by: Thomas Graf
    Signed-off-by: David S. Miller

    Jiri Benc
     

14 Aug, 2015

3 commits

  • Currently inet_addr_type and inet_dev_addr_type expect local addresses
    to be in the local table. With the VRF device local routes for devices
    associated with a VRF will be in the table associated with the VRF.
    Provide an alternate inet_addr lookup to use a specific table rather
    than defaulting to the local table.

    inet_addr_type_dev_table keeps the same semantics as inet_addr_type but
    if the passed in device is enslaved to a VRF then the table for that VRF
    is used for the lookup.

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     
  • Currently inet_addr_type and inet_dev_addr_type expect local addresses
    to be in the local table. With the VRF device local routes for devices
    associated with a VRF will be in the table associated with the VRF.
    Provide an alternate inet_addr lookup to use a specific table rather
    than defaulting to the local table.

    Signed-off-by: Shrijeet Mukherjee
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     
  • As with ingress use the index of VRF master device for route lookups on
    egress. However, the oif should only be used to direct the lookups to a
    specific table. Routes in the table are not based on the VRF device but
    rather interfaces that are part of the VRF so do not consider the oif for
    lookups within the table. The FLOWI_FLAG_VRFSRC is used to control this
    latter part.

    Signed-off-by: Shrijeet Mukherjee
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
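
For the inet_addr_type_dev_table entry under this date, the table choice can be sketched in userspace as follows. `struct model_netdev` and `model_addr_lookup_table` are hypothetical names, and representing "not enslaved to a VRF" as a table id of 0 is an assumption of this sketch.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_RT_TABLE_LOCAL 255u  /* id of the local routing table */

/* Hypothetical stand-in for a network device. */
struct model_netdev {
	uint32_t vrf_table;  /* table of the owning VRF, 0 if not enslaved */
};

/* Pick the table to search for local addresses on behalf of dev. */
static uint32_t model_addr_lookup_table(const struct model_netdev *dev)
{
	if (dev && dev->vrf_table)
		return dev->vrf_table;

	/* No device, or no VRF: keep the original semantics. */
	return MODEL_RT_TABLE_LOCAL;
}
```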
     

22 Jul, 2015

1 commit


16 Jan, 2015

1 commit

  • RAW sockets with hdrinc suffer from contention on rt_uncached_lock
    spinlock.

    One solution is to use percpu lists, since most routes are destroyed
    by the cpu that created them.

    It is unclear why we even have to put these routes in uncached_list,
    as all outgoing packets should be freed when a device is dismantled.

    Signed-off-by: Eric Dumazet
    Fixes: caacf05e5ad1 ("ipv4: Properly purge netdev references on uncached routes.")
    Signed-off-by: David S. Miller

    Eric Dumazet
     

25 Mar, 2014

1 commit


14 Jan, 2014

1 commit

  • While forwarding we should not use the protocol path mtu to calculate
    the mtu for a forwarded packet but instead use the interface mtu.

    We mark forwarded skbs in ip_forward with IPSKB_FORWARDED, which was
    introduced for multicast forwarding. But as it does not conflict with
    our usage in unicast code path it is perfect for reuse.

    I moved the functions ip_sk_accept_pmtu, ip_sk_use_pmtu and ip_skb_dst_mtu
    along with the new ip_dst_mtu_maybe_forward to net/ip.h to fix circular
    dependencies because of IPSKB_FORWARDED.

    Because someone might have written software that probes destinations
    manually and expects the kernel to honour those path mtus
    I introduced a new per-namespace "ip_forward_use_pmtu" knob so someone
    can disable this new behaviour. We also still use mtus which are locked on a
    route for forwarding.

    The reason for this change is that path mtu information can be injected
    into the kernel via e.g. icmp_err protocol handler without verification
    of local sockets. As such, this could cause the IPv4 forwarding path to
    wrongfully emit fragmentation needed notifications or start to fragment
    packets along a path.

    Tunnel and ipsec output paths clear IPCB again, thus IPSKB_FORWARDED
    won't be set and further fragmentation logic will use the path mtu to
    determine the fragmentation size. They also recheck packet size with
    help of path mtu discovery and report appropriate errors.

    Cc: Eric Dumazet
    Cc: David Miller
    Cc: John Heffner
    Cc: Steffen Klassert
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
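
The MTU selection described in this message can be modelled in a few lines of userspace C. This is a sketch under assumptions: `struct model_route` and `model_mtu_maybe_forward` are hypothetical, and the per-namespace knob is passed in as a plain boolean.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the route and device state involved. */
struct model_route {
	unsigned int path_mtu;  /* MTU recorded on the route (possibly learned) */
	unsigned int dev_mtu;   /* MTU of the outgoing interface */
	bool mtu_locked;        /* route MTU was locked by the administrator */
};

/*
 * Choose the MTU for a packet. forwarding marks packets that went through
 * the forwarding path; forward_use_pmtu models the ip_forward_use_pmtu knob.
 */
static unsigned int model_mtu_maybe_forward(const struct model_route *rt,
					    bool forwarding,
					    bool forward_use_pmtu)
{
	/* Locally generated traffic, locked MTUs, and the opt-out knob
	 * all keep using the route's MTU. */
	if (forward_use_pmtu || rt->mtu_locked || !forwarding)
		return rt->path_mtu;

	/* Forwarded packets ignore learned path MTUs. */
	return rt->dev_mtu;
}
```

For a route with `path_mtu = 1400` and `dev_mtu = 1500`, a forwarded packet is sized against 1500 unless the knob is enabled or the MTU is locked.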
     

06 Dec, 2013

1 commit


06 Nov, 2013

1 commit

  • Sockets marked with IP_PMTUDISC_INTERFACE won't do path mtu discovery,
    their sockets won't accept and install new path mtu information and they
    will always use the interface mtu for outgoing packets. It is guaranteed
    that the packet is not fragmented locally. But we won't set the DF-Flag
    on the outgoing frames.

    Florian Weimer had the idea to use this flag to ensure DNS servers are
    never generating outgoing fragments. They may well be fragmented on the
    path, but the server never stores or uses path mtu values, which could
    well be forged in an attack.

    (The root of the problem with path MTU discovery is that there is
    no reliable way to authenticate ICMP Fragmentation Needed But DF Set
    messages because they are sent from intermediate routers with their
    source addresses, and the ICMP payload will not always contain sufficient
    information to identify a flow.)

    Recent research in the DNS community showed that it is possible to
    implement an attack where DNS cache poisoning is feasible by spoofing
    fragments. This work was done by Amir Herzberg and Haya Shulman.

    This issue was previously discussed among the DNS community
    without leading to fixes.

    This patch depends on the patch "ipv4: fix DO and PROBE pmtu mode
    regarding local fragmentation with UFO/CORK" for the enforcement of the
    non-fragmentable checks. If other users than ip_append_page/data should
    use this semantic too, we have to add a new flag to IPCB(skb)->flags to
    suppress local fragmentation and check for this in ip_finish_output.

    Many thanks to Florian Weimer for the idea and feedback while implementing
    this patch.

    Cc: David S. Miller
    Suggested-by: Florian Weimer
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
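
From userspace, a socket would opt in through the `IP_MTU_DISCOVER` socket option. The sketch below assumes a Linux system; the helper name `use_interface_mtu` is hypothetical, and the fallback `#define` is only there for libc headers that predate the constant.

```c
#include <netinet/in.h>
#include <sys/socket.h>

#ifndef IP_PMTUDISC_INTERFACE
#define IP_PMTUDISC_INTERFACE 4  /* assumed to match linux/in.h */
#endif

/*
 * Ask the kernel to size outgoing packets on fd by the interface MTU and
 * to ignore path MTU information. Returns 0 on success, -1 with errno set
 * on failure (for example on kernels without this mode).
 */
static int use_interface_mtu(int fd)
{
	int mode = IP_PMTUDISC_INTERFACE;

	return setsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER, &mode, sizeof(mode));
}
```

A DNS server would call this once on its UDP socket after `socket()` and before sending replies.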
     

18 Oct, 2013

1 commit


29 Sep, 2013

1 commit

  • If IP_TOS or IP_TTL are specified as ancillary data, then sendmsg() sends out
    packets with the specified TTL or TOS overriding the socket values specified
    with the traditional setsockopt().

    The struct inet_cork stores the values of TOS, TTL and priority that are
    passed through the struct ipcm_cookie. If there are user-specified TOS
    (tos != -1) or TTL (ttl != 0) in the struct ipcm_cookie, these values are
    used to override the per-socket values. In case of TOS also the priority
    is changed accordingly.

    Two helper functions get_rttos and get_rtconn_flags are defined to take
    into account the presence of a user specified TOS value when computing
    RT_TOS and RT_CONN_FLAGS.

    Signed-off-by: Francesco Fusco
    Signed-off-by: David S. Miller

    Francesco Fusco
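
On the sending side, the ancillary data could be assembled roughly as below. This is a sketch, not the patch's own test code: `attach_tos_ttl` and `TOS_TTL_CMSG_SPACE` are hypothetical names, and the int-sized payloads are an assumption based on how these options are passed to setsockopt().

```c
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Control buffer size for one IP_TOS and one IP_TTL message. */
#define TOS_TTL_CMSG_SPACE (2 * CMSG_SPACE(sizeof(int)))

/*
 * Point msg at buf and fill it with IP_TOS and IP_TTL control messages.
 * buf must be suitably aligned for struct cmsghdr and at least
 * TOS_TTL_CMSG_SPACE bytes long. Returns 0 on success, -1 otherwise.
 */
static int attach_tos_ttl(struct msghdr *msg, void *buf, size_t buflen,
			  int tos, int ttl)
{
	struct cmsghdr *cmsg;

	if (buflen < TOS_TTL_CMSG_SPACE)
		return -1;

	/* Zero first: CMSG_NXTHDR may inspect the header that follows. */
	memset(buf, 0, buflen);
	msg->msg_control = buf;
	msg->msg_controllen = TOS_TTL_CMSG_SPACE;

	cmsg = CMSG_FIRSTHDR(msg);
	if (!cmsg)
		return -1;
	cmsg->cmsg_level = IPPROTO_IP;
	cmsg->cmsg_type = IP_TOS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &tos, sizeof(int));

	cmsg = CMSG_NXTHDR(msg, cmsg);
	if (!cmsg)
		return -1;
	cmsg->cmsg_level = IPPROTO_IP;
	cmsg->cmsg_type = IP_TTL;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &ttl, sizeof(int));

	return 0;
}
```

The caller then fills in `msg_name` and `msg_iov` as usual and passes the `msghdr` to sendmsg(); the values apply to that call only, leaving the socket's own TOS and TTL untouched.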
     

23 Sep, 2013

1 commit

  • There is a mix of function prototypes with and without extern
    in the kernel sources. Standardize on not using extern for
    function prototypes.

    Function prototypes don't need to be written with extern.
    extern is assumed by the compiler. Its use is as unnecessary as
    using auto to declare automatic/local variables in a block.

    Signed-off-by: Joe Perches
    Signed-off-by: David S. Miller

    Joe Perches
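
A small illustration of the point about extern, using a made-up function `add_one`: both declarations below are compatible and declare the same function with external linkage, so the keyword adds nothing on a prototype.

```c
/* Prototype written with extern. */
extern int add_one(int x);

/* The same prototype without extern; this redeclaration is compatible. */
int add_one(int x);

/* Single definition matching both declarations. */
int add_one(int x)
{
	return x + 1;
}
```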
     

14 Aug, 2013

1 commit

  • skb->sk socket can be of AF_INET or AF_INET6 address family. Thus we
    always have to make sure we are referring to the correct interpretation
    of skb->sk.

    We only depend on header defines to query the mtu, so we don't introduce
    a new dependency to ipv6 by this change.

    Cc: Steffen Klassert
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: Steffen Klassert

    Hannes Frederic Sowa
     

04 Nov, 2012

1 commit

  • We can save a test in ip_rt_put(), considering dst_release() accepts
    a NULL parameter, and dst is first element in rtable.

    Add a BUILD_BUG_ON() to catch any change that could break this
    assertion.

    Signed-off-by: Eric Dumazet
    Cc: Cong Wang
    Acked-by: Cong Wang
    Signed-off-by: David S. Miller

    Eric Dumazet
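
The idea can be sketched in userspace with a compile-time layout check. The types and functions here (`model_dst`, `model_rt`, `model_dst_release`, `model_ip_rt_put`) are hypothetical, and C11 `_Static_assert` stands in for the kernel's BUILD_BUG_ON().

```c
#include <stddef.h>

/* Hypothetical stand-in for struct dst_entry. */
struct model_dst {
	int refcnt;
};

/* Hypothetical stand-in for struct rtable; dst must stay first. */
struct model_rt {
	struct model_dst dst;
	int rt_flags;
};

/* Fails the build if someone moves dst away from offset 0. */
_Static_assert(offsetof(struct model_rt, dst) == 0,
	       "dst must be the first member of model_rt");

/* Drop one reference; a NULL pointer is accepted and ignored. */
static void model_dst_release(struct model_dst *dst)
{
	if (dst)
		dst->refcnt--;
}

/* No NULL test needed here: a NULL rt converts to a NULL dst pointer. */
static void model_ip_rt_put(struct model_rt *rt)
{
	model_dst_release((struct model_dst *)rt);
}
```

The cast is only safe because of the asserted layout, which is why the check and the shortcut belong together.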
     

09 Oct, 2012

1 commit

  • Add new flag to remember when route is via gateway.
    We will use it to allow rt_gateway to contain address of
    directly connected host for the cases when DST_NOCACHE is
    used or when the NH exception caches per-destination route
    without DST_NOCACHE flag, i.e. when routes are not used for
    other destinations. This way we force neighbour resolution
    to work with the routed destination while still using a
    different address in the packet, a feature needed for
    IPVS-DR, where the original packet for the virtual IP is
    routed via the route to the real IP.

    Signed-off-by: Julian Anastasov
    Signed-off-by: David S. Miller

    Julian Anastasov
     

19 Sep, 2012

1 commit


01 Aug, 2012

1 commit

  • When a device is unregistered, we have to purge all of the
    references to it that may exist in the entire system.

    If a route is uncached, we currently have no way of accomplishing
    this.

    So create a global list that is scanned when a network device goes
    down. This mirrors the logic in net/core/dst.c's dst_ifdown().

    Signed-off-by: David S. Miller

    David S. Miller
     

27 Jul, 2012

1 commit

  • With the routing cache removal we lost the "noref" code paths on
    input, and this can kill some routing workloads.

    Reinstate the noref path when we hit a cached route in the FIB
    nexthops.

    With help from Eric Dumazet.

    Reported-by: Alexander Duyck
    Signed-off-by: David S. Miller
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    David S. Miller
     

24 Jul, 2012

1 commit

  • On input packet processing, rt->rt_iif will be zero if we should
    use skb->dev->ifindex.

    Since we access rt->rt_iif consistently via inet_iif(), that is
    the only spot whose interpretation has to be adjusted.

    Signed-off-by: David S. Miller

    David S. Miller
     

21 Jul, 2012

4 commits

  • It's not really needed.

    We only grabbed a reference to the fib_info for the sake of fib_info
    local metrics.

    However, fib_info objects are freed using RCU, as are therefore their
    private metrics (if any).

    We would have triggered a route cache flush if we eliminated a
    reference to a fib_info object in the routing tables.

    Therefore, any existing cached routes will first check and see that
    they have been invalidated before an errant reference to these
    metric values would occur.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • That is this value's only use, as a boolean to indicate whether
    a route is an input route or not.

    So implement it that way, using a u16 gap present in the struct
    already.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Never actually used.

    It was being set on output routes to the original OIF specified in the
    flow key used for the lookup.

    Adjust the only user, ipmr_rt_fib_lookup(), for greater correctness of
    the flowi4_oif and flowi4_iif values, thanks to feedback from Julian
    Anastasov.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • In order to allow prefixed routes, we have to adjust how rt_gateway
    is set and interpreted.

    The new interpretation is:

    1) rt_gateway == 0, destination is on-link, nexthop is iph->daddr

    2) rt_gateway != 0, destination requires a nexthop gateway

    Abstract the fetching of the proper nexthop value using a new
    inline helper, rt_nexthop(), as suggested by Joe Perches.

    Signed-off-by: David S. Miller
    Tested-by: Vijay Subramanian

    David S. Miller
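
The two-case interpretation of rt_gateway above maps onto a very small helper. The version below is a userspace sketch: `struct model_route_gw` and `model_rt_nexthop` are hypothetical names, and addresses are plain 32-bit integers rather than the kernel's big-endian type.

```c
#include <stdint.h>

/* Hypothetical stand-in for the relevant part of struct rtable. */
struct model_route_gw {
	uint32_t rt_gateway;  /* 0 means the destination is on-link */
};

/* Return the address to resolve as the next hop for a packet to daddr. */
static uint32_t model_rt_nexthop(const struct model_route_gw *rt,
				 uint32_t daddr)
{
	if (rt->rt_gateway)
		return rt->rt_gateway;  /* case 2: go through the gateway */

	return daddr;                   /* case 1: on-link, use the destination */
}
```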