24 Aug, 2016

1 commit

  • TFO_SERVER_WO_SOCKOPT2 was intended for debugging purposes during
    Fast Open development. Remove this config option and also
    update/clean-up the documentation of the Fast Open sysctl.

    Reported-by: Piotr Jurkiewicz
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

30 May, 2016

1 commit


12 Apr, 2016

1 commit

  • Multipath route lookups should consider knowledge about next hops and not
    select a hop that is known to be failed.

    Example:

                    [h2]                   [h3]   15.0.0.5
                     |                      |
                    3|                     3|
                   [SP1]                  [SP2]--+
                    1  2                   1     2
                    |  |     /-------------+     |
                    |   \   /                    |
                    |     X                      |
                    |    / \                     |
                    |   /   \---------------\    |
                    1  2                     1   2
        12.0.0.2  [TOR1] 3-----------------3 [TOR2] 12.0.0.3
                     4                         4
                      \                       /
                       \                     /
                        \                   /
                         -------|   |-----/
                                1   2
                               [TOR3]
                                 3|
                                  |
                                 [h1]  12.0.0.1

    host h1 with IP 12.0.0.1 has 2 paths to host h3 at 15.0.0.5:

    root@h1:~# ip ro ls
    ...
    12.0.0.0/24 dev swp1 proto kernel scope link src 12.0.0.1
    15.0.0.0/16
    nexthop via 12.0.0.2 dev swp1 weight 1
    nexthop via 12.0.0.3 dev swp1 weight 1
    ...

    If the link between tor3 and tor1 is down and the link between tor1
    and tor2 is also down, then tor1 is effectively cut off from h1. Yet the route lookups
    in h1 are alternating between the 2 routes: ping 15.0.0.5 gets one and
    ssh 15.0.0.5 gets the other. Connections that attempt to use the
    12.0.0.2 nexthop fail since that neighbor is not reachable:

    root@h1:~# ip neigh show
    ...
    12.0.0.3 dev swp1 lladdr 00:02:00:00:00:1b REACHABLE
    12.0.0.2 dev swp1 FAILED
    ...

    The failed path can be avoided by considering known neighbor information
    when selecting next hops. If the neighbor lookup fails we have no
    knowledge about the nexthop, so give it a shot. If there is an entry
    then only select the nexthop if the state is sane. This is similar to
    what fib_detect_death does.

    To maintain backward compatibility use of the neighbor information is
    based on a new sysctl, fib_multipath_use_neigh.
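
    The selection rule described above can be sketched as follows. This is
    a minimal model, not the kernel code: the function names, the set of
    "bad" neighbor states and the fallback to the full list when every
    neighbor looks dead are assumptions made for illustration.

```python
# Hypothetical set of neighbor states treated as "not sane" in this sketch.
BAD_STATES = {"FAILED", "INCOMPLETE"}

def usable(nexthop, neigh_table):
    """Return True if the nexthop may be selected."""
    state = neigh_table.get(nexthop)
    if state is None:
        return True  # no neighbor entry: nothing is known, give it a shot
    return state not in BAD_STATES

def select_nexthop(nexthops, neigh_table, flow_hash, use_neigh=True):
    """Pick a nexthop for a flow, skipping known-dead neighbors if use_neigh is set."""
    candidates = nexthops
    if use_neigh:
        alive = [nh for nh in nexthops if usable(nh, neigh_table)]
        # Assumption: if every neighbor looks dead, fall back to the full list.
        candidates = alive or nexthops
    return candidates[flow_hash % len(candidates)]
```

    With the neighbor table from the example (12.0.0.2 FAILED, 12.0.0.3
    REACHABLE), every flow hash maps to 12.0.0.3 while use_neigh is set.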

    Signed-off-by: David Ahern
    Reviewed-by: Julian Anastasov
    Signed-off-by: David S. Miller

    David Ahern
     

22 Mar, 2016

2 commits


26 Feb, 2016

1 commit

  • Currently, all ipv6 addresses are flushed when the interface is configured
    down, including global, static addresses:

    $ ip -6 addr show dev eth1
    3: eth1: mtu 1500 state UP qlen 1000
    inet6 2100:1::2/120 scope global
    valid_lft forever preferred_lft forever
    inet6 fe80::e0:f9ff:fe79:34bd/64 scope link
    valid_lft forever preferred_lft forever
    $ ip link set dev eth1 down
    $ ip -6 addr show dev eth1
    << nothing; all addresses have been flushed>>

    Add a new sysctl to make this behavior optional. The new setting defaults to
    flush all addresses to maintain backwards compatibility. When set, global
    addresses with no expiry time are not flushed on an admin down. The sysctl
    is per-interface or system-wide for all interfaces:

    $ sysctl -w net.ipv6.conf.eth1.keep_addr_on_down=1
    or
    $ sysctl -w net.ipv6.conf.all.keep_addr_on_down=1

    Will keep addresses on eth1 on an admin down.

    $ ip -6 addr show dev eth1
    3: eth1: mtu 1500 state UP qlen 1000
    inet6 2100:1::2/120 scope global
    valid_lft forever preferred_lft forever
    inet6 fe80::e0:f9ff:fe79:34bd/64 scope link
    valid_lft forever preferred_lft forever
    $ ip link set dev eth1 down
    $ ip -6 addr show dev eth1
    3: eth1: mtu 1500 state DOWN qlen 1000
    inet6 2100:1::2/120 scope global tentative
    valid_lft forever preferred_lft forever
    inet6 fe80::e0:f9ff:fe79:34bd/64 scope link tentative
    valid_lft forever preferred_lft forever

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

11 Feb, 2016

4 commits

  • In certain 802.11 wireless deployments, there will be NA proxies
    that use knowledge of the network to correctly answer requests.
    To prevent unsolicited advertisements on the shared medium from
    being a problem, on such deployments wireless needs to drop them.

    Enable this by providing an option called "drop_unsolicited_na".

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     
  • In order to solve a problem with 802.11, the so-called hole-196 attack,
    add an option (sysctl) called "drop_unicast_in_l2_multicast" which, if
    enabled, causes the stack to drop IPv6 unicast packets encapsulated in
    link-layer multi- or broadcast frames. Such frames can (as an attack)
    be created by any member of the same wireless network and transmitted
    as valid encrypted frames since the symmetric key for broadcast frames
    is shared between all stations.

    Reviewed-by: Julian Anastasov
    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     
  • In certain 802.11 wireless deployments, there will be ARP proxies
    that use knowledge of the network to correctly answer requests.
    To prevent gratuitous ARP frames on the shared medium from being
    a problem, on such deployments wireless needs to drop them.

    Enable this by providing an option called "drop_gratuitous_arp".

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     
  • In order to solve a problem with 802.11, the so-called hole-196 attack,
    add an option (sysctl) called "drop_unicast_in_l2_multicast" which, if
    enabled, causes the stack to drop IPv4 unicast packets encapsulated in
    link-layer multi- or broadcast frames. Such frames can (as an attack)
    be created by any member of the same wireless network and transmitted
    as valid encrypted frames since the symmetric key for broadcast frames
    is shared between all stations.

    Additionally, enabling this option provides compliance with a SHOULD
    clause of RFC 1122.
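
    The drop condition can be illustrated with a small check. This is a
    sketch under stated assumptions: the helper names are hypothetical,
    and only the limited broadcast address is treated as IPv4 broadcast
    because subnet-directed broadcasts need interface knowledge that is
    not modeled here.

```python
import ipaddress

def is_l2_multicast(mac):
    """True for link-layer multicast or broadcast (group bit of first octet set)."""
    first_octet = int(mac.split(":")[0], 16)
    return bool(first_octet & 0x01)

def should_drop(dst_mac, dst_ip, drop_unicast_in_l2_multicast=True):
    """Drop IPv4 unicast packets that arrived in an L2 multicast/broadcast frame."""
    ip = ipaddress.ip_address(dst_ip)
    # Assumption: only 255.255.255.255 counts as IP broadcast in this sketch.
    ip_is_unicast = not ip.is_multicast and str(ip) != "255.255.255.255"
    return (bool(drop_unicast_in_l2_multicast)
            and is_l2_multicast(dst_mac)
            and ip_is_unicast)
```

    A unicast destination inside a broadcast frame is dropped, while
    genuine multicast traffic and normal unicast frames pass.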

    Reviewed-by: Julian Anastasov
    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     

21 Jan, 2016

1 commit


19 Dec, 2015

1 commit

  • Allow accepted sockets to derive their sk_bound_dev_if setting from the
    l3mdev domain in which the packets originated. A sysctl setting is added
    to control the behavior which is similar to sk_mark and
    sysctl_tcp_fwmark_accept.

    This effectively allows a process to have a "VRF-global" listen socket,
    with child sockets bound to the VRF device in which the packet originated.
    A similar behavior can be achieved using sk_mark, but a solution using marks
    is incomplete as it does not handle duplicate addresses in different L3
    domains/VRFs. Allowing sockets to inherit the sk_bound_dev_if from l3mdev
    domain provides a complete solution.

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

16 Dec, 2015

1 commit

  • As we all know, the value of pf_retrans >= max_retrans_path can
    disable pf state. The variables of pf_retrans and max_retrans_path
    can be changed by the userspace application.

    Sometimes the user expects to disable pf state while the 2
    variables are changed to enable pf state. So it is necessary to
    introduce a new variable to disable pf state.

    According to the suggestions from Vlad Yasevich, extra1 and extra2
    are removed. The initialization of pf_enable is added.
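
    How the new variable relates to the two existing ones can be sketched
    like this. It is an illustration only; the function name is
    hypothetical and the parameter names follow the wording of this
    message rather than any particular API.

```python
def pf_state_reachable(pf_retrans, max_retrans_path, pf_enable=1):
    """Return True if a path can enter the pf state in this simplified model."""
    if not pf_enable:
        return False  # the new switch disables pf regardless of the other values
    # Without the switch, pf is only implicitly disabled when
    # pf_retrans >= max_retrans_path.
    return pf_retrans < max_retrans_path
```

    With pf_enable cleared, later changes to the other two variables can
    no longer re-enable the pf state by accident.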

    Acked-by: Vlad Yasevich
    Signed-off-by: Zhu Yanjun
    Acked-by: Marcelo Ricardo Leitner
    Signed-off-by: David S. Miller

    Zhu Yanjun
     

10 Nov, 2015

1 commit


30 Oct, 2015

1 commit

  • Steffen Klassert says:

    ====================
    pull request (net-next): ipsec-next 2015-10-30

    1) The flow cache is limited by the flow cache limit which
    depends on the number of cpus and the xfrm garbage collector
    threshold which is independent of the number of cpus. This
    leads to the fact that on systems with more than 16 cpus
    we hit the xfrm garbage collector limit and refuse new
    allocations, so new flows are dropped. On systems with 16
    or less cpus, we hit the flowcache limit. In this case, we
    shrink the flow cache instead of refusing new flows.

    We increase the xfrm garbage collector threshold to INT_MAX
    to get the same behaviour, independent of the number of cpus.

    2) Fix some unaligned accesses on sparc systems.
    From Sowmini Varadhan.

    3) Fix some header checks in _decode_session4. We may call
    pskb_may_pull with a negative value converted to unsigned
    int from pskb_may_pull. This can lead to incorrect policy
    lookups. We fix this by a check of the data pointer position
    before we call pskb_may_pull.

    4) Reload skb header pointers after calling pskb_may_pull
    in _decode_session4 as this may change the pointers into
    the packet.

    5) Add a missing statistic counter on inner mode errors.

    Please pull or let me know if there are problems.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

21 Oct, 2015

2 commits

  • This patch implements the second half of RACK that uses the most
    recent transmit time among all delivered packets to detect losses.

    tcp_rack_mark_lost() is called upon receiving a dubious ACK.
    It then checks if a not-yet-sacked packet was sent at least
    "reo_wnd" prior to the sent time of the most recently delivered.
    If so the packet is deemed lost.
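
    The check can be sketched as follows. This is a simplified model and
    not the kernel's tcp_rack_mark_lost(): the packet representation and
    the function name are assumptions for illustration.

```python
def rack_lost_packets(packets, reo_wnd_us):
    """Return indexes of packets deemed lost.

    packets: list of dicts with 'sent_us' (transmit time) and 'delivered' (bool).
    """
    delivered_times = [p["sent_us"] for p in packets if p["delivered"]]
    if not delivered_times:
        return []
    # Most recent transmit time among all delivered packets.
    rack_sent_us = max(delivered_times)
    return [
        i for i, p in enumerate(packets)
        if not p["delivered"] and rack_sent_us - p["sent_us"] >= reo_wnd_us
    ]
```

    A packet sent well before the most recently delivered one is marked
    lost, while one sent within the reordering window is given more time.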

    The "reo_wnd" reordering window starts with 1msec for fast loss
    detection and changes to min-RTT/4 when reordering is observed.
    We found 1msec accommodates well on tiny degree of reordering
    (
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • Kathleen Nichols' algorithm for tracking the minimum RTT of a
    data stream over some measurement window. It uses constant space
    and constant time per update. Yet it almost always delivers
    the same minimum as an implementation that has to keep all
    the data in the window. The measurement window is tunable via
    sysctl.net.ipv4.tcp_min_rtt_wlen with a default value of 5 minutes.

    The algorithm keeps track of the best, 2nd best & 3rd best min
    values, maintaining an invariant that the measurement time of
    the n'th best >= n-1'th best. It also makes sure that the three
    values are widely separated in the time window since that bounds
    the worst-case error when the data is monotonically increasing
    over the window.

    Upon getting a new min, we can forget everything earlier because
    it has no value - the new min is less than everything else in the
    window by definition and it's the most recent. So we restart fresh
    on every new min and overwrite the 2nd & 3rd choices. The same
    property holds for the 2nd & 3rd best.
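
    A compact model of such a three-sample filter is sketched below. It
    is not the patch itself: the class name is hypothetical, and the
    quarter-window and half-window refresh rules are an assumption about
    how the samples are kept widely separated in time.

```python
class WindowedMin:
    """Approximate minimum over the last `wlen` time units using three samples."""

    def __init__(self, wlen):
        self.wlen = wlen
        self.m = None  # [best, 2nd best, 3rd best], each a (value, time) pair

    def update(self, value, now):
        s = (value, now)
        m = self.m
        if m is None or value <= m[0][0]:
            m = [s, s, s]          # new min: forget everything earlier
        elif value <= m[1][0]:
            m = [m[0], s, s]       # new 2nd best also resets the 3rd
        elif value <= m[2][0]:
            m = [m[0], m[1], s]

        elapsed = now - m[0][1]
        if elapsed > self.wlen:
            # Best sample aged out: promote the runners-up, repeatedly if stale.
            m = [m[1], m[2], s]
            if now - m[0][1] > self.wlen:
                m = [m[1], s, s]
                if now - m[0][1] > self.wlen:
                    m = [s, s, s]
        elif m[1][1] == m[0][1] and elapsed > self.wlen / 4:
            m = [m[0], s, s]       # take 2nd choice from later in the window
        elif m[2][1] == m[1][1] and elapsed > self.wlen / 2:
            m = [m[0], m[1], s]    # take 3rd choice from the last half
        self.m = m
        return m[0][0]
```

    Each update touches at most three stored samples, which is where the
    constant space and constant time per update come from.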

    Therefore we have to maintain two invariants to maximize the
    information in the samples, one on values (1st.v
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

14 Oct, 2015

1 commit

  • Revert the commit e2ca690b657f ("ipv4/icmp: redirect messages
    can use the ingress daddr as source"), which tried to introduce a more
    suitable behaviour for ICMP redirect messages generated by VRRP routers.
    However RFC 5798 section 8.1.1 states:

    The IPv4 source address of an ICMP redirect should be the address
    that the end-host used when making its next-hop routing decision.

    while said commit used the destination address of the generating
    packet, which does not match the above and in most cases leads to
    no redirect packets being generated.

    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     

13 Oct, 2015

1 commit

  • This patch allows configuring how the source address of ICMP
    redirect messages is selected; by default the old behaviour is
    retained, while setting icmp_redirects_use_orig_daddr forces the
    usage of the destination address of the packet that caused the
    redirect.

    The new behaviour closely follows RFC 5798 section 8.1.1 and fixes the
    following scenario:

    Two machines are set up with VRRP to act as routers out of a subnet,
    they have IPs x.x.x.1/24 and x.x.x.2/24, with VRRP holding on to
    x.x.x.254/24.

    If a host in said subnet needs to get an ICMP redirect from the VRRP
    router, i.e. to reach a destination behind a different gateway, the
    source IP in the ICMP redirect is chosen as the primary IP on the
    interface that the packet arrived at, i.e. x.x.x.1 or x.x.x.2.

    The host will then ignore said redirect, due to RFC 1122 section 3.2.2.2,
    and will continue to use the wrong next-hop.

    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     

29 Sep, 2015

1 commit

  • The xfrm flowcache size is limited by the flowcache limit
    (4096 * number of online cpus) and the xfrm garbage collector
    threshold (2 * 32768), whichever is reached first. This means
    that we can hit the garbage collector limit only on systems
    with more than 16 cpus. On such systems we simply refuse
    new allocations if we reach the limit, so new flows are dropped.
    On systems with 16 or less cpus, we hit the flowcache limit.
    In this case, we shrink the flow cache instead of refusing new
    flows.

    We increase the xfrm garbage collector threshold to INT_MAX
    to get the same behaviour, independent of the number of cpus.

    The xfrm garbage collector threshold can still be set below
    the flowcache limit to reduce the memory usage of the flowcache.
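
    The arithmetic behind the 16 cpu boundary can be checked with a few
    lines. The constants are taken from the text above; the function name
    is hypothetical and the tie at exactly 16 cpus is resolved in favour
    of the flowcache limit, as the description states.

```python
FLOWCACHE_ENTRIES_PER_CPU = 4096   # flowcache limit is 4096 * online cpus
OLD_GC_LIMIT = 2 * 32768           # old xfrm garbage collector limit
INT_MAX = 2**31 - 1                # new garbage collector threshold

def first_limit_hit(online_cpus, gc_limit=OLD_GC_LIMIT):
    """Return which limit is reached first for a given cpu count."""
    flowcache_limit = FLOWCACHE_ENTRIES_PER_CPU * online_cpus
    return "flowcache" if flowcache_limit <= gc_limit else "gc"
```

    With the old limit, 4096 * 16 equals 2 * 32768, so only machines with
    more than 16 cpus reach the garbage collector limit first. With
    INT_MAX the flowcache limit always wins.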

    Tested-by: Dan Streetman
    Signed-off-by: Steffen Klassert

    Steffen Klassert
     

01 Sep, 2015

1 commit

  • Document the addition of a new sysctl variable which controls the
    generation of IGMP reports for link local multicast groups in the
    224.0.0.X range.

    IGMP reports for local multicast groups can now be optionally
    inhibited by setting the value to zero e.g.:
    echo 0 > /proc/sys/net/ipv4/igmp_link_local_mcast_reports

    To retain backwards compatibility the previous behaviour is retained
    by default on system boot or reverted by setting the value back to
    non-zero.
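
    The decision the sysctl controls can be sketched as below. This is an
    illustration with a hypothetical function name; other rules that may
    suppress reports for specific groups are not modeled.

```python
import ipaddress

# Link local multicast range named in the text (224.0.0.X).
LINK_LOCAL_MCAST = ipaddress.ip_network("224.0.0.0/24")

def should_send_igmp_report(group, igmp_link_local_mcast_reports=1):
    """Return False only for link local groups when the sysctl is zero."""
    addr = ipaddress.ip_address(group)
    if addr in LINK_LOCAL_MCAST and not igmp_link_local_mcast_reports:
        return False
    return True
```

    Groups outside 224.0.0.0/24 are unaffected by the setting.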

    Signed-off-by: Philip Downey
    Signed-off-by: David S. Miller

    Philip Downey
     

26 Aug, 2015

1 commit

  • When TCP pacing was added back in linux-3.12, we chose
    to apply a fixed ratio of 200 % against current rate,
    to allow probing for optimal throughput even during
    slow start phase, where cwnd can be doubled every other gRTT.

    At Google, we found it was better applying a different ratio
    while in Congestion Avoidance phase.
    This ratio was set to 120 %.

    We've used the normal tcp_in_slow_start() helper for a while,
    then tuned the condition to select the conservative ratio
    as soon as cwnd >= ssthresh/2 :

    - After cwnd reduction, it is safer to ramp up more slowly,
    as we approach optimal cwnd.
    - Initial ramp up (ssthresh == INFINITY) still allows doubling
    cwnd every other RTT.
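
    The ratio selection can be sketched as follows. The function name is
    hypothetical, rates are plain integers, and an infinite ssthresh is
    modeled with float("inf"); the two ratios and the ssthresh/2 condition
    come from the text above.

```python
def pacing_rate(current_rate, cwnd, ssthresh):
    """Scale the current rate by 200% early in slow start, else by 120%."""
    # The conservative ratio applies as soon as cwnd >= ssthresh / 2.
    ratio = 200 if cwnd < ssthresh / 2 else 120
    return current_rate * ratio // 100
```

    During the initial ramp up ssthresh is infinite, so the 200 % ratio
    always applies. After a cwnd reduction the 120 % ratio takes over
    once cwnd reaches half of ssthresh.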

    Signed-off-by: Eric Dumazet
    Cc: Neal Cardwell
    Cc: Yuchung Cheng
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     

12 Aug, 2015

1 commit


01 Aug, 2015

2 commits

  • Initialize auto_flowlabels to one. This enables automatic flow labels;
    individual sockets may disable them using the IPV6_AUTOFLOWLABEL socket
    option.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
  • Change the meaning of net.ipv6.auto_flowlabels to provide a mode for
    automatic flow labels generation. There are four modes:

    0: flow labels are disabled
    1: flow labels are enabled, sockets can opt-out
    2: flow labels are allowed, sockets can opt-in
    3: flow labels are enabled and enforced, no opt-out for sockets

    np->autoflowlabel is initialized according to the sysctl value.
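
    The four modes can be modeled with a small function. This is a sketch
    with hypothetical names; socket_opt stands for a socket's explicit
    choice and is None when the socket never expressed one.

```python
def uses_auto_flowlabel(mode, socket_opt=None):
    """Decide whether a socket gets automatic flow labels under a given mode."""
    if mode == 0:
        return False           # disabled for everyone
    if mode == 3:
        return True            # enforced, sockets cannot opt out
    default = (mode == 1)      # mode 1: on by default, mode 2: off by default
    return default if socket_opt is None else bool(socket_opt)
```

    Modes 1 and 2 differ only in the default applied to sockets that make
    no explicit choice.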

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     

31 Jul, 2015

1 commit

  • Commit 6fd99094de2b ("ipv6: Don't reduce hop limit for an interface")
    stopped accepting the hop limit from an RA if it is smaller than the current
    hop limit, for security reasons. But this behavior breaks the RFC definition.

    RFC 4861, 6.3.4. Processing Received Router Advertisements
    A Router Advertisement field (e.g., Cur Hop Limit, Reachable Time,
    and Retrans Timer) may contain a value denoting that it is
    unspecified. In such cases, the parameter should be ignored and the
    host should continue using whatever value it is already using.

    If the received Cur Hop Limit value is non-zero, the host SHOULD set
    its CurHopLimit variable to the received value.

    So add the sysctl option accept_ra_min_hop_limit to let users choose the
    minimum hop limit value they accept from an RA, and set the default to 1
    to meet the RFC.
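
    The resulting acceptance rule can be sketched as follows. The
    function name is hypothetical and the model covers only the hop limit
    field of a router advertisement.

```python
def updated_hop_limit(current, ra_cur_hop_limit, accept_ra_min_hop_limit=1):
    """Return the hop limit to use after processing an RA."""
    if ra_cur_hop_limit == 0:
        return current  # zero means unspecified: keep the current value
    if ra_cur_hop_limit < accept_ra_min_hop_limit:
        return current  # below the configured minimum: ignore
    return ra_cur_hop_limit
```

    With the default of 1 every non-zero value is accepted. Raising the
    minimum makes the host ignore advertisements that would lower the
    hop limit below that value.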

    Signed-off-by: Hangbin Liu
    Acked-by: YOSHIFUJI Hideaki
    Signed-off-by: David S. Miller

    Hangbin Liu
     

23 Jul, 2015

1 commit

  • Per RFC 6724, section 4, "Candidate Source Addresses":

    It is RECOMMENDED that the candidate source addresses be the set
    of unicast addresses assigned to the interface that will be used
    to send to the destination (the "outgoing" interface).

    Add a sysctl to enable this behaviour.

    Signed-off-by: Erik Kline
    Signed-off-by: David S. Miller

    Erik Kline
     

10 Jul, 2015

1 commit

  • Add support to allow non-local binds similar to how this was done for IPv4.
    Non-local binds are very useful in emulating the Internet in a box, etc.

    This adds the ip_nonlocal_bind sysctl under ipv6.

    Testing:

    Set up nonlocal binding and receive routing on a host, e.g.:

    ip -6 rule add from ::/0 iif eth0 lookup 200
    ip -6 route add local 2001:0:0:1::/64 dev lo proto kernel scope host table 200
    sysctl -w net.ipv6.ip_nonlocal_bind=1

    Set up routing to 2001:0:0:1::/64 on peer to go to first host

    ping6 -I 2001:0:0:1::1 peer-address -- to verify

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     

28 May, 2015

1 commit

  • A long standing problem on busy servers is the tiny available TCP port
    range (/proc/sys/net/ipv4/ip_local_port_range) and the default
    sequential allocation of source ports in connect() system call.

    If a host is having a lot of active TCP sessions, chances are
    very high that all ports are in use by at least one flow,
    and subsequent bind(0) attempts fail, or have to scan a big portion of
    space to find a slot.

    In this patch, I changed the starting point in __inet_hash_connect()
    so that we try to favor even [1] ports, leaving odd ports for bind()
    users.

    We still perform a sequential search, so there is no guarantee, but
    if connect() targets are very different, end result is we leave
    more ports available to bind(), and we spread them all over the range,
    lowering time for both connect() and bind() to find a slot.

    This strategy only works well if /proc/sys/net/ipv4/ip_local_port_range
    covers an even number of ports, i.e. if start/end values have different parity.

    Therefore, default /proc/sys/net/ipv4/ip_local_port_range was changed to
    32768 - 60999 (instead of 32768 - 61000)
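
    Why the upper bound moved from 61000 to 60999 can be seen by counting
    ports per parity. This is a sketch with a hypothetical helper; it
    assumes connect() favors ports with the same parity as the low end of
    the range, as footnote [1] suggests.

```python
def split_by_parity(low, high):
    """Split [low, high] into ports favored by connect() and ports left for bind()."""
    ports = range(low, high + 1)
    favored_by_connect = [p for p in ports if (p - low) % 2 == 0]
    left_for_bind = [p for p in ports if (p - low) % 2 == 1]
    return favored_by_connect, left_for_bind

new_connect, new_bind = split_by_parity(32768, 60999)
old_connect, old_bind = split_by_parity(32768, 61000)
```

    The new default gives both pools the same number of ports, while the
    old one leaves one extra port on the connect() side.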

    There is no change on security aspects here, only some poor hashing
    schemes could be eventually impacted by this change.

    [1] : The odd/even property depends on ip_local_port_range values parity

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

20 May, 2015

1 commit

  • This work is a follow-up of commit f7b3bec6f516 ("net: allow setting ecn
    via routing table") and adds RFC3168 section 6.1.1.1. fallback for outgoing
    ECN connections. In other words, this work adds a retry with a non-ECN
    setup SYN packet, as suggested by the RFC on the first timeout:

    [...] A host that receives no reply to an ECN-setup SYN within the
    normal SYN retransmission timeout interval MAY resend the SYN and
    any subsequent SYN retransmissions with CWR and ECE cleared. [...]

    Schematic client-side view when assuming the server is in tcp_ecn=2 mode,
    that is, Linux default since 2009 via commit 255cac91c3c9 ("tcp: extend
    ECN sysctl to allow server-side only ECN"):

    1) Normal ECN-capable path:

    SYN ECE CWR ----->

    2) Path with broken middlebox, when client has fallback:

    SYN ECE CWR ----X crappy middlebox drops packet
    (timeout, rtx)
    SYN ----->

    In case we would not have the fallback implemented, the middlebox drop
    point would basically end up as:

    SYN ECE CWR ----X crappy middlebox drops packet
    (timeout, rtx)
    SYN ECE CWR ----X crappy middlebox drops packet
    (timeout, rtx)
    SYN ECE CWR ----X crappy middlebox drops packet
    (timeout, rtx)

    In any case, it's rather a smaller percentage of sites where there would
    occur such additional setup latency: it was found in end of 2014 that ~56%
    of IPv4 and 65% of IPv6 servers of Alexa 1 million list would negotiate
    ECN (aka tcp_ecn=2 default), 0.42% of these webservers will fail to connect
    when trying to negotiate with ECN (tcp_ecn=1) due to timeouts, which the
    fallback would mitigate with a slight latency trade-off. Recent related
    paper on this topic:

    Brian Trammell, Mirja Kühlewind, Damiano Boppart, Iain Learmonth,
    Gorry Fairhurst, and Richard Scheffenegger:
    "Enabling Internet-Wide Deployment of Explicit Congestion Notification."
    Proc. PAM 2015, New York.
    http://ecn.ethz.ch/ecn-pam15.pdf

    Thus, when net.ipv4.tcp_ecn=1 is being set, the patch will perform RFC3168,
    section 6.1.1.1. fallback on timeout. For users who explicitly do not want
    this, which can be the case in DC deployments, we add a
    net.ipv4.tcp_ecn_fallback knob that allows for disabling the fallback.
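
    The client-side behaviour can be sketched as a function of the SYN
    attempt number. This is an illustration with a hypothetical function
    name; it models only the flags on outgoing SYNs, not the handling of
    a delayed SYN ACK.

```python
def syn_flags(attempt, tcp_ecn=1, tcp_ecn_fallback=1):
    """Return the TCP flags for a SYN; attempt 0 is the initial SYN."""
    flags = {"SYN"}
    requests_ecn = (tcp_ecn == 1)  # only tcp_ecn=1 requests ECN on outgoing SYNs
    fell_back = bool(tcp_ecn_fallback) and attempt >= 1
    if requests_ecn and not fell_back:
        flags |= {"ECE", "CWR"}    # ECN-setup SYN
    return flags
```

    With the fallback enabled, the first retransmitted SYN and all later
    ones carry no ECN flags. With it disabled, every retry remains an
    ECN-setup SYN.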

    tp->ecn_flags are not being cleared in tcp_ecn_clear_syn() on output, but
    rather we let tcp_ecn_rcv_synack() take that over on input path in case a
    SYN ACK ECE was delayed. Thus a spurious SYN retransmission will not prevent
    ECN being negotiated eventually in that case.

    Reference: https://www.ietf.org/proceedings/92/slides/slides-92-iccrg-1.pdf
    Reference: https://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Florian Westphal
    Signed-off-by: Mirja Kühlewind
    Signed-off-by: Brian Trammell
    Cc: Eric Dumazet
    Cc: Dave That
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

04 May, 2015

1 commit

  • This patch divides the IPv6 flow label space into two ranges:
    0-7ffff is reserved for flow label manager, 80000-fffff will be
    used for creating auto flow labels (per RFC6438). This only affects how
    labels are set on transmit, it does not affect receive. This range split
    can be disabled by sysctl.
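
    The range split amounts to forcing the top bit of the 20-bit label
    for automatically generated labels. A minimal sketch, with
    hypothetical names and an arbitrary integer standing in for the flow
    hash:

```python
FLOWLABEL_MASK = 0xFFFFF   # flow labels are 20 bits wide
STATELESS_FLAG = 0x80000   # top bit selects the 80000-fffff range

def auto_flowlabel(flow_hash, state_ranges=True):
    """Derive an automatic flow label from a flow hash."""
    label = flow_hash & FLOWLABEL_MASK
    if state_ranges:
        label |= STATELESS_FLAG  # keep auto labels out of the manager's range
    return label
```

    With the split disabled the label may fall anywhere in the 20-bit
    space.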

    Background:

    IPv6 flow labels have been an unmitigated disappointment thus far
    in the lifetime of IPv6. Support in HW devices to use them for ECMP
    is lacking, and OSes don't turn them on by default. If we had these
    we could get much better hashing in IPv6 networks without resorting
    to DPI, possibly eliminating some of the motivations to define new
    encaps in UDP just for getting ECMP.

    Unfortunately, the initial specifications of IPv6 did not clarify
    how they are to be used. There has always been a vague concept that
    these can be used for ECMP, flow hashing, etc. and we do now have a
    good standard for how to do this in RFC6438. The problem is that flow labels
    can be either stateful or stateless (as in RFC6438), and we are
    presented with the possibility that a stateless label may collide
    with a stateful one. Attempts to split the flow label space were
    rejected in IETF. When we added support in Linux for RFC6438, we
    could not turn on flow labels by default due to this conflict.

    This patch splits the flow label space and should give us
    a path to enabling auto flow labels by default for all IPv6 packets.
    This is an API change so we need to consider compatibility with
    existing deployment. The stateful range is chosen to be the lower
    values in hopes that most uses would have chosen small numbers.

    Once we resolve the stateless/stateful issue, we can proceed to
    look at enabling RFC6438 flow labels by default (starting with
    scaled testing).

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     

24 Mar, 2015

1 commit


21 Mar, 2015

1 commit


07 Mar, 2015

1 commit


08 Feb, 2015

1 commit

  • Helpers for mitigating ACK loops by rate-limiting dupacks sent in
    response to incoming out-of-window packets.

    This patch includes:

    - rate-limiting logic
    - sysctl to control how often we allow dupacks to out-of-window packets
    - SNMP counter for cases where we rate-limited our dupack sending

    The rate-limiting logic in this patch decides to not send dupacks in
    response to out-of-window segments if (a) they are SYNs or pure ACKs
    and (b) the remote endpoint is sending them faster than the configured
    rate limit.

    We rate-limit our responses rather than blocking them entirely or
    resetting the connection, because legitimate connections can rely on
    dupacks in response to some out-of-window segments. For example, zero
    window probes are typically sent with a sequence number that is below
    the current window, and ZWPs thus expect to elicit a dupack in
    response.

    We allow dupacks in response to TCP segments with data, because these
    may be spurious retransmissions for which the remote endpoint wants to
    receive DSACKs. This is safe because segments with data can't
    realistically be part of ACK loops, which by their nature consist of
    each side sending pure/data-less ACKs to each other.

    The dupack interval is controlled by a new sysctl knob,
    tcp_invalid_ratelimit, given in milliseconds, in case an administrator
    needs to dial this upward in the face of a high-rate DoS attack. The
    name and units are chosen to be analogous to the existing
    knob for ICMP, icmp_ratelimit.

    The default value for tcp_invalid_ratelimit is 500ms, which allows at
    most one such dupack per 500ms. This is chosen to be 2x faster than
    the 1-second minimum RTO interval allowed by RFC 6298 (section 2, rule
    2.4). We allow the extra 2x factor because network delay variations
    can cause packets sent at 1 second intervals to be compressed and
    arrive much closer.
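
    The rate-limiting rule can be sketched as a small class. This is a
    model for illustration, not the kernel helper: the class and
    attribute names are hypothetical, time is passed in explicitly in
    milliseconds, and the counter stands in for the SNMP counter.

```python
class DupackRateLimiter:
    """Allow at most one dupack per interval for data-less out-of-window segments."""

    def __init__(self, interval_ms=500):
        self.interval_ms = interval_ms
        self.last_sent_ms = None
        self.suppressed = 0  # stand-in for the SNMP counter

    def allow(self, now_ms, has_data=False):
        if has_data:
            return True  # segments with data are never rate-limited
        if (self.last_sent_ms is not None
                and now_ms - self.last_sent_ms < self.interval_ms):
            self.suppressed += 1
            return False
        self.last_sent_ms = now_ms
        return True
```

    Pure ACKs or SYNs arriving faster than the interval are answered only
    once per interval, which breaks an ACK loop without resetting the
    connection.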

    Reported-by: Avery Fay
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Neal Cardwell
     

26 Jan, 2015

1 commit

  • The kernel forcefully applies MTU values received in router
    advertisements provided the new MTU is less than the current. This
    behavior is undesirable when the user space is managing the MTU. Instead
    a sysctl flag 'accept_ra_mtu' is introduced such that the user space
    can control whether or not RA provided MTU updates should be applied. The
    default behavior is unchanged; user space must explicitly set this flag
    to 0 for RA MTUs to be ignored.

    Signed-off-by: Harout Hedeshian
    Signed-off-by: David S. Miller

    Harout Hedeshian
     

13 Jan, 2015

1 commit


07 Nov, 2014

1 commit


06 Nov, 2014

1 commit


30 Oct, 2014

1 commit

  • Add a sysctl that causes an interface's optimistic addresses
    to be considered equivalent to other non-deprecated addresses
    for source address selection purposes. Preferred addresses
    will still take precedence over optimistic addresses, subject
    to other ranking in the source address selection algorithm.

    This is useful where different interfaces are connected to
    different networks from different ISPs (e.g., a cell network
    and a home wifi network).

    The current behaviour complies with RFC 3484/6724, and it
    makes sense if the host has only one interface, or has
    multiple interfaces on the same network (same or cooperating
    administrative domain(s)), but not in the multiple distinct
    networks case.

    For example, if a mobile device has an IPv6 address on an LTE
    network and then connects to IPv6-enabled wifi, while the wifi
    IPv6 address is undergoing DAD, IPv6 connections will try to use
    the wifi default route with the LTE IPv6 address, and will get
    stuck until they time out.

    Also, because optimistic nodes can receive frames, issue
    an RTM_NEWADDR as soon as DAD starts (with the IFA_F_OPTIMISTIC
    flag appropriately set). A second RTM_NEWADDR is sent if DAD
    completes (the address flags have changed), otherwise an
    RTM_DELADDR is sent.

    Also: add an entry in ip-sysctl.txt for optimistic_dad.

    Signed-off-by: Erik Kline
    Acked-by: Lorenzo Colitti
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Erik Kline