29 Nov, 2016

1 commit


21 Sep, 2016

1 commit

  • This commit implements a new TCP congestion control algorithm: BBR
    (Bottleneck Bandwidth and RTT). A detailed description of BBR will be
    published in ACM Queue, Vol. 14 No. 5, September-October 2016, as
    "BBR: Congestion-Based Congestion Control".

    BBR has significantly increased throughput and reduced latency for
    connections on Google's internal backbone networks and google.com and
    YouTube Web servers.

    BBR requires only changes on the sender side, not in the network or
    the receiver side. Thus it can be incrementally deployed on today's
    Internet, or in datacenters.

    The Internet has predominantly used loss-based congestion control
    (largely Reno or CUBIC) since the 1980s, relying on packet loss as the
    signal to slow down. While this worked well for many years, loss-based
    congestion control is unfortunately outdated in today's networks. On
    today's Internet, loss-based congestion control causes the infamous
    bufferbloat problem, often causing seconds of needless queuing delay,
    since it fills the bloated buffers in many last-mile links. On today's
    high-speed long-haul links using commodity switches with shallow
    buffers, loss-based congestion control has abysmal throughput because
    it over-reacts to losses caused by transient traffic bursts.

    In 1981 Kleinrock and Gale showed that the optimal operating point for
    a network maximizes delivered bandwidth while minimizing delay and
    loss, not only for single connections but for the network as a
    whole. Finding that optimal operating point has been elusive, since
    any single network measurement is ambiguous: network measurements are
    the result of both bandwidth and propagation delay, and those two
    cannot be measured simultaneously.

    While it is impossible to disambiguate any single bandwidth or RTT
    measurement, a connection's behavior over time tells a clearer
    story. BBR uses a measurement strategy designed to resolve this
    ambiguity. It combines these measurements with a robust servo loop
    using recent control systems advances to implement a distributed
    congestion control algorithm that reacts to actual congestion, not
    packet loss or transient queue delay, and is designed to converge with
    high probability to a point near the optimal operating point.

    In a nutshell, BBR creates an explicit model of the network pipe by
    sequentially probing the bottleneck bandwidth and RTT. On the arrival
    of each ACK, BBR derives the current delivery rate of the last round
    trip, and feeds it through a windowed max-filter to estimate the
    bottleneck bandwidth. Conversely it uses a windowed min-filter to
    estimate the round trip propagation delay. The max-filtered bandwidth
    and min-filtered RTT estimates form BBR's model of the network pipe.
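
    The two filters described above can be sketched as follows. This is
    a minimal illustration, not the kernel implementation: the class
    name, the on_ack() helper, and the window lengths (10 round trips
    for bandwidth, 10 seconds for RTT) are assumptions made for the
    example.

```python
from collections import deque


class WindowedFilter:
    """Track the best sample seen within the last `window` ticks."""

    def __init__(self, window, better):
        self.window = window    # window length, in caller-defined ticks
        self.better = better    # better(a, b) is True if a beats b
        self.samples = deque()  # (tick, value) pairs, best value at the front

    def update(self, tick, value):
        # Discard older samples that the new sample dominates.
        while self.samples and not self.better(self.samples[-1][1], value):
            self.samples.pop()
        self.samples.append((tick, value))
        # Expire samples that have fallen out of the window.
        while self.samples[0][0] <= tick - self.window:
            self.samples.popleft()
        return self.samples[0][1]


# Assumed window lengths: 10 round trips and 10,000 ms.
bw_filter = WindowedFilter(10, lambda a, b: a > b)        # windowed max
rtt_filter = WindowedFilter(10_000, lambda a, b: a < b)   # windowed min


def on_ack(round_no, now_ms, delivered_bytes, interval_s, rtt_ms):
    """Feed one ACK's measurements into the model; return (btl_bw, min_rtt)."""
    delivery_rate = delivered_bytes / interval_s  # bytes per second
    btl_bw = bw_filter.update(round_no, delivery_rate)
    min_rtt = rtt_filter.update(now_ms, rtt_ms)
    return btl_bw, min_rtt
```

    A monotonic deque keeps each update cheap: a sample is dropped as
    soon as a newer sample makes it irrelevant to the window's max (or
    min).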

    Using its model, BBR sets control parameters to govern sending
    behavior. The primary control is the pacing rate: BBR applies a gain
    multiplier to transmit faster or slower than the observed bottleneck
    bandwidth. The conventional congestion window (cwnd) is now the
    secondary control; the cwnd is set to a small multiple of the
    estimated BDP (bandwidth-delay product) in order to allow full
    utilization and bandwidth probing while bounding the potential amount
    of queue at the bottleneck.
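
    As a rough sketch of how the two controls follow from the model:
    the function name, the cwnd gain of 2, the 4-packet floor, and the
    1448-byte MSS below are illustrative assumptions, not values taken
    from this commit.

```python
import math


def control_parameters(btl_bw, min_rtt_s, pacing_gain,
                       cwnd_gain=2.0, mss=1448, min_cwnd=4):
    """Derive (pacing_rate, cwnd) from the model of the network pipe.

    btl_bw is in bytes per second, min_rtt_s in seconds.
    """
    # Primary control: pace at a gain-scaled multiple of the bottleneck rate.
    pacing_rate = pacing_gain * btl_bw
    # Secondary control: cap data in flight at a small multiple of the BDP.
    bdp_bytes = btl_bw * min_rtt_s
    cwnd_packets = max(min_cwnd, math.ceil(cwnd_gain * bdp_bytes / mss))
    return pacing_rate, cwnd_packets


# Example: 10 Mbps bottleneck (1,250,000 bytes/s), 40 ms RTT, probing gain.
rate, cwnd = control_parameters(1_250_000, 0.040, pacing_gain=1.25)
```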

    When a BBR connection starts, it enters STARTUP mode and applies a
    high gain to perform an exponential search to quickly probe the
    bottleneck bandwidth (doubling its sending rate each round trip, like
    slow start). However, instead of continuing until it fills up the
    buffer (i.e. a loss), or until delay or ACK spacing reaches some
    threshold (like Hystart), it uses its model of the pipe to estimate
    when that pipe is full: it estimates the pipe is full when it notices
    the estimated bandwidth has stopped growing. At that point it exits
    STARTUP and enters DRAIN mode, where it reduces its pacing rate to
    drain the queue it estimates it has created.
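
    The "bandwidth has stopped growing" test can be sketched as a
    plateau detector. The growth threshold (25%) and the number of
    rounds to wait (3) are assumptions chosen for illustration; this
    commit message does not state the exact values.

```python
def pipe_full_round(bw_per_round, growth=1.25, patience=3):
    """Return the index of the round where the pipe is judged full, or None.

    bw_per_round holds the estimated bottleneck bandwidth after each round
    trip. The pipe is judged full once the estimate has failed to grow by
    the `growth` factor for `patience` consecutive rounds.
    """
    best = 0.0
    stalled = 0
    for rnd, bw in enumerate(bw_per_round):
        if bw >= best * growth:
            best = bw       # still growing: remember the new baseline
            stalled = 0
        else:
            stalled += 1    # no significant growth this round
            if stalled >= patience:
                return rnd
    return None
```

    With estimates of 1, 2, 4, 8, 9, 9, 9 the detector fires on the
    third round that fails to grow by 25%, at which point a sender
    would leave STARTUP for DRAIN.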

    Then BBR enters steady state. In steady state, PROBE_BW mode cycles
    between first pacing faster to probe for more bandwidth, then pacing
    slower to drain any queue that was created if no more bandwidth was
    available, and then cruising at the estimated bandwidth to utilize the
    pipe without creating excess queue. Occasionally, on an as-needed
    basis, it sends significantly slower to probe for RTT (PROBE_RTT
    mode).
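
    The PROBE_BW cycle can be pictured as a repeating sequence of
    pacing gains. The specific gains below (1.25, then 0.75, then six
    phases at 1.0) are an assumption for illustration and are not
    stated in this commit message.

```python
from itertools import cycle, islice

# Assumed gain cycle: probe faster, drain, then cruise for six phases.
PROBE_BW_GAINS = (1.25, 0.75, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0)


def pacing_schedule(btl_bw, phases):
    """Return the pacing rate used in each of the next `phases` phases."""
    return [gain * btl_bw for gain in islice(cycle(PROBE_BW_GAINS), phases)]


# The gains average to 1.0, so a full cycle sends at the estimated
# bottleneck bandwidth on average.
average_gain = sum(PROBE_BW_GAINS) / len(PROBE_BW_GAINS)
```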

    BBR has been fully deployed on Google's wide-area backbone networks
    and we're experimenting with BBR on Google.com and YouTube on a global
    scale. Replacing CUBIC with BBR has resulted in significant
    improvements in network latency and application (RPC, browser, and
    video) metrics. For more details please refer to our upcoming ACM
    Queue publication.

    Example performance results, to illustrate the difference between BBR
    and CUBIC:

    Resilience to random loss (e.g. from shallow buffers):
    Consider a netperf TCP_STREAM test lasting 30 secs on an emulated
    path with a 10Gbps bottleneck, 100ms RTT, and 1% packet loss
    rate. CUBIC gets 3.27 Mbps, and BBR gets 9150 Mbps (2798x higher).

    Low latency with the bloated buffers common in today's last-mile links:
    Consider a netperf TCP_STREAM test lasting 120 secs on an emulated
    path with a 10Mbps bottleneck, 40ms RTT, and 1000-packet bottleneck
    buffer. Both fully utilize the bottleneck bandwidth, but BBR
    achieves this with a median RTT 25x lower (43 ms instead of 1.09
    secs).
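
    The quoted ratios can be re-derived from the reported numbers. The
    last line also estimates the queuing delay of a full 1000-packet
    buffer, assuming 1500-byte packets (the packet size is not given
    above).

```python
# Random-loss test: BBR vs. CUBIC throughput, both in Mbps.
throughput_ratio = 9150 / 3.27

# Bloated-buffer test: CUBIC vs. BBR median RTT, both in seconds.
rtt_ratio = 1.09 / 0.043

# Time to drain a full 1000-packet buffer at 10 Mbps,
# assuming 1500-byte packets.
full_buffer_delay_s = 1000 * 1500 * 8 / 10e6

print(f"throughput ratio: {throughput_ratio:.0f}x")
print(f"RTT ratio: {rtt_ratio:.1f}x")
print(f"full-buffer queuing delay: {full_buffer_delay_s:.2f} s")
```

    Under that packet-size assumption a full buffer adds about 1.2 secs
    of delay, which is consistent with CUBIC's 1.09 sec median RTT
    coming almost entirely from queuing.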

    Our long-term goal is to improve the congestion control algorithms
    used on the Internet. We are hopeful that BBR can help advance the
    efforts toward this goal, and motivate the community to do further
    research.

    Test results, performance evaluations, feedback, and BBR-related
    discussions are very welcome in the public e-mail list for BBR:

    https://groups.google.com/forum/#!forum/bbr-dev

    NOTE: BBR *must* be used with the fq qdisc ("man tc-fq") with pacing
    enabled, since pacing is integral to the BBR design and
    implementation. BBR without pacing would not function properly, and
    may incur unnecessarily high packet loss rates.

    Signed-off-by: Van Jacobson
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Nandita Dukkipati
    Signed-off-by: Eric Dumazet
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Neal Cardwell
     

11 Jun, 2016

1 commit

  • TCP-NV (New Vegas) is a major update to TCP-Vegas.
    An earlier version of NV was presented at 2010's LPC.
    It is a delay-based congestion avoidance algorithm for the
    data center. This version has been tested within a
    10G rack where the HW RTTs are 20-50us and with
    1 to 400 flows.

    A description of TCP-NV, including implementation
    details as well as experimental results, can be found at:
    http://www.brakmo.org/networking/tcp-nv/TCPNV.html

    Signed-off-by: Lawrence Brakmo
    Signed-off-by: David S. Miller

    Lawrence Brakmo
     

18 Feb, 2016

1 commit


17 Feb, 2016

1 commit

  • The current ip_tunnel cache implementation is prone to a race
    that will cause the wrong dst to be cached on a concurrent dst cache
    miss and ip tunnel update via netlink.

    Replacing it with the generic implementation fixes the issue.

    Signed-off-by: Paolo Abeni
    Suggested-and-acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Paolo Abeni
     

26 Jan, 2016

1 commit

  • The ESP algorithms using CBC mode require echainiv. Hence INET*_ESP have
    to select CRYPTO_ECHAINIV in order to work properly. This solves the
    issues caused by a misconfiguration as described in [1].
    The original approach, patching crypto/Kconfig was turned down by
    Herbert Xu [2].

    [1] https://lists.strongswan.org/pipermail/users/2015-December/009074.html
    [2] http://marc.info/?l=linux-crypto-vger&m=145224655809562&w=2

    Signed-off-by: Thomas Egerer
    Acked-by: Herbert Xu
    Signed-off-by: David S. Miller

    Thomas Egerer
     

16 Dec, 2015

1 commit

  • This implements SOCK_DESTROY for TCP sockets. It causes all
    blocking calls on the socket to fail fast with ECONNABORTED and
    causes a protocol close of the socket. It informs the other end
    of the connection by sending a RST, i.e., initiating a TCP ABORT
    as per RFC 793. ECONNABORTED was chosen for consistency with
    FreeBSD.

    Signed-off-by: Lorenzo Colitti
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Lorenzo Colitti
     

28 Aug, 2015

1 commit

  • The geneve_core module handles send and receive functionality so
    that OVS could use the Geneve API. Now, with the use of tunnel
    metadata mode, OVS can use the Geneve netdevice directly, so there
    is no need for a separate Geneve module. The following patch
    consolidates Geneve protocol processing in a single module.

    Signed-off-by: Pravin B Shelar
    Reviewed-by: Jesse Gross
    Acked-by: John W. Linville
    Signed-off-by: David S. Miller

    Pravin B Shelar
     

11 Jun, 2015

1 commit

  • CAIA Delay-Gradient (CDG) is a TCP congestion control that modifies
    the TCP sender in order to [1]:

    o Use the delay gradient as a congestion signal.
    o Back off with an average probability that is independent of the RTT.
    o Coexist with flows that use loss-based congestion control, i.e.,
    flows that are unresponsive to the delay signal.
    o Tolerate packet loss unrelated to congestion. (Disabled by default.)
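
    The first two points can be sketched as a probabilistic backoff
    driven by the change in measured RTT. This sketch assumes an
    exponential backoff probability of the form 1 - exp(-g / G); the
    function names and the scaling parameter value G = 3 are
    illustrative assumptions, not details taken from this patch.

```python
import math
import random


def delay_gradient(prev_rtt_min_ms, rtt_min_ms):
    """Change in minimum RTT between consecutive measurement intervals."""
    return rtt_min_ms - prev_rtt_min_ms


def backoff_probability(gradient_ms, scale_g=3.0):
    """Probability of backing off for a given delay gradient.

    A non-positive gradient (delay flat or falling) never triggers backoff.
    """
    if gradient_ms <= 0:
        return 0.0
    return 1.0 - math.exp(-gradient_ms / scale_g)


def should_back_off(gradient_ms, rng=random.random, scale_g=3.0):
    """Randomized backoff decision; rng is injectable for testing."""
    return rng() < backoff_probability(gradient_ms, scale_g)
```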

    Its FreeBSD implementation was presented for the ICCRG in July 2012;
    slides are available at http://www.ietf.org/proceedings/84/iccrg.html

    Running the experiment scenarios in [1] suggests that our implementation
    achieves more goodput compared with FreeBSD 10.0 senders, although it also
    causes more queueing delay for a given backoff factor.

    The loss tolerance heuristic is disabled by default due to safety concerns
    for its use in the Internet [2, p. 45-46].

    We use a variant of the Hybrid Slow start algorithm in tcp_cubic to reduce
    the probability of slow start overshoot.

    [1] D.A. Hayes and G. Armitage. "Revisiting TCP congestion control using
    delay gradients." In Networking 2011, pages 328-341. Springer, 2011.
    [2] K.K. Jonassen. "Implementing CAIA Delay-Gradient in Linux."
    MSc thesis. Department of Informatics, University of Oslo, 2015.

    Cc: Eric Dumazet
    Cc: Yuchung Cheng
    Cc: Stephen Hemminger
    Cc: Neal Cardwell
    Cc: David Hayes
    Cc: Andreas Petlund
    Cc: Dave Taht
    Cc: Nicolas Kuhn
    Signed-off-by: Kenneth Klette Jonassen
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Kenneth Klette Jonassen
     

14 May, 2015

1 commit


06 Nov, 2014

1 commit

  • Move fou_build_header out of ip_tunnel.c and into fou.c splitting
    it up into fou_build_header, gue_build_header, and fou_build_udp.
    This allows for other users for TX of FOU or GUE. Change ip_tunnel_encap
    to call fou_build_header or gue_build_header based on the tunnel
    encapsulation type. Similarly, added fou_encap_hlen and gue_encap_hlen
    functions which are called by ip_encap_hlen. New net/fou.h has
    prototypes and defines for this.

    Added NET_FOU_IP_TUNNELS configuration. When this is set, IP tunnels
    can use FOU/GUE and fou module is also selected.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     

07 Oct, 2014

1 commit

  • Fix an openvswitch compilation error when CONFIG_INET is not set:

    =====================================================
       In file included from include/net/geneve.h:4:0,
                        from net/openvswitch/flow_netlink.c:45:
       include/net/udp_tunnel.h: In function 'udp_tunnel_handle_offloads':
    >> include/net/udp_tunnel.h:100:2: error: implicit declaration of function 'iptunnel_handle_offloads' [-Werror=implicit-function-declaration]
         return iptunnel_handle_offloads(skb, udp_csum, type);
         ^
    >> include/net/udp_tunnel.h:100:2: warning: return makes pointer from integer without a cast
       cc1: some warnings being treated as errors

    =====================================================

    Reported-by: kbuild test robot
    Signed-off-by: Andy Zhou
    Signed-off-by: David S. Miller

    Andy Zhou
     

06 Oct, 2014

1 commit

  • This adds a device level support for Geneve -- Generic Network
    Virtualization Encapsulation. The protocol is documented at
    http://tools.ietf.org/html/draft-gross-geneve-01

    Only protocol layer Geneve support is provided by this driver.
    Openvswitch can be used for configuring, set up and tear down
    functional Geneve tunnels.

    Signed-off-by: Jesse Gross
    Signed-off-by: Andy Zhou
    Signed-off-by: David S. Miller

    Andy Zhou
     

29 Sep, 2014

1 commit

  • This work adds the DataCenter TCP (DCTCP) congestion control
    algorithm [1], which was first published at SIGCOMM 2010 [2],
    with a follow-up analysis at SIGMETRICS 2011 [3] (and, more
    recently, as an informational IETF draft available at [4]).

    DCTCP is an enhancement to the TCP congestion control algorithm for
    data center networks. Typical data center workloads include
    i) partition/aggregate (queries; bursty, delay sensitive), ii) short
    messages e.g. 50KB-1MB (for coordination and control state; delay
    sensitive), and iii) large flows e.g. 1MB-100MB (data update;
    throughput sensitive). DCTCP has therefore been designed for such
    environments to provide/achieve the following three requirements:

    * High burst tolerance (incast due to partition/aggregate)
    * Low latency (short flows, queries)
    * High throughput (continuous data updates, large file
    transfers) with commodity, shallow buffered switches

    The basic idea of its design consists of two fundamentals: i) on the
    switch side, packets are being marked when its internal queue
    length > threshold K (K is chosen so that a large enough headroom
    for marked traffic is still available in the switch queue); ii) the
    sender/host side maintains a moving average of the fraction of marked
    packets, so each RTT, F is being updated as follows:

    F := X / Y, where X is # of marked ACKs, Y is total # of ACKs
    alpha := (1 - g) * alpha + g * F, where g is a smoothing constant

    The resulting alpha (iow: probability that switch queue is congested)
    is then being used in order to adaptively decrease the congestion
    window W:

    W := (1 - (alpha / 2)) * W
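
    The two update rules above translate almost directly into code. A
    minimal sketch follows; the function names and the one-packet floor
    on the window are assumptions for illustration, and g = 1/16
    matches the fixed constant mentioned later in this message.

```python
def update_alpha(alpha, marked_acks, total_acks, g=1 / 16):
    """Once per RTT: fold the fraction of marked ACKs into the moving average."""
    f = marked_acks / total_acks if total_acks else 0.0
    return (1 - g) * alpha + g * f


def reduce_window(cwnd, alpha):
    """Shrink the congestion window in proportion to the congestion estimate."""
    return max(1, int(cwnd * (1 - alpha / 2)))


# With every ACK marked for one RTT, alpha moves from 0 to g.
alpha = update_alpha(0.0, marked_acks=10, total_acks=10)
```

    With alpha near 0 the window is barely reduced, while alpha = 1
    halves it, which matches the reaction of conventional TCP to a
    congestion signal.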

    DCTCP uses ECN both for marking packets on the switch side and for
    conveying those marks back to the sender.

    RFC3168 describes a mechanism for using Explicit Congestion Notification
    from the switch for early detection of congestion, rather than waiting
    for segment loss to occur.

    However, this method only detects the presence of congestion, not
    the *extent*. In the presence of mild congestion, it reduces the TCP
    congestion window too aggressively and unnecessarily affects the
    throughput of long flows [4].

    DCTCP, as mentioned, enhances Explicit Congestion Notification (ECN)
    processing to estimate the fraction of bytes that encounter congestion,
    rather than simply detecting that some congestion has occurred. DCTCP
    then scales the TCP congestion window based on this estimate [4].
    It can thus derive multibit feedback from the information present in
    the single-bit sequence of marks in its control law, and so act in
    *proportion* to the extent of congestion, not its *presence*.

    Switches therefore set the Congestion Experienced (CE) codepoint in
    packets when internal queue lengths exceed threshold K. As a result,
    DCTCP delivers the same or better throughput than normal TCP, while
    using 90% less buffer space.

    It was found in [2] that DCTCP enables the applications to handle 10x
    the current background traffic, without impacting foreground traffic.
    Moreover, a 10x increase in foreground traffic did not cause any
    timeouts, and thus largely eliminates TCP incast collapse problems.

    The algorithm itself has already seen deployments in large production
    data centers since then.

    We did a long-term stress test and analysis in a data center. A short
    summary of our TCP incast tests with iperf, compared to CUBIC, follows:

    This test measured DCTCP throughput and latency and compared it with
    CUBIC throughput and latency for an incast scenario. In this test, 19
    senders sent at maximum rate to a single receiver. The receiver simply
    ran iperf -s.

    The senders ran iperf -c -t 30. All senders started
    simultaneously (using local clocks synchronized by ntp).

    This test was repeated multiple times. The results from a single test
    are shown below. Other tests are similar. (DCTCP results were extremely
    consistent, CUBIC results show some variance induced by the TCP timeouts
    that CUBIC encountered.)

    For this test, we report statistics on the number of TCP timeouts,
    flow throughput, and traffic latency.

    1) Timeouts (total over all flows, and per flow summaries):

                CUBIC      DCTCP
    Total       3227       25
    Mean        169.842    1.316
    Median      183        1
    Max         207        5
    Min         123        0
    Stddev      28.991     1.600

    Timeout data is taken by measuring the net change in netstat -s
    "other TCP timeouts" reported. As a result, the timeout measurements
    above are not restricted to the test traffic, and we believe that it
    is likely that all of the "DCTCP timeouts" are actually timeouts for
    non-test traffic. We report them nevertheless. CUBIC will also include
    some non-test timeouts, but they are dwarfed by bona fide test traffic
    timeouts for CUBIC. Clearly DCTCP does an excellent job of preventing
    TCP timeouts. DCTCP reduces timeouts by at least two orders of
    magnitude and may well have eliminated them in this scenario.

    2) Throughput (per flow in Mbps):

                CUBIC      DCTCP
    Mean        521.684    521.895
    Median      464        523
    Max         776        527
    Min         403        519
    Stddev      105.891    2.601
    Fairness    0.962      0.999

    Throughput data was simply the average throughput for each flow
    reported by iperf. By avoiding TCP timeouts, DCTCP is able to
    achieve much better per-flow results. In CUBIC, many flows
    experience TCP timeouts which makes flow throughput unpredictable and
    unfair. DCTCP, on the other hand, provides very clean predictable
    throughput without incurring TCP timeouts. Thus, the standard deviation
    of CUBIC throughput is dramatically higher than the standard deviation
    of DCTCP throughput.

    Mean throughput is nearly identical because even though CUBIC flows
    suffer TCP timeouts, other flows will step in and fill the unused
    bandwidth. Note that this test is something of a best case scenario
    for incast under CUBIC: it allows other flows to fill in for flows
    experiencing a timeout. Under situations where the receiver is issuing
    requests and then waiting for all flows to complete, flows cannot fill
    in for timed out flows and throughput will drop dramatically.

    3) Latency (in ms):

                CUBIC      DCTCP
    Mean        4.0088     0.04219
    Median      4.055      0.0395
    Max         4.2        0.085
    Min         3.32       0.028
    Stddev      0.1666     0.01064

    Latency for each protocol was computed by running "ping -i 0.2
    " from a single sender to the receiver during the incast
    test. For DCTCP, "ping -Q 0x6 -i 0.2 " was used to ensure
    that traffic traversed the DCTCP queue and was not dropped when the
    queue size was greater than the marking threshold. The summary
    statistics above are over all ping metrics measured between the single
    sender, receiver pair.

    The latency results for this test show a dramatic difference between
    CUBIC and DCTCP. CUBIC intentionally overflows the switch buffer
    which incurs the maximum queue latency (more buffer memory will lead
    to higher latency). DCTCP, on the other hand, deliberately attempts to
    keep queue occupancy low. The result is a two orders of magnitude
    reduction of latency with DCTCP - even with a switch with relatively
    little RAM. Switches with larger amounts of RAM will incur increasing
    amounts of latency for CUBIC, but not for DCTCP.

    4) Convergence and stability test:

    This test measured the time that DCTCP took to fairly redistribute
    bandwidth when a new flow commences. It also measured DCTCP's ability
    to remain stable at a fair bandwidth distribution. DCTCP is compared
    with CUBIC for this test.

    At the commencement of this test, a single flow is sending at maximum
    rate (near 10 Gbps) to a single receiver. One second after that first
    flow commences, a new flow from a distinct server begins sending to
    the same receiver as the first flow. After the second flow has sent
    data for 10 seconds, the second flow is terminated. The first flow
    sends for an additional second. Ideally, the bandwidth would be evenly
    shared as soon as the second flow starts, and recover as soon as it
    stops.

    The results of this test are shown below. Note that the flow bandwidth
    for the two flows was measured near the same time, but not
    simultaneously.

    DCTCP performs nearly perfectly within the measurement limitations
    of this test: bandwidth is quickly distributed fairly between the two
    flows, remains stable throughout the duration of the test, and
    recovers quickly. CUBIC, in contrast, is slow to divide the bandwidth
    fairly, and has trouble remaining stable.

               CUBIC                           DCTCP

    Seconds   Flow 1   Flow 2      Seconds   Flow 1   Flow 2
    0         9.93     0           0         9.92     0
    0.5       9.87     0           0.5       9.86     0
    1         8.73     2.25        1         6.46     4.88
    1.5       7.29     2.8         1.5       4.9      4.99
    2         6.96     3.1         2         4.92     4.94
    2.5       6.67     3.34        2.5       4.93     5
    3         6.39     3.57        3         4.92     4.99
    3.5       6.24     3.75        3.5       4.94     4.74
    4         6        3.94        4         5.34     4.71
    4.5       5.88     4.09        4.5       4.99     4.97
    5         5.27     4.98        5         4.83     5.01
    5.5       4.93     5.04        5.5       4.89     4.99
    6         4.9      4.99        6         4.92     5.04
    6.5       4.93     5.1         6.5       4.91     4.97
    7         4.28     5.8         7         4.97     4.97
    7.5       4.62     4.91        7.5       4.99     4.82
    8         5.05     4.45        8         5.16     4.76
    8.5       5.93     4.09        8.5       4.94     4.98
    9         5.73     4.2         9         4.92     5.02
    9.5       5.62     4.32        9.5       4.87     5.03
    10        6.12     3.2         10        4.91     5.01
    10.5      6.91     3.11        10.5      4.87     5.04
    11        8.48     0           11        8.49     4.94
    11.5      9.87     0           11.5      9.9      0

    SYN/ACK ECT test:

    This test demonstrates the importance of ECT on SYN and SYN-ACK packets
    by measuring the connection probability in the presence of competing
    flows for a DCTCP connection attempt *without* ECT in the SYN packet.
    The test was repeated five times for each number of competing flows.

    Competing Flows               |  1 |    2 |    4 |    8 | 16
    ------------------------------------------------------------
    Mean Connection Probability   |  1 | 0.67 | 0.45 | 0.28 |  0
    Median Connection Probability |  1 | 0.65 | 0.45 | 0.25 |  0

    As the number of competing flows moves beyond 1, the connection
    probability drops rapidly.

    Enabling DCTCP with this patch requires the following steps:

    DCTCP must be running both on the sender and receiver side in your
    data center, i.e.:

    sysctl -w net.ipv4.tcp_congestion_control=dctcp

    Also, ECN functionality must be enabled on all switches in your
    data center for DCTCP to work. The default ECN marking threshold (K)
    heuristic on the switch for DCTCP is e.g., 20 packets (30KB) at
    1Gbps, and 65 packets (~100KB) at 10Gbps (K > 1/7 * C * RTT, [4]).
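
    The bound K > 1/7 * C * RTT can be evaluated for a given link. A
    small sketch follows; the function name, the 100 us RTT, and the
    1500-byte packet size are assumed example inputs, not measurements
    from the tests above.

```python
def min_marking_threshold_packets(link_bps, rtt_s, packet_bytes=1500):
    """Lower bound on the ECN marking threshold K, in packets.

    Implements K > C * RTT / 7, with the bandwidth-delay product
    converted from bits to packets.
    """
    bdp_packets = link_bps * rtt_s / (8 * packet_bytes)
    return bdp_packets / 7


# Example: a 10 Gbps link with an assumed 100 microsecond RTT.
k_lower_bound = min_marking_threshold_packets(10e9, 100e-6)
```

    The bound scales linearly with both link speed and RTT, so a
    threshold suited to 1Gbps will generally be too low at 10Gbps for
    the same RTT.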

    In above tests, for each switch port, traffic was segregated into two
    queues. For any packet with a DSCP of 0x01 - or equivalently a TOS of
    0x04 - the packet was placed into the DCTCP queue. All other packets
    were placed into the default drop-tail queue. For the DCTCP queue,
    RED/ECN marking was enabled, here, with a marking threshold of 75 KB.
    For more details, we refer you to section 3 of the paper [2].

    There are no code changes required to applications running in user
    space. DCTCP has been implemented in full *isolation* of the rest of
    the TCP code as its own congestion control module, so that it can run
    without a need to expose code to the core of the TCP stack, and thus
    nothing changes for non-DCTCP users.

    Changes in the CA framework code are minimal, and the DCTCP algorithm
    operates on mechanisms that are already available in most silicon.
    The gain (dctcp_shift_g) is currently a fixed constant (1/16) from
    the paper, but we leave the option that it can be chosen carefully
    to a different value by the user.

    In case DCTCP is being used and ECN support on the peer side is off,
    DCTCP falls back after 3WHS to operate in normal TCP Reno mode.

    ss {-4,-6} -t -i diag interface:

    ... dctcp wscale:7,7 rto:203 rtt:2.349/0.026 mss:1448 cwnd:2054
    ssthresh:1102 ce_state 0 alpha 15 ab_ecn 0 ab_tot 735584
    send 10129.2Mbps pacing_rate 20254.1Mbps unacked:1822 retrans:0/15
    reordering:101 rcv_space:29200

    ... dctcp-reno wscale:7,7 rto:201 rtt:0.711/1.327 ato:40 mss:1448
    cwnd:10 ssthresh:1102 fallback_mode send 162.9Mbps pacing_rate
    325.5Mbps rcv_rtt:1.5 rcv_space:29200

    More information about DCTCP can be found in [1-4].

    [1] http://simula.stanford.edu/~alizade/Site/DCTCP.html
    [2] http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp-final.pdf
    [3] http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp_analysis-full.pdf
    [4] http://tools.ietf.org/html/draft-bensley-tcpm-dctcp-00

    Joint work with Florian Westphal and Glenn Judd.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: Florian Westphal
    Signed-off-by: Glenn Judd
    Acked-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

20 Sep, 2014

1 commit

  • This patch provides a receive path for foo-over-udp. This allows
    direct encapsulation of IP protocols over UDP. The bound destination
    port is used to map to an IP protocol, and the XFRM framework
    (udp_encap_rcv) is used to receive encapsulated packets. Upon
    reception, the encapsulation header is logically removed (pointer
    to transport header is advanced) and the packet is reinjected into
    the receive path with the IP protocol indicated by the mapping.
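
    The receive logic described above can be modelled in user space as
    follows. This is only a sketch of the port-to-protocol mapping and
    header removal, not the kernel code path: the port numbers in
    FOU_PORTS and the fou_receive() name are hypothetical.

```python
import struct

UDP_HEADER_LEN = 8

# Hypothetical configuration: bound UDP port -> IP protocol number.
FOU_PORTS = {
    5555: 47,  # GRE
    5556: 4,   # IPIP
    5557: 41,  # IPv6-in-IPv4 (SIT)
}


def fou_receive(udp_datagram):
    """Return (ip_protocol, inner_packet), or None if not a FOU datagram."""
    if len(udp_datagram) < UDP_HEADER_LEN:
        return None
    _src, dst, length, _csum = struct.unpack(
        "!HHHH", udp_datagram[:UDP_HEADER_LEN])
    # The UDP length field covers the header plus the payload.
    if length < UDP_HEADER_LEN or length > len(udp_datagram):
        return None
    proto = FOU_PORTS.get(dst)
    if proto is None:
        return None
    # "Remove" the encapsulation header by skipping past it.
    return proto, udp_datagram[UDP_HEADER_LEN:length]
```

    The caller would then hand the inner packet to the handler for the
    returned protocol number, which corresponds to reinjecting the
    packet into the receive path.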

    Netlink is used to configure FOU ports. The configuration information
    includes the port number to bind to and the IP protocol corresponding
    to that port.

    This should support GRE/UDP
    (http://tools.ietf.org/html/draft-yong-tsvwg-gre-in-udp-encap-02),
    as well as the other IP tunneling protocols (IPIP, SIT).

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     

15 Jul, 2014

1 commit


04 Sep, 2013

1 commit

  • This config option is superfluous in that it only guards a call
    to neigh_app_ns(). Enabling CONFIG_ARPD by default has no
    change in behavior. There will now be a call to __neigh_notify()
    for each ARP resolution, which has no impact unless there is a
    user space daemon waiting to receive the notification, i.e.,
    the case for which CONFIG_ARPD was designed anyways.

    Suggested-by: Eric W. Biederman
    Cc: "David S. Miller"
    Cc: Alexey Kuznetsov
    Cc: James Morris
    Cc: Hideaki YOSHIFUJI
    Cc: Patrick McHardy
    Cc: "Eric W. Biederman"
    Cc: Gao feng
    Cc: Joe Perches
    Cc: Veaceslav Falico
    Signed-off-by: Tim Gardner
    Reviewed-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Tim Gardner
     

05 Jun, 2013

1 commit

  • Commit 202dc3fc599c1dded235d3b448d9ca924252e354 (Documentation: remove
    obsolete networking/multicast.txt file) deleted the obsolete file. After
    the file has been removed, clean up a couple of places where references
    to the deleted file were made so that users wouldn't be confused when
    they consult the Help menu.

    Signed-off-by: Jean Sacren
    Signed-off-by: David S. Miller

    Jean Sacren
     

27 Mar, 2013

3 commits

  • Use common function to calculate rtnl_link_stats64 stats.

    Signed-off-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Pravin B Shelar
     
  • Reuse common ip-tunneling code which is re-factored from GRE
    module.

    Signed-off-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Pravin B Shelar
     
  • The following patch refactors GRE code into ip tunneling code and
    GRE-specific code. Common tunneling code is moved to the ip_tunnel
    module. The ip_tunnel module is written as a generic library which
    can be used by different tunneling implementations.

    ip_tunnel module contains following components:
    - packet xmit and rcv generic code. xmit flow looks like
    (gre_xmit/ipip_xmit)->ip_tunnel_xmit->ip_local_out.
    - hash table of all devices.
    - lookup for tunnel devices.
    - control plane operations like device create, destroy, ioctl, netlink
    operations code.
    - registration for tunneling modules, like gre, ipip etc.
    - define single pcpu_tstats dev->tstats.
    - struct tnl_ptk_info added to pass parsed tunnel packet parameters.

    ipip.h header is renamed to ip_tunnel.h

    Signed-off-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Pravin B Shelar
     

12 Jan, 2013

1 commit

  • The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
    while now and is almost always enabled by default. As agreed during the
    Linux kernel summit, remove it from any "depends on" lines in Kconfigs.

    CC: "David S. Miller"
    CC: Alexey Kuznetsov
    CC: James Morris
    CC: Hideaki YOSHIFUJI
    CC: Patrick McHardy
    Signed-off-by: Kees Cook
    Acked-by: David S. Miller

    Kees Cook
     

19 Jul, 2012

1 commit


16 May, 2012

2 commits

  • We are going to delete the Token ring support. This removes any
    special processing in the core networking for token ring, (aside
    from net/tr.c itself), leaving the drivers and remaining tokenring
    support present but inert.

    The mass removal of the drivers and net/tr.c will be in a separate
    commit, so that the history of these files that we still care
    about won't have the giant deletion tied into their history.

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     
  • By making this a standalone config option (auto-selected as needed),
    selecting CRYPTO from here rather than from XFRM (which is boolean)
    allows the core crypto code to become a module again even when XFRM=y.

    Signed-off-by: Jan Beulich
    Signed-off-by: David S. Miller

    Jan Beulich
     

08 Feb, 2012

1 commit


10 Jan, 2012

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
    igmp: Avoid zero delay when receiving odd mixture of IGMP queries
    netdev: make net_device_ops const
    bcm63xx: make ethtool_ops const
    usbnet: make ethtool_ops const
    net: Fix build with INET disabled.
    net: introduce netif_addr_lock_nested() and call if when appropriate
    net: correct lock name in dev_[uc/mc]_sync documentations.
    net: sk_update_clone is only used in net/core/sock.c
    8139cp: fix missing napi_gro_flush.
    pktgen: set correct max and min in pktgen_setup_inject()
    smsc911x: Unconditionally include linux/smscphy.h in smsc911x.h
    asix: fix infinite loop in rx_fixup()
    net: Default UDP and UNIX diag to 'n'.
    r6040: fix typo in use of MCR0 register bits
    net: fix sock_clone reference mismatch with tcp memcontrol

    Linus Torvalds
     

09 Jan, 2012

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (53 commits)
    Kconfig: acpi: Fix typo in comment.
    misc latin1 to utf8 conversions
    devres: Fix a typo in devm_kfree comment
    btrfs: free-space-cache.c: remove extra semicolon.
    fat: Spelling s/obsolate/obsolete/g
    SCSI, pmcraid: Fix spelling error in a pmcraid_err() call
    tools/power turbostat: update fields in manpage
    mac80211: drop spelling fix
    types.h: fix comment spelling for 'architectures'
    typo fixes: aera -> area, exntension -> extension
    devices.txt: Fix typo of 'VMware'.
    sis900: Fix enum typo 'sis900_rx_bufer_status'
    decompress_bunzip2: remove invalid vi modeline
    treewide: Fix comment and string typo 'bufer'
    hyper-v: Update MAINTAINERS
    treewide: Fix typos in various parts of the kernel, and fix some comments.
    clockevents: drop unknown Kconfig symbol GENERIC_CLOCKEVENTS_MIGR
    gpio: Kconfig: drop unknown symbol 'CS5535_GPIO'
    leds: Kconfig: Fix typo 'D2NET_V2'
    sound: Kconfig: drop unknown symbol ARCH_CLPS7500
    ...

    Fix up trivial conflicts in arch/powerpc/platforms/40x/Kconfig (some new
    kconfig additions, close to removed commented-out old ones)

    Linus Torvalds
     

08 Jan, 2012

1 commit


11 Dec, 2011

1 commit

  • Eric Dumazet reported that when inet_diag is built in, udp_diag is also
    built in, and when ipv6 is a module the udp6 lookup symbol is not found:

    LD .tmp_vmlinux1
    net/built-in.o: In function `udp_dump_one':
    udp_diag.c:(.text+0xa2b40): undefined reference to `__udp6_lib_lookup'
    make: *** [.tmp_vmlinux1] Erreur 1

    Fix this by making udp diag build mode depend on both -- inet diag and ipv6.

    Reported-by: Eric Dumazet
    Signed-off-by: Pavel Emelyanov
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Pavel Emelyanov
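
The dependency described in this commit can be modelled with Kconfig's tristate arithmetic, where n < m < y, `&&` takes the minimum and `||` takes the maximum. The sketch below is illustrative only: the expression `INET_DIAG && (IPV6 || IPV6=n)` is an assumed form of the fix (the message says only that udp diag depends on both), and the helper names are hypothetical.

```python
# Kconfig tristate values, ordered n < m < y.
N, M, Y = 0, 1, 2


def t_and(a, b):
    """Kconfig '&&' on tristates: the minimum of both operands."""
    return min(a, b)


def t_or(a, b):
    """Kconfig '||' on tristates: the maximum of both operands."""
    return max(a, b)


def t_eq(a, b):
    """Kconfig '=' comparison: y when equal, n otherwise."""
    return Y if a == b else N


def udp_diag_limit(inet_diag, ipv6):
    """Upper limit for udp diag under the ASSUMED rule
    INET_DIAG && (IPV6 || IPV6=n)."""
    return t_and(inet_diag, t_or(ipv6, t_eq(ipv6, N)))
```

With inet diag built in and ipv6 modular, `udp_diag_limit(Y, M)` evaluates to `M`, so udp diag can no longer be built in and reference a symbol that lives in a module.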
     

10 Dec, 2011

1 commit


31 Oct, 2011

1 commit

  • These comments mention CONFIG options that do not exist: neither as a
    symbol in a Kconfig file (without the CONFIG_ prefix) nor as a symbol
    (with that prefix) in the code.

    There's one reference to XSCALE_PMU_TIMER as a negative dependency.
    But XSCALE_PMU_TIMER is never defined (CONFIG_XSCALE_PMU_TIMER is
    also unused in the code). It shows up with type "unknown" if you search
    for it in menuconfig. Apparently a negative dependency on an unknown
    symbol is always true. That negative dependency can be removed too.

    Signed-off-by: Paul Bolle
    Signed-off-by: Jiri Kosina

    Paul Bolle
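
Why a negative dependency on an unknown symbol is always true can be sketched as follows. Kconfig evaluates a symbol that is never defined as n, and `!` maps a tristate value v to y minus v, so `!XSCALE_PMU_TIMER` comes out as y. The dictionary-based lookup and the function names are hypothetical, chosen only for this illustration.

```python
# Kconfig tristate values, ordered n < m < y.
N, M, Y = 0, 1, 2


def value(symbols, name):
    """Look up a symbol; one that was never defined evaluates to n."""
    return symbols.get(name, N)


def t_not(a):
    """Kconfig '!' on tristates: y becomes n, n becomes y, m stays m."""
    return Y - a


# No Kconfig file defines XSCALE_PMU_TIMER, so the table is empty.
defined = {}
negative_dependency = t_not(value(defined, "XSCALE_PMU_TIMER"))
```

Here `negative_dependency` equals `Y`, which means the dependency never restricts anything and can be dropped without changing behaviour.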
     

02 Feb, 2011

1 commit

  • The time has finally come to remove the hash based routing table
    implementation in ipv4.

    FIB Trie is mature, well tested, and I've done an audit of its code
    to confirm that it implements insert, delete, and lookup with the same
    semantics as fib_hash did.

    If there are any semantic differences found in fib_trie, we should
    simply fix them.

    I've placed the trie statistic config option under advanced router
    configuration.

    Signed-off-by: David S. Miller
    Acked-by: Stephen Hemminger

    David S. Miller
     

20 Jan, 2011

1 commit


14 Jan, 2011

1 commit

  • Fix dependencies of netfilter realm match: it depends on NET_CLS_ROUTE,
    which itself depends on NET_SCHED; this dependency is missing from netfilter.

    Since matching on realms is also useful without having NET_SCHED enabled and
    the option really only controls whether the tclassid member is included in
    route and dst entries, rename the config option to IP_ROUTE_CLASSID and move
    it outside of traffic scheduling context to get rid of the NET_SCHED dependency.

    Reported-by: Vladis Kletnieks
    Signed-off-by: Patrick McHardy

    Patrick McHardy
     

16 Nov, 2010

1 commit

  • Some of the documentation refers to web pages under
    the domain `osdl.org'. However, `osdl.org' now
    redirects to `linuxfoundation.org'.

    Rather than rely on redirections, this patch updates
    the addresses appropriately; for the most part, only
    documentation that is meant to be current has been
    updated.

    The patch should be pretty quick to scan and check;
    each new web-page URL was obtained by trying out the
    original URL in a browser and then simply copying
    the redirected URL (formatting as necessary).

    There is some conflict as to which one of these domain
    names is preferred:

    linuxfoundation.org
    linux-foundation.org

    So, I wrote:

    info@linuxfoundation.org

    and got this reply:

    Message-ID:
    Date: Mon, 15 Nov 2010 10:41:42 -0800
    From: David Ames

    ...

    linuxfoundation.org is preferred. The canonical name for our web site is
    www.linuxfoundation.org. Our list site is actually
    lists.linux-foundation.org.

    Regarding email linuxfoundation.org is preferred there are a few people
    who choose to use linux-foundation.org for their own reasons.

    Consequently, I used `linuxfoundation.org' for web pages and
    `lists.linux-foundation.org' for mailing-list web pages and email addresses;
    the only personal email address I updated from `@osdl.org' was that of
    Andrew Morton, who prefers `linux-foundation.org' according to `git log'.

    Signed-off-by: Michael Witten
    Signed-off-by: Jiri Kosina

    Michael Witten
     

25 Oct, 2010

1 commit

  • * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
    Update broken web addresses in arch directory.
    Update broken web addresses in the kernel.
    Revert "drivers/usb: Remove unnecessary return's from void functions" for musb gadget
    Revert "Fix typo: configuation => configuration" partially
    ida: document IDA_BITMAP_LONGS calculation
    ext2: fix a typo on comment in ext2/inode.c
    drivers/scsi: Remove unnecessary casts of private_data
    drivers/s390: Remove unnecessary casts of private_data
    net/sunrpc/rpc_pipe.c: Remove unnecessary casts of private_data
    drivers/infiniband: Remove unnecessary casts of private_data
    drivers/gpu/drm: Remove unnecessary casts of private_data
    kernel/pm_qos_params.c: Remove unnecessary casts of private_data
    fs/ecryptfs: Remove unnecessary casts of private_data
    fs/seq_file.c: Remove unnecessary casts of private_data
    arm: uengine.c: remove C99 comments
    arm: scoop.c: remove C99 comments
    Fix typo configue => configure in comments
    Fix typo: configuation => configuration
    Fix typo interrest[ing|ed] => interest[ing|ed]
    Fix various typos of valid in comments
    ...

    Fix up trivial conflicts in:
    drivers/char/ipmi/ipmi_si_intf.c
    drivers/usb/gadget/rndis.c
    net/irda/irnet/irnet_ppp.c

    Linus Torvalds
     

18 Oct, 2010

1 commit

  • The patch below updates broken web addresses in the kernel.

    Signed-off-by: Justin P. Mattock
    Cc: Maciej W. Rozycki
    Cc: Geert Uytterhoeven
    Cc: Finn Thain
    Cc: Randy Dunlap
    Cc: Matt Turner
    Cc: Dimitry Torokhov
    Cc: Mike Frysinger
    Acked-by: Ben Pfaff
    Acked-by: Hans J. Koch
    Reviewed-by: Finn Thain
    Signed-off-by: Jiri Kosina

    Justin P. Mattock
     

05 Oct, 2010

1 commit


04 Oct, 2010

1 commit

  • This reverts commit e81963b180ac502fda0326edf059b1e29cdef1a2.

    LRO is now deprecated in favour of GRO, and only a few drivers use it,
    so it is desirable to build it as a module in distribution kernels.

    The original change to prevent building it as a module was made in an
    attempt to avoid the case where some dependents are set to y and some
    to m, and INET_LRO can be set to m rather than y. However, the
    Kconfig system will reliably set INET_LRO=y in this case.

    Signed-off-by: Ben Hutchings
    Signed-off-by: David S. Miller

    Ben Hutchings
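
The claim that Kconfig reliably ends up with INET_LRO=y when dependents are mixed can be illustrated with a small model of `select`: each selecting symbol imposes a lower limit on the selected one, and with several selectors the largest limit wins. This is a sketch under the assumption that the dependents reach INET_LRO through `select`; the function name and the example values are hypothetical.

```python
# Kconfig tristate values, ordered n < m < y.
N, M, Y = 0, 1, 2


def effective_value(user_choice, selectors):
    """Value of a selected tristate symbol.

    Each selector forces a lower limit equal to its own value, and
    the largest such limit applies; the user's choice can only raise
    the result above that limit, never lower it.
    """
    lower_limit = max(selectors, default=N)
    return max(user_choice, lower_limit)
```

For example, `effective_value(M, [Y, M])` gives `Y`: one built-in driver is enough to force the built-in setting even if the user asked for a module, while `effective_value(N, [M, M])` gives `M`.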