16 Oct, 2018

1 commit

  • Add a generic sk_msg layer, and convert the current sockmap code,
    and later kTLS, over to make use of it. While sk_buff handles
    network packet representation from the netdevice up to the socket,
    sk_msg handles data representation from the application down to the
    socket layer.

    This means that the sk_msg framework spans across ULP users in the
    kernel, and enables features such as introspection or filtering of
    application data with the help of BPF programs that operate on this
    data structure.
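
    As an aside (illustration only, not part of this commit), an SK_MSG
    BPF program operating on such application data might look roughly
    like this; the 'Q'-byte policy is purely hypothetical:

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    SEC("sk_msg")
    int msg_verdict(struct sk_msg_md *msg)
    {
        void *data = (void *)(long)msg->data;
        void *data_end = (void *)(long)msg->data_end;

        /* hypothetical policy: drop messages whose first byte is 'Q' */
        if (data + 1 <= data_end && *(char *)data == 'Q')
            return SK_DROP;
        return SK_PASS;
    }

    char _license[] SEC("license") = "GPL";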

    This BPF-based introspection becomes particularly useful for kTLS,
    where data encryption is deferred into the kernel: it enables the
    kernel to perform L7 introspection and BPF-based policy for TLS
    connections, since the record is encrypted only after BPF has run
    and come to a verdict. To get there, the first step is to transform
    the open-coded scatter-gather list handling into a common core
    framework that subsystems can use.

    The code itself has been split and refactored into three bigger
    pieces: i) the generic sk_msg API, which deals with managing the
    scatter-gather ring, providing helpers for walking and mangling it,
    transferring application data from user space into it, and preparing
    it for BPF pre/post-processing; ii) the plain sock map itself, where
    sockets can be attached to or detached from; these bits are
    independent of i), which can now also be used without sock map; and
    iii) the integration with plain TCP as one protocol to be used for
    processing L7 application data (later this could e.g. also be
    extended to other protocols like UDP). The semantics are the same as
    with the old sock map code, so there is no change in user-facing
    behavior or APIs. Pursuing this work also helped find a number of
    bugs in the old sockmap code, which we've fixed already in earlier
    commits. The test_sockmap kselftest suite passes fine as well.

    Joint work with John.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: John Fastabend
    Signed-off-by: Alexei Starovoitov

    Daniel Borkmann
     

25 Jul, 2018

1 commit


24 May, 2018

2 commits

  • This is a followup to the fib rules sport, dport and ipproto match
    support. Only tcp, udp and icmp are supported for ipproto. It is
    used by the fib rule self tests.
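
    An illustrative rule using these selectors (recent iproute2 syntax;
    the values are examples only):

    ip rule add ipproto tcp sport 10000 dport 443 table 100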

    Signed-off-by: Roopa Prabhu
    Signed-off-by: David S. Miller

    Roopa Prabhu
     
  • bpfilter.ko consists of bpfilter_kern.c (normal kernel module code)
    and user mode helper code that is embedded into bpfilter.ko

    The steps to build bpfilter.ko are the following:
    - main.c is compiled by HOSTCC into the bpfilter_umh elf executable file
    - with quite a bit of objcopy and Makefile magic, the bpfilter_umh elf
    file is converted into the bpfilter_umh.o object file
    with the _binary_net_bpfilter_bpfilter_umh_start and _end symbols
    Example:
    $ nm ./bld_x64/net/bpfilter/bpfilter_umh.o
    0000000000004cf8 T _binary_net_bpfilter_bpfilter_umh_end
    0000000000004cf8 A _binary_net_bpfilter_bpfilter_umh_size
    0000000000000000 T _binary_net_bpfilter_bpfilter_umh_start
    - bpfilter_umh.o and bpfilter_kern.o are linked together into bpfilter.ko

    bpfilter_kern.c is normal kernel module code that calls
    the fork_usermode_blob() helper to execute part of its own data
    as a user mode process.

    Notice that _binary_net_bpfilter_bpfilter_umh_start - end is placed
    into the .init.rodata section, so it's freed as soon as the __init
    function of bpfilter.ko has finished.
    As part of __init, bpfilter.ko does a first request/reply action
    via the two unix pipes provided by the fork_usermode_blob() helper
    to make sure that the umh is healthy. If it is not, bpfilter.ko
    kills it via its pid.

    Later, bpfilter_process_sockopt() will be called from the bpfilter
    hooks in get/setsockopt() to pass iptables commands into the umh
    via bpfilter.ko.

    If the admin does 'rmmod bpfilter', the __exit code of bpfilter.ko
    will kill the umh as well.
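
    For orientation, a hedged sketch of the umh side's main loop; the
    request/reply structures here are hypothetical stand-ins, not the
    real bpfilter message layout:

    #include <unistd.h>

    struct req { int cmd; };    /* hypothetical request layout */
    struct rep { int status; }; /* hypothetical reply layout */

    int main(void)
    {
        struct req rq;
        struct rep rp = { 0 };

        /* fds 0/1 are the pipe pair wired up by fork_usermode_blob() */
        while (read(0, &rq, sizeof(rq)) == sizeof(rq)) {
            rp.status = 0;  /* ack every command in this sketch */
            if (write(1, &rp, sizeof(rp)) != sizeof(rp))
                return 1;
        }
        return 0;
    }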

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

18 Apr, 2018

1 commit

  • Move the logic of fib_convert_metrics into ip_metrics_convert. This
    allows the code that converts netlink attributes into a metrics
    struct to be re-used by IPv6 in a later patch.

    This is mostly a code move with the following changes to variable names:
    - fi->fib_net becomes net
    - fc_mx and fc_mx_len are passed as inputs pulled from fib_config
    - metrics array is passed as an input from fi->fib_metrics->metrics

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

02 Mar, 2018

1 commit

  • The two implementations have almost identical structures - vif_device
    and mif_device. As a step toward unifying the mr_tables, eliminate
    mif_device and relocate the vif_device definition into a new common
    header file.

    Also, introduce a common initializing function for setting most of the
    vif_device fields in a new common source file. This requires modifying
    the ipv{4,6} Kconfig and the ipv4 Makefile, as we're introducing a new
    common config option - CONFIG_IP_MROUTE_COMMON.

    Signed-off-by: Yuval Mintz
    Acked-by: Nikolay Aleksandrov
    Signed-off-by: David S. Miller

    Yuval Mintz
     

03 Jan, 2018

1 commit


02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the
    'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally
    binding shorthand, which can be used instead of the full boilerplate
    text.
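
    For example, a single comment line at the top of a C file now
    carries the license:

    // SPDX-License-Identifier: GPL-2.0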

    This patch is based on work done by Thomas Gleixner, Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset
    of the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied
    to a file was done in a spreadsheet of side-by-side results from the
    output of two independent scanners (ScanCode & Windriver) producing
    SPDX tag:value files created by Philippe Ombredanne. Philippe prepared
    the base worksheet, and did an initial spot review of a few thousand
    files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained
    >5 lines of source.
    - File already had some variant of a license header in it (even if
    <5 lines).
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

16 Jun, 2017

1 commit

  • Add the infrastructure for attaching Upper Layer Protocols (ULPs) over
    TCP sockets, based on a similar infrastructure in tcp_cong. The idea
    is that any ULP can add its own logic by changing the TCP proto_ops
    structure to its own methods.

    Example usage:

    setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls"));

    modules will call:
    tcp_register_ulp(&tcp_tls_ulp_ops);

    to register/unregister their ulp, with an init function and name.
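
    A hedged sketch of what such a module might look like (the tls_init
    body and the module boilerplate are illustrative only):

    #include <linux/module.h>
    #include <net/tcp.h>

    static int tls_init(struct sock *sk)
    {
        /* illustrative: install ULP-specific proto ops on sk here */
        return 0;
    }

    static struct tcp_ulp_ops tcp_tls_ulp_ops __read_mostly = {
        .name  = "tls",
        .owner = THIS_MODULE,
        .init  = tls_init,
    };

    static int __init tls_register(void)
    {
        return tcp_register_ulp(&tcp_tls_ulp_ops);
    }

    static void __exit tls_unregister(void)
    {
        tcp_unregister_ulp(&tcp_tls_ulp_ops);
    }

    module_init(tls_register);
    module_exit(tls_unregister);
    MODULE_LICENSE("GPL");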

    A list of registered ulps will be returned by tcp_get_available_ulp, which is
    hooked up to /proc. Example:

    $ cat /proc/sys/net/ipv4/tcp_available_ulp
    tls

    There is currently no functionality to remove or chain ULPs, but
    it should be possible to add these in the future if needed.

    Signed-off-by: Boris Pismenny
    Signed-off-by: Dave Watson
    Signed-off-by: David S. Miller

    Dave Watson
     

11 Mar, 2017

1 commit

  • Most of the code concerned with the FIB notification chain currently
    resides in fib_trie.c, but this isn't really appropriate, as the FIB
    notification chain is also used for FIB rules.

    Therefore, it makes sense to move the common FIB notification code to a
    separate file and have it export the relevant functions, which can be
    invoked by its different users (e.g., fib_trie.c, fib_rules.c).

    Signed-off-by: Ido Schimmel
    Signed-off-by: Jiri Pirko
    Acked-by: David Ahern
    Signed-off-by: David S. Miller

    Ido Schimmel
     

15 Feb, 2017

1 commit

  • This patch adds GRO infrastructure and callbacks for ESP on
    ipv4 and ipv6.

    In case the GRO layer detects an ESP packet, the
    esp{4,6}_gro_receive() function does an xfrm state lookup
    and calls the xfrm input layer if it finds a matching state.
    The packet will be decapsulated and reinjected into layer 2.

    Signed-off-by: Steffen Klassert

    Steffen Klassert
     

24 Oct, 2016

1 commit

  • In criu we are actively using the diag interface to collect sockets
    present in the system when dumping applications. And while it works
    as expected for unix, tcp, udp[lite], packet and netlink sockets,
    raw sockets are not covered. Thus add it.

    v2:
    - add missing sock_put calls in raw_diag_dump_one (by eric.dumazet@)
    - implement @destroy for diag requests (by dsa@)

    v3:
    - add export of raw_abort for IPv6 (by dsa@)
    - pass net-admin flag into inet_sk_diag_fill due to
    changes in net-next branch (by dsa@)

    v4:
    - use @pad in struct inet_diag_req_v2 for raw socket
    protocol specification: the raw module carries sockets
    which may have a custom protocol passed from the socket()
    syscall, and the sole @sdiag_protocol is not enough to
    match the underlying ones
    - start reporting the protocol specified in the socket()
    call when sockets are raw ones, for the same reason: user
    space tools like ss may parse this attribute and use
    it for socket matching

    v5 (by eric.dumazet@):
    - use sock_hold in raw_sock_get instead of atomic_inc;
    we're holding (raw_v4_hashinfo|raw_v6_hashinfo)->lock
    when looking up, so the counter won't be zero here.

    v6:
    - use the sdiag_raw_protocol() helper which will access the @pad
    structure member used for raw sockets protocol specification:
    we can't simply rename this member without breaking uapi

    v7:
    - since the sdiag_raw_protocol() helper is not suitable for
    uapi, rather make an alias structure with proper
    names. The __check_inet_diag_req_raw helper will catch
    if either structure is unintentionally changed.

    CC: David S. Miller
    CC: Eric Dumazet
    CC: David Ahern
    CC: Alexey Kuznetsov
    CC: James Morris
    CC: Hideaki YOSHIFUJI
    CC: Patrick McHardy
    CC: Andrey Vagin
    CC: Stephen Hemminger
    Signed-off-by: Cyrill Gorcunov
    Signed-off-by: David S. Miller

    Cyrill Gorcunov
     

21 Sep, 2016

2 commits

  • This commit implements a new TCP congestion control algorithm: BBR
    (Bottleneck Bandwidth and RTT). A detailed description of BBR will be
    published in ACM Queue, Vol. 14 No. 5, September-October 2016, as
    "BBR: Congestion-Based Congestion Control".

    BBR has significantly increased throughput and reduced latency for
    connections on Google's internal backbone networks and google.com and
    YouTube Web servers.

    BBR requires only changes on the sender side, not in the network or
    the receiver side. Thus it can be incrementally deployed on today's
    Internet, or in datacenters.

    The Internet has predominantly used loss-based congestion control
    (largely Reno or CUBIC) since the 1980s, relying on packet loss as the
    signal to slow down. While this worked well for many years, loss-based
    congestion control is unfortunately out-dated in today's networks. On
    today's Internet, loss-based congestion control causes the infamous
    bufferbloat problem, often causing seconds of needless queuing delay,
    since it fills the bloated buffers in many last-mile links. On today's
    high-speed long-haul links using commodity switches with shallow
    buffers, loss-based congestion control has abysmal throughput because
    it over-reacts to losses caused by transient traffic bursts.

    In 1981 Kleinrock and Gale showed that the optimal operating point for
    a network maximizes delivered bandwidth while minimizing delay and
    loss, not only for single connections but for the network as a
    whole. Finding that optimal operating point has been elusive, since
    any single network measurement is ambiguous: network measurements are
    the result of both bandwidth and propagation delay, and those two
    cannot be measured simultaneously.

    While it is impossible to disambiguate any single bandwidth or RTT
    measurement, a connection's behavior over time tells a clearer
    story. BBR uses a measurement strategy designed to resolve this
    ambiguity. It combines these measurements with a robust servo loop
    using recent control systems advances to implement a distributed
    congestion control algorithm that reacts to actual congestion, not
    packet loss or transient queue delay, and is designed to converge with
    high probability to a point near the optimal operating point.

    In a nutshell, BBR creates an explicit model of the network pipe by
    sequentially probing the bottleneck bandwidth and RTT. On the arrival
    of each ACK, BBR derives the current delivery rate of the last round
    trip, and feeds it through a windowed max-filter to estimate the
    bottleneck bandwidth. Conversely it uses a windowed min-filter to
    estimate the round trip propagation delay. The max-filtered bandwidth
    and min-filtered RTT estimates form BBR's model of the network pipe.

    Using its model, BBR sets control parameters to govern sending
    behavior. The primary control is the pacing rate: BBR applies a gain
    multiplier to transmit faster or slower than the observed bottleneck
    bandwidth. The conventional congestion window (cwnd) is now the
    secondary control; the cwnd is set to a small multiple of the
    estimated BDP (bandwidth-delay product) in order to allow full
    utilization and bandwidth probing while bounding the potential amount
    of queue at the bottleneck.
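
    A hedged numeric illustration of these two controls (not the kernel
    code; the gains of 1.0 and 2.0 are example values):

    #include <stdio.h>

    int main(void)
    {
        double max_bw = 10e9 / 8 / 1500;  /* 10 Gbps as 1500B pkts/sec */
        double min_rtt = 0.100;           /* 100 ms, in seconds */
        double pacing_gain = 1.0, cwnd_gain = 2.0;

        double bdp = max_bw * min_rtt;    /* estimated pipe size, pkts */
        double pacing_rate = pacing_gain * max_bw;
        double cwnd = cwnd_gain * bdp;

        printf("pacing %.0f pkts/s, cwnd %.0f pkts (BDP %.0f)\n",
               pacing_rate, cwnd, bdp);
        return 0;
    }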

    When a BBR connection starts, it enters STARTUP mode and applies a
    high gain to perform an exponential search to quickly probe the
    bottleneck bandwidth (doubling its sending rate each round trip, like
    slow start). However, instead of continuing until it fills up the
    buffer (i.e. a loss), or until delay or ACK spacing reaches some
    threshold (like Hystart), it uses its model of the pipe to estimate
    when that pipe is full: it estimates the pipe is full when it notices
    the estimated bandwidth has stopped growing. At that point it exits
    STARTUP and enters DRAIN mode, where it reduces its pacing rate to
    drain the queue it estimates it has created.

    Then BBR enters steady state. In steady state, PROBE_BW mode cycles
    between first pacing faster to probe for more bandwidth, then pacing
    slower to drain any queue that was created if no more bandwidth was
    available, and then cruising at the estimated bandwidth to utilize the
    pipe without creating excess queue. Occasionally, on an as-needed
    basis, it sends significantly slower to probe for RTT (PROBE_RTT
    mode).

    BBR has been fully deployed on Google's wide-area backbone networks
    and we're experimenting with BBR on Google.com and YouTube on a global
    scale. Replacing CUBIC with BBR has resulted in significant
    improvements in network latency and application (RPC, browser, and
    video) metrics. For more details please refer to our upcoming ACM
    Queue publication.

    Example performance results, to illustrate the difference between BBR
    and CUBIC:

    Resilience to random loss (e.g. from shallow buffers):
    Consider a netperf TCP_STREAM test lasting 30 secs on an emulated
    path with a 10Gbps bottleneck, 100ms RTT, and 1% packet loss
    rate. CUBIC gets 3.27 Mbps, and BBR gets 9150 Mbps (2798x higher).

    Low latency with the bloated buffers common in today's last-mile links:
    Consider a netperf TCP_STREAM test lasting 120 secs on an emulated
    path with a 10Mbps bottleneck, 40ms RTT, and 1000-packet bottleneck
    buffer. Both fully utilize the bottleneck bandwidth, but BBR
    achieves this with a median RTT 25x lower (43 ms instead of 1.09
    secs).

    Our long-term goal is to improve the congestion control algorithms
    used on the Internet. We are hopeful that BBR can help advance the
    efforts toward this goal, and motivate the community to do further
    research.

    Test results, performance evaluations, feedback, and BBR-related
    discussions are very welcome in the public e-mail list for BBR:

    https://groups.google.com/forum/#!forum/bbr-dev

    NOTE: BBR *must* be used with the fq qdisc ("man tc-fq") with pacing
    enabled, since pacing is integral to the BBR design and
    implementation. BBR without pacing would not function properly, and
    may incur unnecessary high packet loss rates.

    Signed-off-by: Van Jacobson
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Nandita Dukkipati
    Signed-off-by: Eric Dumazet
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Neal Cardwell
     
  • This patch generates data delivery rate (throughput) samples on a
    per-ACK basis. These rate samples can be used by congestion control
    modules, and specifically will be used by TCP BBR in later patches in
    this series.

    Key state:

    tp->delivered: Tracks the total number of data packets (original or not)
    delivered so far. This is an already-existing field.

    tp->delivered_mstamp: the last time tp->delivered was updated.

    Algorithm:

    A rate sample is calculated as (d1 - d0)/(t1 - t0) on a per-ACK basis:

    d1: the current tp->delivered after processing the ACK
    t1: the current time after processing the ACK

    d0: the prior tp->delivered when the acked skb was transmitted
    t0: the prior tp->delivered_mstamp when the acked skb was transmitted

    When an skb is transmitted, we snapshot d0 and t0 in its control
    block in tcp_rate_skb_sent().

    When an ACK arrives, it may SACK and ACK some skbs. For each SACKed
    or ACKed skb, tcp_rate_skb_delivered() updates the rate_sample struct
    to reflect the latest (d0, t0).

    Finally, tcp_rate_gen() generates a rate sample by storing
    (d1 - d0) in rs->delivered and (t1 - t0) in rs->interval_us.
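
    A hedged sketch of that arithmetic (not the kernel code; the struct
    and function names here are illustrative):

    #include <stdio.h>

    struct rate_sample_ex { long delivered; long interval_us; };

    static void rate_gen_ex(long d0, long t0, long d1, long t1,
                            struct rate_sample_ex *rs)
    {
        rs->delivered = d1 - d0;    /* packets delivered in interval */
        rs->interval_us = t1 - t0;
    }

    int main(void)
    {
        struct rate_sample_ex rs;

        /* e.g. 100 packets delivered over 20 ms */
        rate_gen_ex(1000, 0, 1100, 20000, &rs);
        printf("bw sample: %.1f pkts/ms\n",
               1000.0 * rs.delivered / rs.interval_us);
        return 0;
    }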

    One caveat: if an skb was sent with no packets in flight, then
    tp->delivered_mstamp may be either invalid (if the connection is
    starting) or outdated (if the connection was idle). In that case,
    we'll re-stamp tp->delivered_mstamp.

    At first glance it seems t0 should always be the time when an skb was
    transmitted, but actually this could over-estimate the rate due to
    phase mismatch between transmit and ACK events. To track the delivery
    rate, we ensure that if packets are in flight then t0 and t1 are
    times at which packets were marked delivered.

    If the initial and final RTTs are different then one may be corrupted
    by some sort of noise. The noise we see most often is sending gaps
    caused by delayed, compressed, or stretched acks. This either affects
    both RTTs equally or artificially reduces the final RTT. We approach
    this by recording the info we need to compute the initial RTT
    (duration of the "send phase" of the window) when we recorded the
    associated inflight. Then, for a filter to avoid bandwidth
    overestimates, we generalize the per-sample bandwidth computation
    from:

    bw = delivered / ack_phase_rtt

    to the following:

    bw = delivered / max(send_phase_rtt, ack_phase_rtt)

    In large-scale experiments, this filtering approach incorporating
    send_phase_rtt is effective at avoiding bandwidth overestimates due to
    ACK compression or stretched ACKs.

    Signed-off-by: Van Jacobson
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Nandita Dukkipati
    Signed-off-by: Eric Dumazet
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

11 Jun, 2016

1 commit

  • TCP-NV (New Vegas) is a major update to TCP-Vegas.
    An earlier version of NV was presented at 2010's LPC.
    It is a delay-based congestion avoidance algorithm for the
    data center. This version has been tested within a
    10G rack where the HW RTTs are 20-50us and with
    1 to 400 flows.

    A description of TCP-NV, including implementation
    details as well as experimental results, can be found at:
    http://www.brakmo.org/networking/tcp-nv/TCPNV.html

    Signed-off-by: Lawrence Brakmo
    Signed-off-by: David S. Miller

    Lawrence Brakmo
     

18 Feb, 2016

1 commit


21 Jan, 2016

2 commits

  • tcp_memcontrol.c only contains legacy memory.kmem.tcp.* file
    definitions and mem_cgroup->tcp_mem init/destroy stuff. This doesn't
    belong to the network subsystem. Let's move it to memcontrol.c. This
    also allows us to reuse generic code for handling legacy memcg files.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: "David S. Miller"
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Let the user know that CONFIG_MEMCG_KMEM does not apply to the cgroup2
    interface. This also makes legacy-only code sections stand out better.

    [arnd@arndb.de: mm: memcontrol: only manage socket pressure for CONFIG_INET]
    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Acked-by: Vladimir Davydov
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

21 Oct, 2015

1 commit

  • This patch is the first half of the RACK loss recovery.

    RACK loss recovery uses the notion of time instead
    of packet sequence (FACK) or counts (dupthresh). It's inspired by the
    previous FACK heuristic in tcp_mark_lost_retrans(): when a limited
    transmit (new data packet) is sacked, then the currently retransmitted
    sequence below the newly sacked sequence must have been lost,
    since at least one round trip time has elapsed.

    But it has several limitations:
    1) can't detect tail drops since it depends on limited transmit
    2) is disabled upon reordering (assumes no reordering)
    3) only enabled in fast recovery but not timeout recovery

    RACK (Recent ACK) addresses these limitations with the notion
    of time instead: a packet P1 is lost if a later packet P2 is s/acked,
    as at least one round trip has passed.

    Since RACK cares about the time sequence instead of the data sequence
    of packets, it can detect tail drops when a later retransmission is
    s/acked, while FACK or dupthresh can't. For reordering, RACK uses a
    dynamically adjusted reordering window ("reo_wnd") to reduce false
    positives on every (small) degree of reordering.

    This patch implements tcp_advanced_rack() which tracks the
    most recent transmission time among the packets that have been
    delivered (ACKed or SACKed) in tp->rack.mstamp. This timestamp
    is the key to determine which packet has been lost.

    Consider an example where the sender sends four packets:
    T1: P1 (lost)
    T2: P2
    T3: P3
    T4: P4
    T100: sack of P2. rack.mstamp = T2
    T101: retransmit P1
    T102: sack of P2,P3,P4. rack.mstamp = T4
    T205: ACK of P4 since the hole is repaired. rack.mstamp = T101

    We need to be careful about spurious retransmission because it may
    falsely advance tp->rack.mstamp by an RTT or an RTO, causing RACK
    to falsely mark all packets lost, just like a spurious timeout.

    We identify spurious retransmission by the ACK's TS echo value.
    If TS option is not applicable but the retransmission is acknowledged
    less than min-RTT ago, it is likely to be spurious. We refrain from
    using the transmission time of these spurious retransmissions.

    The second half is implemented in the next patch that marks packet
    lost using RACK timestamp.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

28 Aug, 2015

1 commit

  • The geneve_core module handles send and receive functionality, so
    that OVS could use the Geneve API. Now, with the use of tunnel
    metadata mode, OVS can directly use a Geneve netdevice, so there is
    no need for a separate Geneve module. The following patch
    consolidates Geneve protocol processing into a single module.

    Signed-off-by: Pravin B Shelar
    Reviewed-by: Jesse Gross
    Acked-by: John W. Linville
    Signed-off-by: David S. Miller

    Pravin B Shelar
     

11 Jun, 2015

1 commit

  • CAIA Delay-Gradient (CDG) is a TCP congestion control that modifies
    the TCP sender in order to [1]:

    o Use the delay gradient as a congestion signal.
    o Back off with an average probability that is independent of the RTT.
    o Coexist with flows that use loss-based congestion control, i.e.,
    flows that are unresponsive to the delay signal.
    o Tolerate packet loss unrelated to congestion. (Disabled by default.)

    Its FreeBSD implementation was presented for the ICCRG in July 2012;
    slides are available at http://www.ietf.org/proceedings/84/iccrg.html

    Running the experiment scenarios in [1] suggests that our implementation
    achieves more goodput compared with FreeBSD 10.0 senders, although it also
    causes more queueing delay for a given backoff factor.

    The loss tolerance heuristic is disabled by default due to safety concerns
    for its use in the Internet [2, p. 45-46].

    We use a variant of the Hybrid Slow start algorithm in tcp_cubic to reduce
    the probability of slow start overshoot.

    [1] D.A. Hayes and G. Armitage. "Revisiting TCP congestion control using
    delay gradients." In Networking 2011, pages 328-341. Springer, 2011.
    [2] K.K. Jonassen. "Implementing CAIA Delay-Gradient in Linux."
    MSc thesis. Department of Informatics, University of Oslo, 2015.

    Cc: Eric Dumazet
    Cc: Yuchung Cheng
    Cc: Stephen Hemminger
    Cc: Neal Cardwell
    Cc: David Hayes
    Cc: Andreas Petlund
    Cc: Dave Taht
    Cc: Nicolas Kuhn
    Signed-off-by: Kenneth Klette Jonassen
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Kenneth Klette Jonassen
     

14 May, 2015

1 commit


06 Oct, 2014

1 commit

  • This adds device level support for Geneve -- Generic Network
    Virtualization Encapsulation. The protocol is documented at
    http://tools.ietf.org/html/draft-gross-geneve-01

    Only protocol layer Geneve support is provided by this driver.
    Open vSwitch can be used for configuring, setting up and tearing
    down functional Geneve tunnels.

    Signed-off-by: Jesse Gross
    Signed-off-by: Andy Zhou
    Signed-off-by: David S. Miller

    Andy Zhou
     

29 Sep, 2014

1 commit

  • This work adds the DataCenter TCP (DCTCP) congestion control
    algorithm [1], which was first published at SIGCOMM 2010 [2], with a
    follow-up analysis at SIGMETRICS 2011 [3] (and also, more recently,
    an informational IETF draft available at [4]).

    DCTCP is an enhancement to the TCP congestion control algorithm for
    data center networks. Typical data center workloads are, e.g.,
    i) partition/aggregate (queries; bursty, delay sensitive), ii) short
    messages e.g. 50KB-1MB (for coordination and control state; delay
    sensitive), and iii) large flows e.g. 1MB-100MB (data update;
    throughput sensitive). DCTCP has therefore been designed for such
    environments to provide/achieve the following three requirements:

    * High burst tolerance (incast due to partition/aggregate)
    * Low latency (short flows, queries)
    * High throughput (continuous data updates, large file
    transfers) with commodity, shallow buffered switches

    The basic idea of its design consists of two fundamentals: i) on the
    switch side, packets are marked when the internal queue
    length > threshold K (K is chosen so that a large enough headroom
    for marked traffic is still available in the switch queue); ii) the
    sender/host side maintains a moving average of the fraction of marked
    packets, so each RTT, F is updated as follows:

    F := X / Y, where X is # of marked ACKs, Y is total # of ACKs
    alpha := (1 - g) * alpha + g * F, where g is a smoothing constant

    The resulting alpha (iow: probability that switch queue is congested)
    is then being used in order to adaptively decrease the congestion
    window W:

    W := (1 - (alpha / 2)) * W
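
    A hedged numeric illustration of these update rules, using the
    patch's default g = 1/16 (dctcp_shift_g = 4) and example values:

    #include <stdio.h>

    int main(void)
    {
        double g = 1.0 / 16;   /* smoothing constant */
        double alpha = 0.0;    /* fraction-of-marks estimate */
        double W = 100.0;      /* congestion window, packets */

        /* one RTT in which 32 of 128 ACKs carried ECN marks */
        double F = 32.0 / 128.0;

        alpha = (1 - g) * alpha + g * F;
        W = (1 - alpha / 2) * W;

        printf("alpha=%.4f cwnd=%.2f\n", alpha, W);
        return 0;
    }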

    The means of receiving marked packets, and of marking them on the
    switch side, in DCTCP is the use of ECN.

    RFC3168 describes a mechanism for using Explicit Congestion Notification
    from the switch for early detection of congestion, rather than waiting
    for segment loss to occur.

    However, this method only detects the presence of congestion, not
    the *extent*. In the presence of mild congestion, it reduces the TCP
    congestion window too aggressively and unnecessarily affects the
    throughput of long flows [4].

    DCTCP, as mentioned, enhances Explicit Congestion Notification (ECN)
    processing to estimate the fraction of bytes that encounter congestion,
    rather than simply detecting that some congestion has occurred. DCTCP
    then scales the TCP congestion window based on this estimate [4],
    thus it can derive multibit feedback from the information present in
    the single-bit sequence of marks in its control law. And thus act in
    *proportion* to the extent of congestion, not its *presence*.

    Switches therefore set the Congestion Experienced (CE) codepoint in
    packets when internal queue lengths exceed threshold K. Resulting,
    DCTCP delivers the same or better throughput than normal TCP, while
    using 90% less buffer space.

    It was found in [2] that DCTCP enables the applications to handle 10x
    the current background traffic, without impacting foreground traffic.
    Moreover, a 10x increase in foreground traffic did not cause any
    timeouts, and thus largely eliminates TCP incast collapse problems.

    The algorithm itself has already seen deployments in large production
    data centers since then.

    We did a long-term stress-test and analysis in a data center, short
    summary of our TCP incast tests with iperf compared to cubic:

    This test measured DCTCP throughput and latency and compared it with
    CUBIC throughput and latency for an incast scenario. In this test, 19
    senders sent at maximum rate to a single receiver. The receiver simply
    ran iperf -s.

    The senders ran iperf -c <receiver> -t 30. All senders started
    simultaneously (using local clocks synchronized by ntp).

    This test was repeated multiple times. Below shows the results from a
    single test. Other tests are similar. (DCTCP results were extremely
    consistent, CUBIC results show some variance induced by the TCP timeouts
    that CUBIC encountered.)

    For this test, we report statistics on the number of TCP timeouts,
    flow throughput, and traffic latency.

    1) Timeouts (total over all flows, and per flow summaries):

                 CUBIC    DCTCP
    Total         3227       25
    Mean       169.842    1.316
    Median         183        1
    Max            207        5
    Min            123        0
    Stddev      28.991    1.600

    Timeout data is taken by measuring the net change in netstat -s
    "other TCP timeouts" reported. As a result, the timeout measurements
    above are not restricted to the test traffic, and we believe that it
    is likely that all of the "DCTCP timeouts" are actually timeouts for
    non-test traffic. We report them nevertheless. CUBIC will also include
    some non-test timeouts, but they are dwarfed by bona fide test traffic
    timeouts for CUBIC. Clearly DCTCP does an excellent job of preventing
    TCP timeouts. DCTCP reduces timeouts by at least two orders of
    magnitude and may well have eliminated them in this scenario.

    2) Throughput (per flow in Mbps):

                   CUBIC      DCTCP
    Mean         521.684    521.895
    Median           464        523
    Max              776        527
    Min              403        519
    Stddev       105.891      2.601
    Fairness       0.962      0.999

    Throughput data was simply the average throughput for each flow
    reported by iperf. By avoiding TCP timeouts, DCTCP is able to
    achieve much better per-flow results. In CUBIC, many flows
    experience TCP timeouts which makes flow throughput unpredictable and
    unfair. DCTCP, on the other hand, provides very clean predictable
    throughput without incurring TCP timeouts. Thus, the standard deviation
    of CUBIC throughput is dramatically higher than the standard deviation
    of DCTCP throughput.

    Mean throughput is nearly identical because even though cubic flows
    suffer TCP timeouts, other flows will step in and fill the unused
    bandwidth. Note that this test is something of a best case scenario
    for incast under CUBIC: it allows other flows to fill in for flows
    experiencing a timeout. Under situations where the receiver is issuing
    requests and then waiting for all flows to complete, flows cannot fill
    in for timed out flows and throughput will drop dramatically.

    3) Latency (in ms):

                CUBIC      DCTCP
    Mean       4.0088    0.04219
    Median      4.055     0.0395
    Max           4.2      0.085
    Min          3.32      0.028
    Stddev     0.1666    0.01064

    Latency for each protocol was computed by running
    "ping -i 0.2 <receiver>" from a single sender to the receiver during
    the incast test. For DCTCP, "ping -Q 0x6 -i 0.2 <receiver>" was used
    to ensure
    that traffic traversed the DCTCP queue and was not dropped when the
    queue size was greater than the marking threshold. The summary
    statistics above are over all ping metrics measured between the single
    sender, receiver pair.

    The latency results for this test show a dramatic difference between
    CUBIC and DCTCP. CUBIC intentionally overflows the switch buffer
    which incurs the maximum queue latency (more buffer memory will lead
    to high latency.) DCTCP, on the other hand, deliberately attempts to
    keep queue occupancy low. The result is a two orders of magnitude
    reduction of latency with DCTCP - even with a switch with relatively
    little RAM. Switches with larger amounts of RAM will incur increasing
    amounts of latency for CUBIC, but not for DCTCP.

    4) Convergence and stability test:

    This test measured the time that DCTCP took to fairly redistribute
    bandwidth when a new flow commences. It also measured DCTCP's ability
    to remain stable at a fair bandwidth distribution. DCTCP is compared
    with CUBIC for this test.

    At the commencement of this test, a single flow is sending at maximum
    rate (near 10 Gbps) to a single receiver. One second after that first
    flow commences, a new flow from a distinct server begins sending to
    the same receiver as the first flow. After the second flow has sent
    data for 10 seconds, the second flow is terminated. The first flow
    sends for an additional second. Ideally, the bandwidth would be evenly
    shared as soon as the second flow starts, and recover as soon as it
    stops.

    The results of this test are shown below. Note that the flow bandwidth
    for the two flows was measured near the same time, but not
    simultaneously.

    DCTCP performs nearly perfectly within the measurement limitations
    of this test: bandwidth is quickly distributed fairly between the two
    flows, remains stable throughout the duration of the test, and
    recovers quickly. CUBIC, in contrast, is slow to divide the bandwidth
    fairly, and has trouble remaining stable.

             CUBIC                         DCTCP

    Seconds  Flow 1  Flow 2        Seconds  Flow 1  Flow 2
    0        9.93    0             0        9.92    0
    0.5      9.87    0             0.5      9.86    0
    1        8.73    2.25          1        6.46    4.88
    1.5      7.29    2.8           1.5      4.9     4.99
    2        6.96    3.1           2        4.92    4.94
    2.5      6.67    3.34          2.5      4.93    5
    3        6.39    3.57          3        4.92    4.99
    3.5      6.24    3.75          3.5      4.94    4.74
    4        6       3.94          4        5.34    4.71
    4.5      5.88    4.09          4.5      4.99    4.97
    5        5.27    4.98          5        4.83    5.01
    5.5      4.93    5.04          5.5      4.89    4.99
    6        4.9     4.99          6        4.92    5.04
    6.5      4.93    5.1           6.5      4.91    4.97
    7        4.28    5.8           7        4.97    4.97
    7.5      4.62    4.91          7.5      4.99    4.82
    8        5.05    4.45          8        5.16    4.76
    8.5      5.93    4.09          8.5      4.94    4.98
    9        5.73    4.2           9        4.92    5.02
    9.5      5.62    4.32          9.5      4.87    5.03
    10       6.12    3.2           10       4.91    5.01
    10.5     6.91    3.11          10.5     4.87    5.04
    11       8.48    0             11       8.49    4.94
    11.5     9.87    0             11.5     9.9     0

    SYN/ACK ECT test:

    This test demonstrates the importance of ECT on SYN and SYN-ACK packets
    by measuring the connection probability in the presence of competing
    flows for a DCTCP connection attempt *without* ECT in the SYN packet.
    The test was repeated five times for each number of competing flows.

    Competing Flows                   1 |    2 |    4 |    8 |   16
    ---------------------------------------------------------------
    Mean Connection Probability       1 | 0.67 | 0.45 | 0.28 |    0
    Median Connection Probability     1 | 0.65 | 0.45 | 0.25 |    0

    As the number of competing flows moves beyond 1, the connection
    probability drops rapidly.

    Enabling DCTCP with this patch requires the following steps:

    DCTCP must be running both on the sender and receiver side in your
    data center, i.e.:

    sysctl -w net.ipv4.tcp_congestion_control=dctcp

    Also, ECN functionality must be enabled on all switches in your
    data center for DCTCP to work. The default ECN marking threshold (K)
    heuristic on the switch for DCTCP is e.g., 20 packets (30KB) at
    1Gbps, and 65 packets (~100KB) at 10Gbps (K > 1/7 * C * RTT, [4]).

    In the above tests, for each switch port, traffic was segregated into
    two queues. For any packet with a DSCP of 0x01 - or equivalently a TOS
    of 0x04 - the packet was placed into the DCTCP queue. All other
    packets were placed into the default drop-tail queue. For the DCTCP
    queue, RED/ECN marking was enabled with a marking threshold of 75 KB.
    For more details, we refer you to section 3 of the paper [2].

    There are no code changes required to applications running in user
    space. DCTCP has been implemented in full *isolation* of the rest of
    the TCP code as its own congestion control module, so that it can run
    without a need to expose code to the core of the TCP stack, and thus
    nothing changes for non-DCTCP users.

    Changes in the CA framework code are minimal, and the DCTCP algorithm
    operates on mechanisms that are already available in most silicon.
    The gain (dctcp_shift_g) is currently a fixed constant (1/16) from
    the paper, but we leave the option open for the user to carefully
    choose a different value.

    In case DCTCP is being used and ECN support on the peer side is off,
    DCTCP falls back after the 3WHS to operate in normal TCP Reno mode.

    ss {-4,-6} -t -i diag interface:

    ... dctcp wscale:7,7 rto:203 rtt:2.349/0.026 mss:1448 cwnd:2054
    ssthresh:1102 ce_state 0 alpha 15 ab_ecn 0 ab_tot 735584
    send 10129.2Mbps pacing_rate 20254.1Mbps unacked:1822 retrans:0/15
    reordering:101 rcv_space:29200

    ... dctcp-reno wscale:7,7 rto:201 rtt:0.711/1.327 ato:40 mss:1448
    cwnd:10 ssthresh:1102 fallback_mode send 162.9Mbps pacing_rate
    325.5Mbps rcv_rtt:1.5 rcv_space:29200

    More information about DCTCP can be found in [1-4].

    [1] http://simula.stanford.edu/~alizade/Site/DCTCP.html
    [2] http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp-final.pdf
    [3] http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp_analysis-full.pdf
    [4] http://tools.ietf.org/html/draft-bensley-tcpm-dctcp-00

    Joint work with Florian Westphal and Glenn Judd.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: Florian Westphal
    Signed-off-by: Glenn Judd
    Acked-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

20 Sep, 2014

1 commit

  • This patch provides a receive path for foo-over-udp. This allows
    direct encapsulation of IP protocols over UDP. The bound destination
    port is used to map to an IP protocol, and the XFRM framework
    (udp_encap_rcv) is used to receive encapsulated packets. Upon
    reception, the encapsulation header is logically removed (pointer
    to transport header is advanced) and the packet is reinjected into
    the receive path with the IP protocol indicated by the mapping.

    Netlink is used to configure FOU ports. The configuration information
    includes the port number to bind to and the IP protocol corresponding
    to that port.
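
    For example, with iproute2's FOU support (illustrative; this patch
    adds only the kernel receive path, and the port number is arbitrary):

    ip fou add port 5555 ipproto 4    # map UDP port 5555 to IPIP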

    This should support GRE/UDP
    (http://tools.ietf.org/html/draft-yong-tsvwg-gre-in-udp-encap-02),
    as well as the other IP tunneling protocols (IPIP, SIT).

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     

15 Jul, 2014

1 commit


25 Feb, 2014

1 commit


07 Jan, 2014

1 commit


04 Jul, 2013

1 commit


20 Jun, 2013

1 commit


12 Jun, 2013

1 commit


08 Jun, 2013

1 commit

  • It would be good to make things explicit and move those functions to
    a new file called tcp_offload.c, thus making this similar to
    tcpv6_offload.c. While moving all related functions into
    tcp_offload.c, we can also make some of them static, since they are
    only used there. Also, add an explicit registration function.

    Suggested-by: Eric Dumazet
    Signed-off-by: Daniel Borkmann
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

27 Mar, 2013

1 commit

  • The following patch refactors the GRE code into generic IP tunneling
    code and GRE-specific code. Common tunneling code is moved to the
    ip_tunnel module. The ip_tunnel module is written as a generic
    library which can be used by different tunneling implementations.

    ip_tunnel module contains following components:
    - packet xmit and rcv generic code. xmit flow looks like
    (gre_xmit/ipip_xmit)->ip_tunnel_xmit->ip_local_out.
    - hash table of all devices.
    - lookup for tunnel devices.
    - control plane operations like device create, destroy, ioctl, netlink
    operations code.
    - registration for tunneling modules, like gre, ipip etc.
    - define single pcpu_tstats dev->tstats.
    - struct tnl_ptk_info added to pass parsed tunnel packet parameters.

    ipip.h header is renamed to ip_tunnel.h

    Signed-off-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Pravin B Shelar
     

01 Aug, 2012

1 commit

  • Sanity:

    CONFIG_CGROUP_MEM_RES_CTLR -> CONFIG_MEMCG
    CONFIG_CGROUP_MEM_RES_CTLR_SWAP -> CONFIG_MEMCG_SWAP
    CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED -> CONFIG_MEMCG_SWAP_ENABLED
    CONFIG_CGROUP_MEM_RES_CTLR_KMEM -> CONFIG_MEMCG_KMEM

    [mhocko@suse.cz: fix missed bits]
    Cc: Glauber Costa
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Aneesh Kumar K.V
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

20 Jul, 2012

1 commit

  • This patch implements the common code for both the client and server.

    1. TCP Fast Open option processing. Since Fast Open does not have an
    option number assigned by IANA yet, it shares the experimental option
    code 254 by implementing draft-ietf-tcpm-experimental-options
    with a 16-bit magic number 0xF989. This enables global experiments
    without clashing with the scarce (only two) experimental options
    available for TCP.

    When the draft status becomes standard (maybe), the client should
    switch to the newly assigned option number, while the server supports
    both numbers for transition.
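
    The resulting option layout on the wire is roughly (illustrative):

    kind=254 | length | magic 0xF989 | Fast Open cookie
    (1 byte)  (1 byte)   (2 bytes)      (variable)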

    2. The new sysctl tcp_fastopen

    3. A placeholder init function

    Signed-off-by: Yuchung Cheng
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

19 Jul, 2012

1 commit


11 Jul, 2012

1 commit


13 Dec, 2011

1 commit

  • This patch introduces memory pressure controls for the tcp
    protocol. It uses the generic socket memory pressure code
    introduced in earlier patches, and fills in the
    necessary data in the cg_proto struct.

    Signed-off-by: Glauber Costa
    Reviewed-by: KAMEZAWA Hiroyuki
    CC: Eric W. Biederman
    Signed-off-by: David S. Miller

    Glauber Costa
     

10 Dec, 2011

1 commit


14 May, 2011

1 commit

  • This patch adds IPPROTO_ICMP socket kind. It makes it possible to send
    ICMP_ECHO messages and receive the corresponding ICMP_ECHOREPLY messages
    without any special privileges. In other words, the patch makes it
    possible to implement setuid-less and CAP_NET_RAW-less /bin/ping. In
    order not to increase the kernel's attack surface, the new functionality
    is disabled by default, but is enabled at bootup by supporting Linux
    distributions, optionally with restriction to a group or a group range
    (see below).

    Similar functionality is implemented in Mac OS X:
    http://www.manpagez.com/man/4/icmp/

    A new ping socket is created with

    socket(PF_INET, SOCK_DGRAM, IPPROTO_ICMP)

    Message identifiers (octets 4-5 of ICMP header) are interpreted as local
    ports. Addresses are stored in struct sockaddr_in. No port numbers are
    reserved for privileged processes, port 0 is reserved for API ("let the
    kernel pick a free number"). There is no notion of remote ports, remote
    port numbers provided by the user (e.g. in connect()) are ignored.

    Data sent and received include ICMP headers. This is deliberate to:
    1) Avoid the need to transport header values like sequence numbers
    by other means.
    2) Make it easier to port existing programs using raw sockets.

    ICMP headers given to send() are checked and sanitized. The type must be
    ICMP_ECHO and the code must be zero (future extensions might relax this,
    see below). The id is set to the number (local port) of the socket, the
    checksum is always recomputed.
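
    A hedged userspace sketch of sending one echo request over such a
    socket (assuming ping_group_range permits the caller's group; the
    loopback destination is an example):

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <netinet/ip_icmp.h>
    #include <sys/socket.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_ICMP);
        struct sockaddr_in dst = { .sin_family = AF_INET };
        struct icmphdr icmp;
        char buf[192];

        inet_pton(AF_INET, "127.0.0.1", &dst.sin_addr);
        memset(&icmp, 0, sizeof(icmp));
        icmp.type = ICMP_ECHO;            /* code 0; id set by the kernel */
        icmp.un.echo.sequence = htons(1); /* checksum recomputed by kernel */

        if (fd < 0 || sendto(fd, &icmp, sizeof(icmp), 0,
                             (struct sockaddr *)&dst, sizeof(dst)) < 0) {
            perror("ping socket");
            return 1;
        }
        if (recv(fd, buf, sizeof(buf), 0) < 0)
            perror("recv");
        else
            puts("got ICMP_ECHOREPLY");
        return 0;
    }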

    ICMP reply packets received from the network are demultiplexed according
    to their id's, and are returned by recv() without any modifications.
    IP header information and ICMP errors of those packets may be obtained
    via ancillary data (IP_RECVTTL, IP_RETOPTS, and IP_RECVERR). ICMP source
    quenches and redirects are reported as fake errors via the error queue
    (IP_RECVERR); the next hop address for redirects is saved to ee_info (in
    network order).

    socket(2) is restricted to the group range specified in
    "/proc/sys/net/ipv4/ping_group_range". It is "1 0" by default, meaning
    that nobody (not even root) may create ping sockets. Setting it to "100
    100" would grant permissions to the single group (to either make
    /sbin/ping g+s and owned by this group or to grant permissions to the
    "netadmins" group), "0 4294967295" would enable it for the world, "100
    4294967295" would enable it for the users, but not daemons.

    The existing code might be (in the unlikely case anyone needs it)
    extended rather easily to handle other similar pairs of ICMP messages
    (Timestamp/Reply, Information Request/Reply, Address Mask Request/Reply
    etc.).

    Userspace ping util & patch for it:
    http://openwall.info/wiki/people/segoon/ping

    For Openwall GNU/*/Linux it was the last step on the road to the
    setuid-less distro. A revision of this patch (for RHEL5/OpenVZ kernels)
    is in use in Owl-current, such as in the 2011/03/12 LiveCD ISOs:
    http://mirrors.kernel.org/openwall/Owl/current/iso/

    Initially this functionality was written by Pavel Kankovsky for
    Linux 2.4.32, but unfortunately it was never made public.

    All ping options (-b, -p, -Q, -R, -s, -t, -T, -M, -I), are tested with
    the patch.

    PATCH v3:
    - switched to flowi4.
    - minor changes to be consistent with raw sockets code.

    PATCH v2:
    - changed ping_debug() to pr_debug().
    - removed CONFIG_IP_PING.
    - removed ping_seq_fops.owner field (unused for procfs).
    - switched to proc_net_fops_create().
    - switched to %pK in seq_printf().

    PATCH v1:
    - fixed checksumming bug.
    - CAP_NET_RAW may not create icmp sockets anymore.

    RFC v2:
    - minor cleanups.
    - introduced sysctl'able group range to restrict socket(2).

    Signed-off-by: Vasiliy Kulikov
    Signed-off-by: David S. Miller

    Vasiliy Kulikov