25 Jul, 2018

1 commit

  • [ Upstream commit 70ba5b6db96ff7324b8cfc87e0d0383cf59c9677 ]

    The low and high values of the net.ipv4.ping_group_range sysctl were
    being silently forced to the default disabled state when a write to the
    sysctl contained GIDs that didn't map to the associated user namespace.
    Confusingly, the sysctl's write operation would return success and then
    a subsequent read of the sysctl would indicate that the low and high
    values are the overflowgid.

    This patch changes the behavior by clearly returning an error when the
    sysctl write operation receives a GID range that doesn't map to the
    associated user namespace. In such a situation, the previous value of
    the sysctl is preserved and that range will be returned in a subsequent
    read of the sysctl.

    Signed-off-by: Tyler Hicks
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Tyler Hicks
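
The write behaviour described in this commit can be modelled in user space. The following is a minimal sketch, not the kernel implementation: the `PingGroupRange` class and its `gid_map` dictionary are hypothetical stand-ins for the sysctl and the user namespace's GID mapping, and the `(1, 0)` initial value assumes the usual "1 0" disabled default.

```python
import errno


class PingGroupRange:
    """Toy model of a per-namespace GID range setting (not kernel code)."""

    def __init__(self, gid_map, low=1, high=0):
        # gid_map is a hypothetical dict: namespace GID -> kernel GID.
        self.gid_map = gid_map
        # (1, 0) models the disabled default (assumption, see lead-in).
        self.low, self.high = low, high

    def write(self, low, high):
        """Return 0 on success, -EINVAL if either GID has no mapping."""
        if low not in self.gid_map or high not in self.gid_map:
            # Reject the write and leave the previous range untouched.
            return -errno.EINVAL
        self.low, self.high = low, high
        return 0

    def read(self):
        """Return the currently stored (low, high) pair."""
        return (self.low, self.high)
```

In this model a failed write is visible to the caller as an error, and a later `read()` still returns the range that was set before the failed write.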
     

22 Jul, 2018

1 commit

  • [ Upstream commit c860e997e9170a6d68f9d1e6e2cf61f572191aaf ]

    The Fast Open key could be stored in a different byte order depending
    on the CPU. Previously, hosts of different endianness in a server farm
    using the same key config (sysctl value) would produce different cookies.
    This patch fixes it by always storing the key as little endian, which
    keeps the same API for LE hosts.

    Reported-by: Daniele Iamartino
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Yuchung Cheng
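
The effect of a fixed byte order can be illustrated with a short sketch. It assumes the key is written as four dash-separated 32-bit hex words; the helper names are hypothetical and the code only models the serialization, not the cookie computation.

```python
import struct


def parse_key(text):
    """Parse 'xxxxxxxx-xxxxxxxx-xxxxxxxx-xxxxxxxx' into four 32-bit words."""
    words = [int(part, 16) for part in text.split("-")]
    if len(words) != 4 or any(w > 0xFFFFFFFF for w in words):
        raise ValueError("expected four 32-bit hex words")
    return words


def key_bytes_le(text):
    """Serialize the words little-endian, independent of the host CPU."""
    return struct.pack("<4I", *parse_key(text))


def key_bytes_be(text):
    """What a big-endian host would hold if it stored words in native order."""
    return struct.pack(">4I", *parse_key(text))
```

The same key string yields different byte sequences under the two packings, which is why hosts of mixed endianness produced different cookies until the stored form was pinned to one byte order.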
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

01 Aug, 2017

1 commit


16 Jun, 2017

1 commit

  • Add the infrastructure for attaching Upper Layer Protocols (ULPs) over TCP
    sockets. Based on a similar infrastructure in tcp_cong. The idea is that any
    ULP can add its own logic by changing the TCP proto_ops structure to its own
    methods.

    Example usage:

    setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls"));

    modules will call:
    tcp_register_ulp(&tcp_tls_ulp_ops);

    to register/unregister their ulp, with an init function and name.

    A list of registered ulps will be returned by tcp_get_available_ulp, which is
    hooked up to /proc. Example:

    $ cat /proc/sys/net/ipv4/tcp_available_ulp
    tls

    There is currently no functionality to remove or chain ULPs, but
    it should be possible to add these in the future if needed.

    Signed-off-by: Boris Pismenny
    Signed-off-by: Dave Watson
    Signed-off-by: David S. Miller

    Dave Watson
     

08 Jun, 2017

3 commits


25 Apr, 2017

2 commits

  • Middlebox firewall issues can potentially cause the server's data to be
    blackholed after a successful 3WHS using TFO. The following are the
    related reports from Apple:
    https://www.nanog.org/sites/default/files/Paasch_Network_Support.pdf
    Slide 31 identifies an issue where the client ACK to the server's data
    sent during a TFO'd handshake is dropped.
    C ---> syn-data ---> S
    C X S
    [retry and timeout]

    https://www.ietf.org/proceedings/94/slides/slides-94-tcpm-13.pdf
    Slide 5 shows a similar situation that the server's data gets dropped
    after 3WHS.
    C ---- syn-data ---> S
    C S
    S (accept & write)
    C? X
    Acked-by: Yuchung Cheng
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Wei Wang
     
  • systemd-sysctl is triggering a suspicious RCU usage message when
    net.ipv4.tcp_early_demux or net.ipv4.udp_early_demux is changed via
    a sysctl config file:

    [ 33.896184] ===============================
    [ 33.899558] [ ERR: suspicious RCU usage. ]
    [ 33.900624] 4.11.0-rc7+ #104 Not tainted
    [ 33.901698] -------------------------------
    [ 33.903059] /home/dsa/kernel-2.git/net/ipv4/sysctl_net_ipv4.c:305 suspicious rcu_dereference_check() usage!
    [ 33.905724]
    other info that might help us debug this:

    [ 33.907656]
    rcu_scheduler_active = 2, debug_locks = 0
    [ 33.909288] 1 lock held by systemd-sysctl/143:
    [ 33.910373] #0: (sb_writers#5){.+.+.+}, at: [] file_start_write+0x45/0x48
    [ 33.912407]
    stack backtrace:
    [ 33.914018] CPU: 0 PID: 143 Comm: systemd-sysctl Not tainted 4.11.0-rc7+ #104
    [ 33.915631] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
    [ 33.917870] Call Trace:
    [ 33.918431] dump_stack+0x81/0xb6
    [ 33.919241] lockdep_rcu_suspicious+0x10f/0x118
    [ 33.920263] proc_configure_early_demux+0x65/0x10a
    [ 33.921391] proc_udp_early_demux+0x3a/0x41

    Add RCU locking to proc_configure_early_demux.

    Fixes: dddb64bcb3461 ("net: Add sysctl to toggle early demux for tcp and udp")
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

25 Mar, 2017

1 commit

  • Certain systems process a significant unconnected UDP workload.
    It would be preferable to disable UDP early demux for those systems
    and enable it for TCP only.

    By disabling UDP demux, we see these slight gains on an ARM64 system:
    782 -> 788Mbps unconnected single stream UDPv4
    633 -> 654Mbps unconnected UDPv4 different sources

    The performance impact can change based on CPU architecture and cache
    sizes. There will not be much difference if the entire UDP hash table
    is in cache.

    Both sysctls are enabled by default to preserve existing behavior.

    v1->v2: Change function pointer instead of adding conditional as
    suggested by Stephen.

    v2->v3: Read once in callers to avoid issues due to compiler
    optimizations. Also update commit message with the tests.

    v3->v4: Store and use read once result instead of querying pointer
    again incorrectly.

    v4->v5: Refactor to avoid errors due to compilation with IPV6={m,n}

    Signed-off-by: Subash Abhinov Kasiviswanathan
    Suggested-by: Eric Dumazet
    Cc: Stephen Hemminger
    Cc: Tom Herbert
    Cc: David Miller
    Signed-off-by: David S. Miller

    subashab@codeaurora.org
     

22 Mar, 2017

1 commit

  • This patch adds support for ECMP hash policy choice via a new sysctl
    called fib_multipath_hash_policy and also adds support for L4 hashes.
    The current values for fib_multipath_hash_policy are:
    0 - layer 3 (default)
    1 - layer 4
    If there's an skb hash already set and it matches the chosen policy then it
    will be used instead of being calculated (currently only for L4).
    In L3 mode we always calculate the hash due to the ICMP error special
    case. The flow dissector's field consistentification should handle the
    address order, so we can remove the address reversals.
    If the skb is provided we always use it for the hash calculation,
    otherwise we fall back to fl4; that is, if skb is NULL, fl4 has to be set.

    Signed-off-by: Nikolay Aleksandrov
    Signed-off-by: David S. Miller

    Nikolay Aleksandrov
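
The difference between the two policy values can be sketched as a choice of which header fields feed the hash. This is an illustrative model only: `zlib.crc32` stands in for the kernel's hash function, the modulo selection ignores nexthop weights, and the function names are hypothetical.

```python
import zlib


def multipath_hash(policy, src, dst, proto=0, sport=0, dport=0):
    """Return a 32-bit flow hash; crc32 is a stand-in, not the kernel's hash."""
    if policy == 0:
        # Layer 3: only the addresses contribute.
        fields = (src, dst)
    elif policy == 1:
        # Layer 4: protocol and ports contribute as well.
        fields = (src, dst, proto, sport, dport)
    else:
        raise ValueError("unknown hash policy")
    return zlib.crc32(repr(fields).encode())


def pick_nexthop(nexthops, flow_hash):
    """Simplified, unweighted selection among equal-cost nexthops."""
    return nexthops[flow_hash % len(nexthops)]
```

Under policy 0, two connections between the same pair of hosts always hash to the same value and therefore the same path. Under policy 1, a different source port changes the hash input, so the flows can be spread over different nexthops.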
     

17 Mar, 2017

1 commit

  • The tcp_tw_recycle was already broken for connections
    behind NAT, since the per-destination timestamp is not
    monotonically increasing for multiple machines behind
    a single destination address.

    After the randomization of TCP timestamp offsets
    in commit 8a5bd45f6616 (tcp: randomize tcp timestamp offsets
    for each connection), tcp_tw_recycle is broken for all
    types of connections for the same reason: the timestamps
    received from a single machine are no longer monotonically
    increasing.

    Remove tcp_tw_recycle, since it is not functional. Also, remove
    the PAWSPassive SNMP counter since it is only used for
    tcp_tw_recycle, and simplify tcp_v4_route_req and tcp_v6_route_req
    since the strict argument is only set when tcp_tw_recycle is
    enabled.

    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: Eric Dumazet
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Cc: Lutz Vieweg
    Cc: Florian Westphal
    Signed-off-by: David S. Miller

    Soheil Hassas Yeganeh
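
Why a per-destination timestamp check breaks can be shown with a toy model. The class below is a hypothetical simplification: it keeps one last-seen timestamp per remote address and ignores timestamp wraparound and every other detail of the real check.

```python
class PerDestinationTimestampCheck:
    """Toy model: one last-seen timestamp per remote address."""

    def __init__(self):
        self.last_seen = {}

    def accept(self, addr, tsval):
        """Reject a connection whose timestamp is older than the last one seen."""
        last = self.last_seen.get(addr)
        if last is not None and tsval < last:
            return False
        self.last_seen[addr] = tsval
        return True
```

Two machines behind one NAT address, or one machine using a random offset per connection, both present timestamps that jump backwards from the point of view of this check, so legitimate connections are rejected.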
     

31 Jan, 2017

1 commit

  • Packets arriving in a VRF currently are delivered to UDP sockets that
    aren't bound to any interface. TCP defaults to not delivering packets
    arriving in a VRF to unbound sockets. IP route lookup and socket
    transmit both assume that unbound means using the default table and
    UDP applications that haven't been changed to be aware of VRFs may not
    function correctly in this case since they may not be able to handle
    overlapping IP address ranges, or be able to send packets back to the
    original sender if required.

    So add a sysctl, udp_l3mdev_accept, to control this behaviour, with it
    being analogous to the existing tcp_l3mdev_accept, namely to allow a
    process to have a VRF-global listen socket. Have this default to off
    as this is the behaviour that users will expect, given that there is
    no explicit mechanism to set an unmodified VRF-unaware application
    into a default VRF.

    Signed-off-by: Robert Shearman
    Acked-by: David Ahern
    Tested-by: David Ahern
    Signed-off-by: David S. Miller

    Robert Shearman
     

25 Jan, 2017

1 commit

  • Add net.ipv4.ip_unprivileged_port_start, which is a per namespace sysctl
    that denotes the first unprivileged inet port in the namespace. To
    disable all privileged ports set this to zero. It also checks for
    overlap with the local port range. The privileged and local range may
    not overlap.

    The use case for this change is to allow containerized processes to bind
    to privileged ports, but prevent them from ever being allowed to modify
    their container's network configuration. The latter is accomplished by
    ensuring that the network namespace is not a child of the user
    namespace. This modification was needed to allow the container manager
    to disable a namespace's privileged port restrictions without exposing
    control of the network namespace to processes in the user namespace.

    Signed-off-by: Krister Johansen
    Signed-off-by: David S. Miller

    Krister Johansen
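
The two rules in this commit, the privilege test and the overlap check against the local port range, can be sketched as follows. The function names are hypothetical, and the default of 1024 and the example local range are assumptions rather than values stated in the commit message.

```python
def is_privileged(port, unprivileged_start=1024):
    """A port is privileged if it lies below the first unprivileged port."""
    return port < unprivileged_start


def validate_unprivileged_start(start, local_port_range):
    """Return True if the privileged range [0, start) avoids the local range."""
    low, _high = local_port_range
    if not 0 <= start <= 65535:
        return False
    # Overlap exists when the privileged range reaches into the local range.
    return start <= low
```

Setting the value to zero leaves no privileged ports at all, so `is_privileged(80, 0)` is false, while a value above the lower bound of the local port range is rejected by the overlap check.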
     

14 Jan, 2017

1 commit

  • Thin stream DUPACK is to start fast recovery on only one DUPACK
    provided the connection is a thin stream (i.e., low inflight). But
    this older feature is now subsumed by RACK. If a connection
    receives only a single DUPACK, RACK would arm a reordering timer
    and soon start fast recovery instead of timing out if no further
    ACKs are received.

    The socket option (THIN_DUPACK) is kept as a nop for compatibility.
    Note that this patch does not change another thin-stream feature
    which enables linear RTO, although it might be good to generalize
    that in the future (i.e., linear RTO for the first say 3 retries).

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

12 Jan, 2017

1 commit


10 Jan, 2017

1 commit

  • > cat /proc/sys/net/ipv4/tcp_notsent_lowat
    -1
    > echo 4294967295 > /proc/sys/net/ipv4/tcp_notsent_lowat
    -bash: echo: write error: Invalid argument
    > echo -2147483648 > /proc/sys/net/ipv4/tcp_notsent_lowat
    > cat /proc/sys/net/ipv4/tcp_notsent_lowat
    -2147483648

    but in documentation we have "tcp_notsent_lowat - UNSIGNED INTEGER"

    v2: simplify to just proc_douintvec
    Signed-off-by: Pavel Tikhomirov
    Signed-off-by: David S. Miller

    Pavel Tikhomirov
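
The shell session above is what one would expect when an unsigned 32-bit value is read and written through a signed handler. The sketch below models both parsers in user space; the function names are hypothetical, and it assumes the stored default is the maximum unsigned 32-bit value, which a signed reader prints as -1.

```python
import struct

INT_MIN, INT_MAX = -2**31, 2**31 - 1
UINT_MAX = 2**32 - 1


def parse_signed(text):
    """Accept only values that fit a signed 32-bit integer."""
    value = int(text)
    if not INT_MIN <= value <= INT_MAX:
        raise ValueError("Invalid argument")
    return value


def parse_unsigned(text):
    """Accept only values that fit an unsigned 32-bit integer."""
    value = int(text)
    if not 0 <= value <= UINT_MAX:
        raise ValueError("Invalid argument")
    return value


def show_as_signed(value):
    """Reinterpret an unsigned 32-bit value the way a signed reader prints it."""
    return struct.unpack("<i", struct.pack("<I", value))[0]
```

With the signed parser, 4294967295 is rejected and -2147483648 is accepted, matching the session. With the unsigned parser the outcome is reversed, which agrees with the documented type.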
     

30 Dec, 2016

2 commits


28 Dec, 2016

1 commit

  • Different namespaces might have different requirements for reusing
    TIME-WAIT sockets for new connections. This can be needed when
    applications running in a namespace require the number of TIME_WAIT
    sockets to be reduced independently of the host.

    Signed-off-by: Haishuang Yan
    Signed-off-by: David S. Miller

    Haishuang Yan
     

23 Oct, 2016

1 commit

  • This reverts commit a681574c99be23e4d20b769bf0e543239c364af5
    ("ipv4: disable BH in set_ping_group_range()") because we never
    read ping_group_range in BH context (unlike local_port_range).

    Then, since we already have a lock for ping_group_range, those
    using ip_local_ports.lock for ping_group_range are clearly typos.

    We might consider sharing the same lock for both ping_group_range
    and local_port_range to save space, but that should be for
    net-next.

    Fixes: a681574c99be ("ipv4: disable BH in set_ping_group_range()")
    Fixes: ba6b918ab234 ("ping: move ping_group_range out of CONFIG_SYSCTL")
    Cc: Eric Dumazet
    Cc: Eric Salo
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    WANG Cong
     

21 Oct, 2016

1 commit

  • In commit 4ee3bd4a8c746 ("ipv4: disable BH when changing ip local port
    range") Cong added BH protection in set_local_port_range() but missed
    that the same fix was needed in set_ping_group_range().

    Fixes: b8f1a55639e6 ("udp: Add function to make source port for UDP tunnels")
    Signed-off-by: Eric Dumazet
    Reported-by: Eric Salo
    Signed-off-by: David S. Miller

    Eric Dumazet
     

24 May, 2016

1 commit

  • Commit fa50d974d104 ("ipv4: Namespaceify ip_default_ttl sysctl knob")
    moves the default TTL assignment, and as side-effect IPv4 TTL now
    has a default value only if sysctl support is enabled (CONFIG_SYSCTL=y).

    The sysctl_ip_default_ttl is fundamental for IP to work properly,
    as it provides the TTL to be used as default. The default TTL may be
    used in ip_select_ttl, through the following flow:

    ip_select_ttl
    ip4_dst_hoplimit
    net->ipv4.sysctl_ip_default_ttl

    This commit fixes the issue by assigning net->ipv4.sysctl_ip_default_ttl
    in net_init_net, called during ipv4's initialization.

    Without this commit, a kernel built without sysctl support will send
    all IP packets with zero TTL (unless a TTL is explicitly set, e.g.
    with setsockopt).

    Given a similar issue might appear on the other knobs that were
    namespaceified, this commit also moves them.

    Fixes: fa50d974d104 ("ipv4: Namespaceify ip_default_ttl sysctl knob")
    Signed-off-by: Ezequiel Garcia
    Signed-off-by: David S. Miller

    Ezequiel Garcia
     

12 Apr, 2016

1 commit

  • Multipath route lookups should consider knowledge about next hops and not
    select a hop that is known to be failed.

    Example:

                      [h2]                   [h3]   15.0.0.5
                       |                      |
                      3|                     3|
                     [SP1]                  [SP2]--+
                      1  2                   1     2
                      |  |     /-------------+     |
                      |   \   /                    |
                      |     X                      |
                      |    / \                     |
                      |   /   \---------------\    |
                      1  2                     1   2
          12.0.0.2  [TOR1] 3-----------------3 [TOR2]  12.0.0.3
                      4                         4
                       \                       /
                         \                    /
                          \                  /
                           -------|   |-----/
                                  1   2
                                 [TOR3]
                                   3|
                                    |
                                   [h1]  12.0.0.1

    host h1 with IP 12.0.0.1 has 2 paths to host h3 at 15.0.0.5:

    root@h1:~# ip ro ls
    ...
    12.0.0.0/24 dev swp1 proto kernel scope link src 12.0.0.1
    15.0.0.0/16
    nexthop via 12.0.0.2 dev swp1 weight 1
    nexthop via 12.0.0.3 dev swp1 weight 1
    ...

    If the link between tor3 and tor1 is down, and so is the link between
    tor1 and tor2, then tor1 is effectively cut off from h1. Yet the route lookups
    in h1 are alternating between the 2 routes: ping 15.0.0.5 gets one and
    ssh 15.0.0.5 gets the other. Connections that attempt to use the
    12.0.0.2 nexthop fail since that neighbor is not reachable:

    root@h1:~# ip neigh show
    ...
    12.0.0.3 dev swp1 lladdr 00:02:00:00:00:1b REACHABLE
    12.0.0.2 dev swp1 FAILED
    ...

    The failed path can be avoided by considering known neighbor information
    when selecting next hops. If the neighbor lookup fails we have no
    knowledge about the nexthop, so give it a shot. If there is an entry
    then only select the nexthop if the state is sane. This is similar to
    what fib_detect_death does.

    To maintain backward compatibility use of the neighbor information is
    based on a new sysctl, fib_multipath_use_neigh.

    Signed-off-by: David Ahern
    Reviewed-by: Julian Anastasov
    Signed-off-by: David S. Miller

    David Ahern
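
The selection rule described above (try a nexthop with no neighbor entry, skip one whose entry is in a bad state) can be sketched in user space. This is a simplified model: the set of states treated as "sane", the fallback when no nexthop qualifies, and the function names are assumptions, and nexthop weights are ignored.

```python
# Neighbor states treated as usable in this sketch (assumption).
VALID_STATES = {"PERMANENT", "NOARP", "REACHABLE", "PROBE", "STALE", "DELAY"}


def nexthop_usable(gateway, neigh_table):
    """No entry means no knowledge, so the nexthop is given a shot."""
    state = neigh_table.get(gateway)
    if state is None:
        return True
    return state in VALID_STATES


def select_nexthop(nexthops, neigh_table, flow_hash, use_neigh=True):
    """Pick a nexthop by hash, optionally skipping ones with a bad neighbor."""
    start = flow_hash % len(nexthops)
    if not use_neigh:
        return nexthops[start]
    for i in range(len(nexthops)):
        candidate = nexthops[(start + i) % len(nexthops)]
        if nexthop_usable(candidate, neigh_table):
            return candidate
    # Nothing looks usable: fall back to the hash-selected nexthop.
    return nexthops[start]
```

With the neighbor table from the example, where 12.0.0.2 is FAILED and 12.0.0.3 is REACHABLE, every flow resolves to 12.0.0.3 when neighbor information is used. With `use_neigh=False` the model keeps the old behaviour and some flows still land on the failed nexthop.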
     

17 Feb, 2016

3 commits


11 Feb, 2016

4 commits


08 Feb, 2016

8 commits