04 Nov, 2017

1 commit


03 Nov, 2017

1 commit

  • icsk_accept_queue.fastopenq.lock is only fully initialized at listen()
    time.

    lockdep is not happy if we attempt a spin_lock_bh() on it, because
    of the missing annotation. (Although the kernel runs just fine.)

    Let's use net->ipv4.tcp_fastopen_ctx_lock to protect ctx access.

    Fixes: 1fba70e5b6be ("tcp: socket option to set TCP fast open key")
    Signed-off-by: Eric Dumazet
    Cc: Yuchung Cheng
    Cc: Christoph Paasch
    Reviewed-by: Christoph Paasch
    Signed-off-by: David S. Miller

    Eric Dumazet
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boilerplate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and where references to the
    license had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX license identifier should be applied
    to a file was done in a spreadsheet of side-by-side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet and did an initial spot review of a few thousand files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file-by-file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    should be applied to the file. She confirmed any determination that was
    not immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source.
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
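    For illustration, the identifier referred to above is a single comment on
    the first line of a source file, replacing the multi-paragraph license
    boilerplate; for a default-licensed kernel .c file it looks like:

    ```
    // SPDX-License-Identifier: GPL-2.0
    ```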
     

24 Oct, 2017

1 commit

  • We already allow enabling TFO without a cookie by using the
    fastopen sysctl and setting it to TFO_SERVER_COOKIE_NOT_REQD (or
    TFO_CLIENT_NO_COOKIE).
    This is safe to do in certain environments where we know that there
    isn't a malicious host (e.g., data centers), or when the
    application protocol already provides an authentication mechanism in the
    first flight of data.

    A server however might be providing multiple services or talking to both
    sides (public Internet and data-center). So, this server would want to
    enable cookie-less TFO for certain services and/or for connections that
    go to the data-center.

    This patch exposes a socket-option and a per-route attribute to enable such
    fine-grained configurations.

    Signed-off-by: Christoph Paasch
    Reviewed-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Christoph Paasch
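    A sketch of the resulting configuration surface (the route attribute and
    socket option names are as introduced by this patch; the prefix, gateway,
    and file descriptor are illustrative):

    ```
    # per route: allow cookie-less TFO only toward the trusted
    # data-center prefix; the public side still requires cookies
    ip route add 10.0.0.0/8 via 10.0.0.1 fastopen_no_cookie 1

    # per socket, before listen()/connect():
    #   int one = 1;
    #   setsockopt(fd, IPPROTO_TCP, TCP_FASTOPEN_NO_COOKIE, &one, sizeof(one));
    ```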
     

20 Oct, 2017

1 commit

  • New socket option TCP_FASTOPEN_KEY to allow different keys per
    listener. The listener by default uses the global key until the
    socket option is set. The key is 16 bytes of binary data. This
    option has no effect on regular non-listener TCP sockets.

    Signed-off-by: Yuchung Cheng
    Reviewed-by: Eric Dumazet
    Reviewed-by: Christoph Paasch
    Signed-off-by: David S. Miller

    Yuchung Cheng
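    A minimal userspace sketch of setting a per-listener key. The fixed key
    bytes are purely illustrative, and the fallback option number is an
    assumption for older headers (TCP_FASTOPEN_KEY ships in linux/tcp.h from
    4.14 on):

    ```c
    #include <stdio.h>
    #include <stdint.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    #ifndef TCP_FASTOPEN_KEY
    #define TCP_FASTOPEN_KEY 33   /* value from linux/tcp.h, for older headers */
    #endif

    int main(void)
    {
        /* a fixed 16-byte example key; never hardcode a real key */
        uint8_t key[16] = {
            0x00, 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77,
            0x88, 0x99, 0xaa, 0xbb, 0xcc, 0xdd, 0xee, 0xff,
        };

        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) {
            perror("socket");
            return 1;
        }

        /* per-listener key, overriding the global tcp_fastopen_key;
         * reports an error here on pre-4.14 kernels */
        if (setsockopt(fd, IPPROTO_TCP, TCP_FASTOPEN_KEY, key, sizeof(key)) != 0)
            perror("setsockopt(TCP_FASTOPEN_KEY)");

        /* dump the key bytes grouped by four, similar in spirit to the
         * /proc/sys/net/ipv4/tcp_fastopen_key display (which prints u32
         * words, so its byte order may differ) */
        for (int i = 0; i < 16; i++)
            printf("%s%02x", (i && i % 4 == 0) ? "-" : "", key[i]);
        printf("\n");
        return 0;
    }
    ```

    The option would normally be set on the listening socket before
    listen(); per the commit above, non-listener sockets are unaffected.
    
    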
     

07 Oct, 2017

1 commit


06 Oct, 2017

1 commit

  • Currently in the TCP code, the initialization sequence for cached
    metrics, congestion control, BPF, etc., after a successful connection
    is very inconsistent. This introduces inconsistent behavior and is
    prone to bugs. The current call sequence is as follows:

    (1) for active case (tcp_finish_connect() case):
    tcp_mtup_init(sk);
    icsk->icsk_af_ops->rebuild_header(sk);
    tcp_init_metrics(sk);
    tcp_call_bpf(sk, BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB);
    tcp_init_congestion_control(sk);
    tcp_init_buffer_space(sk);

    (2) for passive case (tcp_rcv_state_process() TCP_SYN_RECV case):
    icsk->icsk_af_ops->rebuild_header(sk);
    tcp_call_bpf(sk, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
    tcp_init_congestion_control(sk);
    tcp_mtup_init(sk);
    tcp_init_buffer_space(sk);
    tcp_init_metrics(sk);

    (3) for TFO passive case (tcp_fastopen_create_child()):
    inet_csk(child)->icsk_af_ops->rebuild_header(child);
    tcp_init_congestion_control(child);
    tcp_mtup_init(child);
    tcp_init_metrics(child);
    tcp_call_bpf(child, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
    tcp_init_buffer_space(child);

    This commit unifies the above code paths to use the following sequence:
    tcp_mtup_init(sk);
    icsk->icsk_af_ops->rebuild_header(sk);
    tcp_init_metrics(sk);
    tcp_call_bpf(sk, BPF_SOCK_OPS_ACTIVE/PASSIVE_ESTABLISHED_CB);
    tcp_init_congestion_control(sk);
    tcp_init_buffer_space(sk);
    This sequence is the same as the (1) active case. We pick this sequence
    because this order correctly allows BPF to override the settings
    including congestion control module and initial cwnd, etc from
    the route, and then allows the CC module to see those settings.

    Suggested-by: Neal Cardwell
    Tested-by: Neal Cardwell
    Signed-off-by: Wei Wang
    Acked-by: Neal Cardwell
    Acked-by: Yuchung Cheng
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Wei Wang
     

02 Oct, 2017

4 commits

  • Applications in different namespaces might require different time
    periods (in seconds) for disabling Fastopen on active TCP sockets.

    Tested:
    Simulated the following situation, in which the server's data gets
    dropped after the 3WHS:
    C ---- syn-data ---> S
    C <--- syn/ack ----- S
    C ---- ack --------> S
    S (accept & write)
    C?  X <- data ------ S
        [retry and timeout]
    Signed-off-by: David S. Miller

    Haishuang Yan
     
  • Applications in different namespaces might require different
    tcp_fastopen_key values, independently of the host.

    David Miller pointed out there is a leak without releasing the context
    of tcp_fastopen_key during netns teardown. So add the release action in
    exit_batch path.

    Tested:
    1. Container namespace:
    # cat /proc/sys/net/ipv4/tcp_fastopen_key:
    2817fff2-f803cf97-eadfd1f3-78c0992b

    cookie key in tcp syn packets:
    Fast Open Cookie
    Kind: TCP Fast Open Cookie (34)
    Length: 10
    Fast Open Cookie: 1e5dd82a8c492ca9

    2. Host:
    # cat /proc/sys/net/ipv4/tcp_fastopen_key:
    107d7c5f-68eb2ac7-02fb06e6-ed341702

    cookie key in tcp syn packets:
    Fast Open Cookie
    Kind: TCP Fast Open Cookie (34)
    Length: 10
    Fast Open Cookie: e213c02bf0afbc8a

    Signed-off-by: Haishuang Yan
    Signed-off-by: David S. Miller

    Haishuang Yan
     
  • The 'publish' logic is not necessary after commit dfea2aa65424 ("tcp:
    Do not call tcp_fastopen_reset_cipher from interrupt context"), because
    tcp_fastopen_cookie_gen no longer calls tcp_fastopen_init_key_once.

    Signed-off-by: Haishuang Yan
    Signed-off-by: David S. Miller

    Haishuang Yan
     
  • Applications in different namespaces might require enabling the TCP
    Fast Open feature independently of the host.

    This patch series continues making more of the TCP Fast Open related
    sysctl knobs per net-namespace.

    Reported-by: Luca BRUNO
    Signed-off-by: Haishuang Yan
    Signed-off-by: David S. Miller

    Haishuang Yan
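    A sketch of exercising the now per-namespace knobs (the namespace name
    and values are illustrative; requires CAP_NET_ADMIN):

    ```
    ip netns add tfo-test
    # each namespace now carries its own copies of these knobs
    ip netns exec tfo-test sysctl -w net.ipv4.tcp_fastopen=3
    ip netns exec tfo-test cat /proc/sys/net/ipv4/tcp_fastopen_key
    # the host's values are unaffected
    sysctl net.ipv4.tcp_fastopen
    ```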
     

23 Aug, 2017

1 commit


02 Jul, 2017

1 commit

  • Added callbacks to BPF SOCK_OPS type programs before an active
    connection is initialized and after a passive or active connection is
    established.

    The following patch demonstrates how they can be used to set send and
    receive buffer sizes.

    Signed-off-by: Lawrence Brakmo
    Signed-off-by: David S. Miller

    Lawrence Brakmo
     

01 Jul, 2017

1 commit

  • The refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This allows us to avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.

    This patch uses refcount_inc_not_zero() instead of
    atomic_inc_not_zero_hint() due to the absence of a _hint()
    version of the refcount API. If the _hint() version must
    be used, we might need to revisit the API.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
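    A userspace sketch of the inc-not-zero semantics the commit relies on,
    using C11 atomics. This is a toy version: the kernel's refcount_t
    additionally saturates near overflow, which is omitted here.

    ```c
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* increment the counter only if it is non-zero, so a racing reader
     * can never resurrect an object whose last reference was dropped */
    static bool refcount_inc_not_zero(atomic_uint *r)
    {
        unsigned int old = atomic_load(r);
        do {
            if (old == 0)
                return false;   /* object already dead */
        } while (!atomic_compare_exchange_weak(r, &old, old + 1));
        return true;
    }

    int main(void)
    {
        atomic_uint live = 2, dead = 0;

        bool got_live = refcount_inc_not_zero(&live);
        printf("live: %d -> %u\n", got_live, (unsigned)atomic_load(&live));

        bool got_dead = refcount_inc_not_zero(&dead);
        printf("dead: %d -> %u\n", got_dead, (unsigned)atomic_load(&dead));
        return 0;
    }
    ```
    
    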
     

25 Apr, 2017

2 commits

  • This counter records the number of times the firewall blackhole issue is
    detected and active TFO is disabled.

    Signed-off-by: Wei Wang
    Acked-by: Yuchung Cheng
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Wei Wang
     
  • Middlebox firewall issues can potentially cause a server's data to be
    blackholed after a successful 3WHS using TFO. Following are the related
    reports from Apple:
    https://www.nanog.org/sites/default/files/Paasch_Network_Support.pdf
    Slide 31 identifies an issue where the client ACK to the server's data
    sent during a TFO'd handshake is dropped:
    C ---> syn-data ---> S
    C <--- syn/ack ----- S
    C ---> ack ---> X    S
              [retry and timeout]

    https://www.ietf.org/proceedings/94/slides/slides-94-tcpm-13.pdf
    Slide 5 shows a similar situation, in which the server's data gets
    dropped after the 3WHS:
    C ---- syn-data ---> S
    C <--- syn/ack ----- S
    C ---- ack --------> S
    S (accept & write)
    C?  X <- data ------ S
        [retry and timeout]
    Acked-by: Yuchung Cheng
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Wei Wang
     

28 Jan, 2017

1 commit


26 Jan, 2017

2 commits

  • This patch adds a new socket option, TCP_FASTOPEN_CONNECT, as an
    alternative way to perform Fast Open on the active side (client). Prior
    to this patch, a client needs to replace the connect() call with
    sendto(MSG_FASTOPEN). This can be cumbersome for applications that want
    to use Fast Open: these socket operations are often done in lower-layer
    libraries used by many other applications. Changing these libraries
    and/or the socket call sequences is not trivial. A more convenient
    approach is to perform Fast Open by simply enabling a socket option when
    the socket is created, without changing the rest of the socket call
    sequence:
    s = socket()
        // create a new socket
    setsockopt(s, IPPROTO_TCP, TCP_FASTOPEN_CONNECT …);
        // newly introduced sockopt
        If set, the new functionality described below will be used.
        Returns ENOTSUPP if TFO is not supported or not enabled in the
        kernel.

    connect()
    With cookie present, return 0 immediately.
    With no cookie, initiate 3WHS with TFO cookie-request option and
    return -1 with errno = EINPROGRESS.

    write()/sendmsg()
    With cookie present, send out SYN with data and return the number of
    bytes buffered.
    With no cookie, and 3WHS not yet completed, return -1 with errno =
    EINPROGRESS.
    No MSG_FASTOPEN flag is needed.

    read()
    Return -1 with errno = EWOULDBLOCK/EAGAIN if connect() is called but
    write() is not called yet.
    Return -1 with errno = EWOULDBLOCK/EAGAIN if connection is
    established but no msg is received yet.
    Return number of bytes read if socket is established and there is
    msg received.

    The new API simplifies life for applications that always perform a write()
    immediately after a successful connect(). Such applications can now take
    advantage of Fast Open by merely making one new setsockopt() call at the time
    of creating the socket. Nothing else about the application's socket call
    sequence needs to change.

    Signed-off-by: Wei Wang
    Acked-by: Eric Dumazet
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Wei Wang
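    A minimal sketch of the client-side setup described above. The fallback
    option number is from linux/tcp.h and the option needs a 4.11+ kernel;
    this program only probes for support and does not actually connect:

    ```c
    #include <stdio.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    #ifndef TCP_FASTOPEN_CONNECT
    #define TCP_FASTOPEN_CONNECT 30   /* value from linux/tcp.h, for older headers */
    #endif

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) {
            perror("socket");
            return 1;
        }

        /* after this succeeds, connect() can return immediately and the
         * first write()/sendmsg() carries the SYN (with data once a
         * cookie is cached); no MSG_FASTOPEN flag is needed */
        int one = 1;
        int rc = setsockopt(fd, IPPROTO_TCP, TCP_FASTOPEN_CONNECT,
                            &one, sizeof(one));
        printf("TCP_FASTOPEN_CONNECT supported: %s\n", rc == 0 ? "yes" : "no");
        return 0;
    }
    ```
    
    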
     
  • Refactor the cookie check logic in tcp_send_syn_data() into a function.
    This function will be called elsewhere in later changes.

    Signed-off-by: Wei Wang
    Acked-by: Eric Dumazet
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Wei Wang
     

20 Jan, 2017

1 commit

  • Found that if we run LTP netstress test with large MSS (65K),
    the first attempt from server to send data comparable to this
    MSS on fastopen connection will be delayed by the probe timer.

    Here is an example:

    < S seq 0:0 win 43690 options [mss 65495 wscale 7 tfo cookie] length 32
    > S. seq 0:0 ack 1 win 43690 options [mss 65495 wscale 7] length 0
    < . ack 1 win 342 length 0

    Inside tcp_sendmsg(), tcp_send_mss() returns the max MSS in 'mss_now',
    as well as in 'size_goal'. This results in the segment not being queued
    for transmission until all the data is copied from the user buffer. Then,
    inside __tcp_push_pending_frames(), it breaks on the send window test and
    continues with the probe timer check.

    Fragmentation occurs in tcp_write_wakeup()...

    +0.2 > P. seq 1:43777 ack 1 win 342 length 43776
    < . ack 43777, win 1365 length 0
    > P. seq 43777:65001 ack 1 win 342 options [...] length 21224
    ...

    This also contradicts the fact that we should be bounded by half
    of the window if it is large.

    Fix this flaw by correctly initializing max_window. Before that, it
    could have large values that affect further calculations of 'size_goal'.

    Fixes: 168a8f58059a ("tcp: TCP Fast Open Server - main code path")
    Signed-off-by: Alexey Kodanev
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Alexey Kodanev
     

14 Jan, 2017

1 commit

  • Fix up a data alignment issue on sparc by swapping the order
    of the cookie byte array field with the length field in
    struct tcp_fastopen_cookie, and making it a proper union
    to clean up the typecasting.

    This addresses log complaints like these:
    log_unaligned: 113 callbacks suppressed
    Kernel unaligned access at TPC[976490] tcp_try_fastopen+0x2d0/0x360
    Kernel unaligned access at TPC[9764ac] tcp_try_fastopen+0x2ec/0x360
    Kernel unaligned access at TPC[9764c8] tcp_try_fastopen+0x308/0x360
    Kernel unaligned access at TPC[9764e4] tcp_try_fastopen+0x324/0x360
    Kernel unaligned access at TPC[976490] tcp_try_fastopen+0x2d0/0x360

    Cc: Eric Dumazet
    Signed-off-by: Shannon Nelson
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Shannon Nelson
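    A userspace sketch of the reordering trick (the field and constant names
    mirror the kernel's, but this replica is illustrative): placing the byte
    array first, inside a union with u64 words, gives it natural 8-byte
    alignment, so word-sized accesses on sparc no longer trap.

    ```c
    #include <stdalign.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    #define TCP_FASTOPEN_COOKIE_MAX 16

    /* replica of the reordered struct tcp_fastopen_cookie layout */
    struct tfo_cookie {
        union {
            uint8_t  val[TCP_FASTOPEN_COOKIE_MAX];
            uint64_t key[TCP_FASTOPEN_COOKIE_MAX / sizeof(uint64_t)];
        };
        int8_t len;
        uint8_t exp;    /* experimental-option flag */
    };

    int main(void)
    {
        /* the cookie bytes now sit at offset 0 with 8-byte alignment */
        printf("offsetof(val) = %zu\n", offsetof(struct tfo_cookie, val));
        printf("alignof = %zu\n", alignof(struct tfo_cookie));
        return 0;
    }
    ```
    
    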
     

09 Sep, 2016

1 commit

  • When DATA and/or FIN are carried in a SYN/ACK message or SYN message,
    we append an skb in socket receive queue, but we forget to call
    sk_forced_mem_schedule().

    Effect is that the socket has a negative sk->sk_forward_alloc as long as
    the message is not read by the application.

    Josh Hunt fixed a similar issue in commit d22e15371811 ("tcp: fix tcp
    fin memory accounting")

    Fixes: 168a8f58059a ("tcp: TCP Fast Open Server - main code path")
    Signed-off-by: Eric Dumazet
    Reviewed-by: Josh Hunt
    Signed-off-by: David S. Miller

    Eric Dumazet
     

02 Sep, 2016

1 commit

  • Yuchung noticed that on the first TFO server data packet sent after
    the (TFO) handshake, the server echoed the TCP timestamp value in the
    SYN/data instead of the timestamp value in the final ACK of the
    handshake. This problem did not happen on regular opens.

    The tcp_replace_ts_recent() logic that decides whether to remember an
    incoming TS value needs tp->rcv_wup to hold the latest receive
    sequence number that we have ACKed (latest tp->rcv_nxt we have
    ACKed). This commit fixes this issue by ensuring that a TFO server
    properly updates tp->rcv_wup to match tp->rcv_nxt at the time it sends
    a SYN/ACK for the SYN/data.

    Reported-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Signed-off-by: Soheil Hassas Yeganeh
    Fixes: 168a8f58059a ("tcp: TCP Fast Open Server - main code path")
    Signed-off-by: David S. Miller

    Neal Cardwell
     

03 May, 2016

1 commit

  • We want to make the TCP stack preemptible, as draining the prequeue
    and backlog queues can take a lot of time.

    Many SNMP updates were assuming that BH (and preemption) was disabled.

    We need to convert some __NET_INC_STATS() calls to NET_INC_STATS()
    and some __TCP_INC_STATS() to TCP_INC_STATS().

    Before using this_cpu_ptr(net->ipv4.tcp_sk) in tcp_v4_send_reset()
    and tcp_v4_send_ack(), we add an explicit preempt disabled section.

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Eric Dumazet
     

28 Apr, 2016

1 commit


20 Mar, 2016

1 commit

  • Pull networking updates from David Miller:
    "Highlights:

    1) Support more Realtek wireless chips, from Jes Sorenson.

    2) New BPF types for per-cpu hash and array maps, from Alexei
    Starovoitov.

    3) Make several TCP sysctls per-namespace, from Nikolay Borisov.

    4) Allow the use of SO_REUSEPORT in order to do per-thread processing
    of incoming TCP/UDP connections. The muxing can be done using a
    BPF program which hashes the incoming packet. From Craig Gallek.

    5) Add a multiplexer for TCP streams, to provide a messaged based
    interface. BPF programs can be used to determine the message
    boundaries. From Tom Herbert.

    6) Add 802.1AE MACSEC support, from Sabrina Dubroca.

    7) Avoid factorial complexity when taking down an inetdev interface
    with lots of configured addresses. We were doing things like
    traversing the entire address list for each address removed, and
    flushing the entire netfilter conntrack table for every address as
    well.

    8) Add and use SKB bulk free infrastructure, from Jesper Brouer.

    9) Allow offloading u32 classifiers to hardware, and implement for
    ixgbe, from John Fastabend.

    10) Allow configuring IRQ coalescing parameters on a per-queue basis,
    from Kan Liang.

    11) Extend ethtool so that larger link mode masks can be supported.
    From David Decotigny.

    12) Introduce devlink, which can be used to configure port link types
    (ethernet vs Infiniband, etc.), port splitting, and switch device
    level attributes as a whole. From Jiri Pirko.

    13) Hardware offload support for flower classifiers, from Amir Vadai.

    14) Add "Local Checksum Offload". Basically, for a tunneled packet
    the checksum of the outer header is 'constant' (because with the
    checksum field filled into the inner protocol header, the payload
    of the outer frame checksums to 'zero'), and we can take advantage
    of that in various ways. From Edward Cree"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1548 commits)
    bonding: fix bond_get_stats()
    net: bcmgenet: fix dma api length mismatch
    net/mlx4_core: Fix backward compatibility on VFs
    phy: mdio-thunder: Fix some Kconfig typos
    lan78xx: add ndo_get_stats64
    lan78xx: handle statistics counter rollover
    RDS: TCP: Remove unused constant
    RDS: TCP: Add sysctl tunables for sndbuf/rcvbuf on rds-tcp socket
    net: smc911x: convert pxa dma to dmaengine
    team: remove duplicate set of flag IFF_MULTICAST
    bonding: remove duplicate set of flag IFF_MULTICAST
    net: fix a comment typo
    ethernet: micrel: fix some error codes
    ip_tunnels, bpf: define IP_TUNNEL_OPTS_MAX and use it
    bpf, dst: add and use dst_tclassid helper
    bpf: make skb->tc_classid also readable
    net: mvneta: bm: clarify dependencies
    cls_bpf: reset class and reuse major in da
    ldmvsw: Checkpatch sunvnet.c and sunvnet_common.c
    ldmvsw: Add ldmvsw.c driver code
    ...

    Linus Torvalds
     

15 Mar, 2016

1 commit

  • Per RFC4898, they count segments sent/received
    containing a positive length data segment (that includes
    retransmission segments carrying data). Unlike
    tcpi_segs_out/in, tcpi_data_segs_out/in excludes segments
    carrying no data (e.g. pure ack).

    The patch also updates the segs_in in tcp_fastopen_add_skb()
    so that segs_in >= data_segs_in property is kept.

    Together with retransmission data, tcpi_data_segs_out
    gives a better signal on the rxmit rate.

    v6: Rebase on the latest net-next

    v5: Eric pointed out that checking skb->len is still needed in
    tcp_fastopen_add_skb() because skb can carry a FIN without data.
    Hence, instead of open coding segs_in and data_segs_in, tcp_segs_in()
    helper is used. Comment is added to the fastopen case to explain why
    segs_in has to be reset and tcp_segs_in() has to be called before
    __skb_pull().

    v4: Add comment to the changes in tcp_fastopen_add_skb()
    and also add remark on this case in the commit message.

    v3: Add const modifier to the skb parameter in tcp_segs_in()

    v2: Rework based on recent fix by Eric:
    commit a9d99ce28ed3 ("tcp: fix tcpi_segs_in after connection establishment")

    Signed-off-by: Martin KaFai Lau
    Cc: Chris Rapier
    Cc: Eric Dumazet
    Cc: Marcelo Ricardo Leitner
    Cc: Neal Cardwell
    Cc: Yuchung Cheng
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Martin KaFai Lau
     

07 Feb, 2016

1 commit

  • When we acknowledge a FIN, it is not enough to ack the sequence number
    and queue the skb into receive queue. We also have to call tcp_fin()
    to properly update socket state and send proper poll() notifications.

    It seems we also had the problem if we received a SYN packet with the
    FIN flag set, but it does not seem an urgent issue, as no known
    implementation can do that.

    Fixes: 61d2bcae99f6 ("tcp: fastopen: accept data/FIN present in SYNACK message")
    Signed-off-by: Eric Dumazet
    Cc: Yuchung Cheng
    Cc: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     

06 Feb, 2016

2 commits

  • If we remove the SYN flag from the skbs that tcp_fastopen_add_skb()
    places in socket receive queue, then we can remove the test that
    tcp_recvmsg() has to perform in fast path.

    All we have to do is to adjust SEQ in the slow path.

    For the moment, we place an unlikely() and output a message
    if we find an skb with the SYN flag set.
    The goal is to get rid of the test completely.

    Signed-off-by: Eric Dumazet
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • RFC 7413 (TCP Fast Open) section 4.2.2 states that the SYNACK message
    MAY include data and/or a FIN.

    This patch adds support for the client side:

    If we receive a SYNACK with payload or FIN, queue the skb instead
    of ignoring it.

    Since we already support the same for SYN, we refactor the existing
    code and reuse it. Note we need to clone the skb, so this operation
    might fail under memory pressure.

    Sara Dickinson pointed out FreeBSD server Fast Open implementation
    was planned to generate such SYNACK in the future.

    The server side might be implemented on Linux later.

    Reported-by: Sara Dickinson
    Signed-off-by: Eric Dumazet
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     

27 Jan, 2016

1 commit


23 Oct, 2015

1 commit

  • Multiple cpus can process duplicates of incoming ACK messages
    matching a SYN_RECV request socket. This is a rare event under
    normal operations, but definitely can happen.

    Only one must win the race, otherwise corruption would occur.

    To fix this without adding new atomic ops, we use logic in
    inet_ehash_nolisten() to detect the request was present in the same
    ehash bucket where we try to insert the new child.

    If request socket was not found, we have to undo the child creation.

    This actually removes a spin_lock()/spin_unlock() pair in
    reqsk_queue_unlink() for the fast path.

    Fixes: e994b2f0fb92 ("tcp: do not lock listener to process SYN packets")
    Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

05 Oct, 2015

1 commit

  • There are multiple races that need fixes :

    1) skb_get() + queue skb + kfree_skb() is racy

    An accept() can be done on another cpu, data consumed immediately.
    tcp_recvmsg() uses __kfree_skb(), as it is assumed all skbs found in the
    socket receive queue are private.

    Then the kfree_skb() in tcp_rcv_state_process() uses an already-freed skb.

    2) tcp_reqsk_record_syn() needs to be done before tcp_try_fastopen()
    for the same reasons.

    3) We want to send the SYNACK before queueing child into accept queue,
    otherwise we might reintroduce the ooo issue fixed in
    commit 7c85af881044 ("tcp: avoid reorders for TFO passive connections")

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

03 Oct, 2015

1 commit

  • If a listen backlog is very big (to avoid syncookies), then
    the listener sk->sk_wmem_alloc is the main source of false
    sharing, as we need to touch it twice per SYNACK re-transmit
    and TX completion.

    (One SYN packet takes the listener lock once, but up to 6 SYNACKs
    are generated.)

    By attaching the skb to the request socket, we remove this
    source of contention.

    Tested:

    listen(fd, 10485760); // single listener (no SO_REUSEPORT)
    16 RX/TX queue NIC
    Sustain a SYNFLOOD attack of ~320,000 SYN per second,
    Sending ~1,400,000 SYNACK per second.
    Perf profiles now show listener spinlock being next bottleneck.

    20.29% [kernel] [k] queued_spin_lock_slowpath
    10.06% [kernel] [k] __inet_lookup_established
    5.12% [kernel] [k] reqsk_timer_handler
    3.22% [kernel] [k] get_next_timer_interrupt
    3.00% [kernel] [k] tcp_make_synack
    2.77% [kernel] [k] ipt_do_table
    2.70% [kernel] [k] run_timer_softirq
    2.50% [kernel] [k] ip_finish_output
    2.04% [kernel] [k] cascade

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

30 Sep, 2015

1 commit

  • While auditing TCP stack for upcoming 'lockless' listener changes,
    I found I had to change fastopen_init_queue() to properly init the object
    before publishing it.

    Otherwise another cpu could try to lock the spinlock before it gets
    properly initialized.

    Instead of adding appropriate barriers, just remove the dynamic memory
    allocations:
    - The structure is 28 bytes on 64bit arches. Using an additional 8 bytes
    for holding a pointer seems overkill.
    - Two listeners can share the same cache line, and performance would suffer.

    If we really want to save a few bytes, we could instead dynamically
    allocate the whole struct request_sock_queue in the future.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

29 Sep, 2015

1 commit

  • We found that a TCP Fast Open passive connection was vulnerable
    to reorders, as the exchange might look like

    [1] C -> S S
    [2] S -> C S. ack request
    [3] S -> C .

    packets [2] and [3] can be generated at almost the same time.

    If C receives the 3rd packet before the 2nd, it will drop it as
    the socket is in SYN_SENT state and expects a SYNACK.

    S will have to retransmit the answer.

    Current OOO avoidance in linux is defeated because SYNACK
    packets are attached to the LISTEN socket, while DATA packets
    are attached to the children. They might be sent by different cpus,
    and different TX queues might be selected.

    It turns out that for TFO, we created a child, which is a
    full-blown socket in TCP_SYN_RECV state, and we can simply attach
    the SYNACK packet to this socket.

    This means that at the time tcp_sendmsg() pushes DATA packet,
    skb->ooo_okay will be set iff the SYNACK packet had been sent
    and TX completed.

    This removes the reorder source at the host level.

    We also removed the export of tcp_try_fastopen(), as it is no
    longer called from IPv6.

    Signed-off-by: Eric Dumazet
    Signed-off-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Eric Dumazet
     

23 Jun, 2015

1 commit

  • tcp_fastopen_reset_cipher really cannot be called from interrupt
    context. It allocates the tcp_fastopen_context with GFP_KERNEL and
    calls crypto_alloc_cipher, which allocates all kinds of things with
    GFP_KERNEL.

    Thus, we might sleep when the key-generation is triggered by an
    incoming TFO cookie-request which would then happen in interrupt-
    context, as shown by enabling CONFIG_DEBUG_ATOMIC_SLEEP:

    [ 36.001813] BUG: sleeping function called from invalid context at mm/slub.c:1266
    [ 36.003624] in_atomic(): 1, irqs_disabled(): 0, pid: 1016, name: packetdrill
    [ 36.004859] CPU: 1 PID: 1016 Comm: packetdrill Not tainted 4.1.0-rc7 #14
    [ 36.006085] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
    [ 36.008250] 00000000000004f2 ffff88007f8838a8 ffffffff8171d53a ffff880075a084a8
    [ 36.009630] ffff880075a08000 ffff88007f8838c8 ffffffff810967d3 ffff88007f883928
    [ 36.011076] 0000000000000000 ffff88007f8838f8 ffffffff81096892 ffff88007f89be00
    [ 36.012494] Call Trace:
    [ 36.012953] [] dump_stack+0x4f/0x6d
    [ 36.014085] [] ___might_sleep+0x103/0x170
    [ 36.015117] [] __might_sleep+0x52/0x90
    [ 36.016117] [] kmem_cache_alloc_trace+0x47/0x190
    [ 36.017266] [] ? tcp_fastopen_reset_cipher+0x42/0x130
    [ 36.018485] [] tcp_fastopen_reset_cipher+0x42/0x130
    [ 36.019679] [] tcp_fastopen_init_key_once+0x61/0x70
    [ 36.020884] [] __tcp_fastopen_cookie_gen+0x1c/0x60
    [ 36.022058] [] tcp_try_fastopen+0x58f/0x730
    [ 36.023118] [] tcp_conn_request+0x3e8/0x7b0
    [ 36.024185] [] ? __module_text_address+0x12/0x60
    [ 36.025327] [] tcp_v4_conn_request+0x51/0x60
    [ 36.026410] [] tcp_rcv_state_process+0x190/0xda0
    [ 36.027556] [] ? __inet_lookup_established+0x47/0x170
    [ 36.028784] [] tcp_v4_do_rcv+0x16d/0x3d0
    [ 36.029832] [] ? security_sock_rcv_skb+0x16/0x20
    [ 36.030936] [] tcp_v4_rcv+0x77a/0x7b0
    [ 36.031875] [] ? iptable_filter_hook+0x33/0x70
    [ 36.032953] [] ip_local_deliver_finish+0x92/0x1f0
    [ 36.034065] [] ip_local_deliver+0x9a/0xb0
    [ 36.035069] [] ? ip_rcv+0x3d0/0x3d0
    [ 36.035963] [] ip_rcv_finish+0x119/0x330
    [ 36.036950] [] ip_rcv+0x2e7/0x3d0
    [ 36.037847] [] __netif_receive_skb_core+0x552/0x930
    [ 36.038994] [] __netif_receive_skb+0x27/0x70
    [ 36.040033] [] process_backlog+0xd2/0x1f0
    [ 36.041025] [] net_rx_action+0x122/0x310
    [ 36.042007] [] __do_softirq+0x103/0x2f0
    [ 36.042978] [] do_softirq_own_stack+0x1c/0x30

    This patch moves the call to tcp_fastopen_init_key_once to the places
    where a listener socket creates its TFO state, which always happens in
    user context (either from the setsockopt, or implicitly during the
    listen() call).

    Cc: Eric Dumazet
    Cc: Hannes Frederic Sowa
    Fixes: 222e83d2e0ae ("tcp: switch tcp_fastopen key generation to net_get_random_once")
    Signed-off-by: Christoph Paasch
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Christoph Paasch
     

23 May, 2015

1 commit

  • Taking socket spinlock in tcp_get_info() can deadlock, as
    inet_diag_dump_icsk() holds the &hashinfo->ehash_locks[i],
    while packet processing can use the reverse locking order.

    We could avoid this locking for TCP_LISTEN states, but lockdep would
    certainly get confused as all TCP sockets share same lockdep classes.

    [ 523.722504] ======================================================
    [ 523.728706] [ INFO: possible circular locking dependency detected ]
    [ 523.734990] 4.1.0-dbg-DEV #1676 Not tainted
    [ 523.739202] -------------------------------------------------------
    [ 523.745474] ss/18032 is trying to acquire lock:
    [ 523.750002] (slock-AF_INET){+.-...}, at: [] tcp_get_info+0x2c4/0x360
    [ 523.758129]
    [ 523.758129] but task is already holding lock:
    [ 523.763968] (&(&hashinfo->ehash_locks[i])->rlock){+.-...}, at: [] inet_diag_dump_icsk+0x1d5/0x6c0
    [ 523.774661]
    [ 523.774661] which lock already depends on the new lock.
    [ 523.774661]
    [ 523.782850]
    [ 523.782850] the existing dependency chain (in reverse order) is:
    [ 523.790326]
    -> #1 (&(&hashinfo->ehash_locks[i])->rlock){+.-...}:
    [ 523.796599] [] lock_acquire+0xbb/0x270
    [ 523.802565] [] _raw_spin_lock+0x38/0x50
    [ 523.808628] [] __inet_hash_nolisten+0x78/0x110
    [ 523.815273] [] tcp_v4_syn_recv_sock+0x24b/0x350
    [ 523.822067] [] tcp_check_req+0x3c1/0x500
    [ 523.828199] [] tcp_v4_do_rcv+0x239/0x3d0
    [ 523.834331] [] tcp_v4_rcv+0xa8e/0xc10
    [ 523.840202] [] ip_local_deliver_finish+0x133/0x3e0
    [ 523.847214] [] ip_local_deliver+0xaa/0xc0
    [ 523.853440] [] ip_rcv_finish+0x168/0x5c0
    [ 523.859624] [] ip_rcv+0x307/0x420

    Let's use the u64_sync infrastructure instead. As a bonus, 64bit
    arches get optimized, as these are nops for them.

    Fixes: 0df48c26d841 ("tcp: add tcpi_bytes_acked to tcp_info")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

30 Apr, 2015

1 commit

  • This patch tracks total number of payload bytes received on a TCP socket.
    This is the sum of all changes done to tp->rcv_nxt.

    RFC4898 names this tcpEStatsAppHCThruOctetsReceived.

    This is a 64bit field, and can be fetched both from TCP_INFO
    getsockopt() if one has a handle on a TCP socket, or from inet_diag
    netlink facility (iproute2/ss patch will follow)

    Note that tp->bytes_received was placed near tp->rcv_nxt for
    best data locality and minimal performance impact.

    Signed-off-by: Eric Dumazet
    Cc: Yuchung Cheng
    Cc: Matt Mathis
    Cc: Eric Salo
    Cc: Martin Lau
    Cc: Chris Rapier
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Eric Dumazet
     

08 Apr, 2015

1 commit

  • Fast Open has been using the experimental option with a magic number
    (RFC6994) to request and grant Fast Open cookies. This patch enables
    the server to support the official IANA option 34 in RFC7413 in
    addition.

    The change has passed all existing Fast Open tests with both
    old and new options at Google.

    Signed-off-by: Daniel Lee
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Daniel Lee