02 Sep, 2016

1 commit


24 Apr, 2016

1 commit


09 Mar, 2016

1 commit


24 Feb, 2016

1 commit


08 Feb, 2016

1 commit


29 Aug, 2015

3 commits


10 Jul, 2015

1 commit

  • Add a helper to test the slow start condition in various congestion
    control modules and other places. This is to prepare a slight improvement
    in policy as to exactly when to slow start.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: Nandita Dukkipati
    Signed-off-by: David S. Miller

    Yuchung Cheng
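
A minimal sketch of what such a helper might look like, written as plain userspace C. The struct and its field names are hypothetical stand-ins, and the comparison used (cwnd not above ssthresh) is an assumption about the policy, not a quote from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the few per-connection fields used here. */
struct cc_state {
	uint32_t snd_cwnd;     /* congestion window, in packets */
	uint32_t snd_ssthresh; /* slow start threshold, in packets */
};

/* Assumed policy: slow start while cwnd has not exceeded ssthresh. */
static inline bool in_slow_start(const struct cc_state *s)
{
	return s->snd_cwnd <= s->snd_ssthresh;
}
```

Keeping the comparison in one helper means a later policy change touches a single line instead of every congestion control module.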
     

08 Apr, 2015

1 commit

  • Fast Open has been using an experimental option with a magic number
    (RFC6994). This patch makes the client by default use the RFC7413
    option (34) to get and send Fast Open cookies. This patch makes
    the client solicit cookies from a given server first with the
    RFC7413 option. If that fails to elicit a cookie, then it tries
    the RFC6994 experimental option. If that also fails, it uses the
    RFC7413 option on all subsequent connect attempts. If the server
    returns a Fast Open cookie then the client caches the form of the
    option that successfully elicited a cookie, and uses that form on
    later connects when it presents that cookie.

    The idea is to gradually obsolete the use of experimental options as
    the servers and clients upgrade, while keeping the interoperability
    meanwhile.

    Signed-off-by: Daniel Lee
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Daniel Lee
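
The selection order described above can be sketched as a small state machine. This is an illustration only: the enum, struct and function names are hypothetical and do not come from the patch.

```c
#include <stdbool.h>

enum tfo_opt { TFO_OPT_RFC7413, TFO_OPT_EXPERIMENTAL };

/* Hypothetical per-server cache entry. */
struct tfo_peer {
	bool have_cookie;        /* a cookie has been cached */
	enum tfo_opt cookie_opt; /* option form that elicited the cookie */
	unsigned int failures;   /* cookie requests that got no cookie (0..2) */
};

/* Pick the option form for the next connect attempt to this server. */
static enum tfo_opt tfo_next_opt(const struct tfo_peer *p)
{
	if (p->have_cookie)
		return p->cookie_opt;
	/* Only the attempt right after the first failure uses the old form. */
	return p->failures == 1 ? TFO_OPT_EXPERIMENTAL : TFO_OPT_RFC7413;
}

/* Record the outcome of an attempt made with option form `used`. */
static void tfo_record(struct tfo_peer *p, enum tfo_opt used, bool got_cookie)
{
	if (got_cookie) {
		p->have_cookie = true;
		p->cookie_opt = used;
	} else if (!p->have_cookie && p->failures < 2) {
		p->failures++;
	}
}
```

After two failures the sketch stays on the RFC7413 form, matching the "all subsequent connect attempts" wording.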
     

04 Apr, 2015

1 commit

  • The ipv4 code uses a mixture of coding styles. In some instances the check
    for a NULL pointer is written as x == NULL and in others as !x. !x is
    preferred according to checkpatch and this patch makes the code
    consistent by adopting the latter form.

    No changes detected by objdiff.

    Signed-off-by: Ian Morris
    Signed-off-by: David S. Miller

    Ian Morris
     

01 Apr, 2015

3 commits


17 Mar, 2015

1 commit

  • Changes in tcp_metric hash table are protected by tcp_metrics_lock
    only, not by genl_mutex

    While we are at it use deref_locked() instead of rcu_dereference()
    in tcp_new() to avoid unnecessary barrier, as we hold tcp_metrics_lock
    as well.

    Reported-by: Andrew Vagin
    Signed-off-by: Eric Dumazet
    Fixes: 098a697b497e ("tcp_metrics: Use a single hash table for all network namespaces.")
    Reviewed-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric Dumazet
     

13 Mar, 2015

6 commits


18 Jan, 2015

1 commit

  • Contrary to common expectations for an "int" return, these functions
    return only a positive value -- if used correctly they cannot even
    return 0 because the message header will necessarily be in the skb.

    This makes the very common pattern of

    if (genlmsg_end(...) < 0) { ... }

    be a whole bunch of dead code. Many places also simply do

    return nlmsg_end(...);

    and the caller is expected to deal with it.

    This also commonly (at least for me) causes errors, because it is very
    common to write

    if (my_function(...))
    /* error condition */

    and if my_function() does "return nlmsg_end()" this is of course wrong.

    Additionally, there's not a single place in the kernel that actually
    needs the message length returned, and if anyone needs it later then
    it'll be very easy to just use skb->len there.

    Remove this, and make the functions void. This removes a bunch of dead
    code as described above. The patch adds lines because I did

    - return nlmsg_end(...);
    + nlmsg_end(...);
    + return 0;

    I could have preserved all the function's return values by returning
    skb->len, but instead I've audited all the places calling the affected
    functions and found that none cared. A few places actually compared
    the return value with < 0 with no change in behaviour, so I opted for the more
    efficient version.

    One instance of the error I've made numerous times now is also present
    in net/phonet/pn_netlink.c in the route_dumpit() function - it didn't
    check for
    Signed-off-by: David S. Miller

    Johannes Berg
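
The pitfall described above can be reproduced in a few lines of userspace C. All names here are hypothetical stand-ins for the netlink helpers, and the 16-byte header size is an assumption made only so the length is never 0.

```c
#include <stddef.h>

/* Hypothetical stand-in for an skb: only tracks the message length. */
struct fake_skb { size_t len; };

/* Old style: finalize the message and return its (always positive) length. */
static int old_msg_end(struct fake_skb *skb, size_t payload)
{
	skb->len = 16 + payload; /* assumed header size, so never 0 */
	return (int)skb->len;
}

/* A fill function that passes the length straight through. */
static int old_fill(struct fake_skb *skb)
{
	return old_msg_end(skb, 4);
}

/* New style: the finalizer is void and the fill function returns 0. */
static void new_msg_end(struct fake_skb *skb, size_t payload)
{
	skb->len = 16 + payload;
}

static int new_fill(struct fake_skb *skb)
{
	new_msg_end(skb, 4);
	return 0;
}
```

A caller written as `if (old_fill(&skb))` takes the error branch on every success, while the same test against `new_fill()` behaves as expected; the length remains available in `skb.len`.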
     

15 Aug, 2014

1 commit

  • tcp_tw_recycle heavily relies on tcp timestamps to build a per-host
    ordering of incoming connections and teardowns without the need to
    hold state on a specific quadruple for TCP_TIMEWAIT_LEN, but only for
    the last measured RTO. To do so, we keep the last seen timestamp in a
    per-host indexed data structure and verify if the incoming timestamp
    in a connection request is strictly greater than the saved one during
    last connection teardown. Thus we can verify later on that no old data
    packets will be accepted by the new connection.

    When moving a socket to time-wait state we already verify whether timestamps
    were seen on the connection. Only if that was the case do we let the
    time-wait socket expire after the RTO, otherwise normal TCP_TIMEWAIT_LEN
    will be used. But we don't verify this on incoming SYN packets. If a
    connection teardown was less than TCP_PAWS_MSL seconds in the past we
    cannot guarantee to not accept data packets from an old connection if
    no timestamps are present. We should drop this SYN packet. This patch
    closes this loophole.

    Please note, this patch does not make tcp_tw_recycle in any way more
    usable but only adds another safety check:
    Sporadic drops of SYN packets because of reordering in the network or
    in the socket backlog queues can happen. Users behind NAT trying to
    connect to a tcp_tw_recycle enabled server can get caught in blackholes
    and their connection requests may regularly get dropped because hosts
    behind an address translator don't have synchronized tcp timestamp clocks.
    tcp_tw_recycle cannot work if peers don't have tcp timestamps enabled.

    In general, use of tcp_tw_recycle is discouraged.

    Cc: Eric Dumazet
    Cc: Florian Westphal
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
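
A rough sketch of the SYN check described above, in userspace C. The struct, the function name and the 60 second value used for the PAWS window are assumptions for illustration; the kernel code is organised differently.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAWS_MSL_SECS 60 /* assumed value for this sketch */

/* Hypothetical per-host record saved at the last connection teardown. */
struct peer_ts {
	uint32_t ts;    /* last timestamp value seen from the host */
	uint32_t stamp; /* time of the teardown, in seconds */
};

static bool syn_acceptable(const struct peer_ts *p, uint32_t now,
			   bool syn_has_ts, uint32_t syn_ts)
{
	if (now - p->stamp >= PAWS_MSL_SECS)
		return true;  /* teardown is old enough, nothing to prove */
	if (!syn_has_ts)
		return false; /* the loophole: no timestamp, so drop the SYN */
	/* Serial-number comparison: strictly newer, tolerating wraparound. */
	return (int32_t)(syn_ts - p->ts) > 0;
}
```

The signed subtraction is what makes "strictly greater" meaningful for a 32-bit timestamp that wraps.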
     

01 Aug, 2014

1 commit

  • commit d23ff7016 (tcp: add generic netlink support for tcp_metrics) introduced
    netlink support for the new tcp_metrics, however it restricted getting of
    tcp_metrics to root user only. This is a change from how these values could
    have been fetched when in the old route cache. Unless there's a legitimate
    reason to restrict the reading of these values it would be better if normal
    users could fetch them.

    Cc: Julian Anastasov
    Cc: linux-kernel@vger.kernel.org

    Signed-off-by: Debabrata Banerjee
    Signed-off-by: David S. Miller

    Banerjee, Debabrata
     

05 Jun, 2014

1 commit


27 Feb, 2014

1 commit

  • Upcoming congestion controls for TCP require usec resolution for RTT
    estimations. Millisecond resolution is simply not enough these days.

    FQ/pacing in DC environments also require this change for finer control
    and removal of bimodal behavior due to the current hack in
    tcp_update_pacing_rate() for 'small rtt'

    TCP_CONG_RTT_STAMP is no longer needed.

    As Julian Anastasov pointed out, we need to keep user compatibility :
    tcp_metrics used to export RTT and RTTVAR in msec resolution,
    so we added RTT_US and RTTVAR_US. An iproute2 patch is needed
    to use the new attributes if provided by the kernel.

    In this example ss command displays a srtt of 32 usecs (10Gbit link)

    lpk51:~# ./ss -i dst lpk52
    Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port
    tcp ESTAB 0 1 10.246.11.51:42959 10.246.11.52:64614
    cubic wscale:6,6 rto:201 rtt:0.032/0.001 ato:40 mss:1448 cwnd:10 send 3620.0Mbps pacing_rate 7240.0Mbps unacked:1 rcv_rtt:993 rcv_space:29559

    Updated iproute2 ip command displays :

    lpk51:~# ./ip tcp_metrics | grep 10.246.11.52
    10.246.11.52 age 561.914sec cwnd 10 rtt 274us rttvar 213us source 10.246.11.51

    Old binary displays :

    lpk51:~# ip tcp_metrics | grep 10.246.11.52
    10.246.11.52 age 561.914sec cwnd 10 rtt 250us rttvar 125us source 10.246.11.51

    With help from Julian Anastasov, Stephen Hemminger and Yuchung Cheng

    Signed-off-by: Eric Dumazet
    Acked-by: Neal Cardwell
    Cc: Stephen Hemminger
    Cc: Yuchung Cheng
    Cc: Larry Brakmo
    Cc: Julian Anastasov
    Signed-off-by: David S. Miller

    Eric Dumazet
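
As a worked check of the ss example above: the reported send rate is consistent with cwnd * mss * 8 / rtt using the 32 usec srtt (rtt:0.032 is in milliseconds). The function below is a hypothetical helper written for this arithmetic only; it is not taken from ss or the kernel.

```c
/*
 * Bits sent per round trip divided by the round-trip time.
 * Bits per microsecond is numerically equal to Mbit/s.
 */
static double send_rate_mbps(unsigned int cwnd, unsigned int mss, double rtt_us)
{
	return (double)cwnd * mss * 8.0 / rtt_us;
}
```

With cwnd 10, mss 1448 and rtt 32 usec this gives 115840 / 32 = 3620 Mbit/s, matching the "send 3620.0Mbps" field; the pacing_rate shown is twice that figure. At millisecond resolution a 32 usec RTT would round to 0 and no such rate could be computed.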
     

24 Jan, 2014

1 commit

  • A socket may be v6/v4-mapped. In that case sk->sk_family is AF_INET6,
    but the IP being used is actually an IPv4-address.
    The current tcp-metrics code will thus represent it as an IPv6 address:

    root@server:~# ip tcp_metrics
    ::ffff:10.1.1.2 age 22.920sec rtt 18750us rttvar 15000us cwnd 10
    10.1.1.2 age 47.970sec rtt 16250us rttvar 10000us cwnd 10

    This patch modifies the tcp-metrics so that they are able to handle the
    v6/v4-mapped sockets correctly.

    Signed-off-by: Christoph Paasch
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Christoph Paasch
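
One way to picture the fix is to unwrap a v4-mapped address before it is used as a cache key. The sketch below uses the userspace `IN6_IS_ADDR_V4MAPPED()` macro from `<netinet/in.h>`; the `metrics_addr` struct and the function name are hypothetical, and the kernel uses its own helpers for this.

```c
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical cache key: an address family plus the address bytes. */
struct metrics_addr {
	sa_family_t family;
	union {
		struct in_addr v4;
		struct in6_addr v6;
	} addr;
};

/* Build a key from an IPv6 socket address, unwrapping v4-mapped ones. */
static void metrics_addr_from_in6(struct metrics_addr *key,
				  const struct in6_addr *a6)
{
	memset(key, 0, sizeof(*key));
	if (IN6_IS_ADDR_V4MAPPED(a6)) {
		key->family = AF_INET;
		/* The IPv4 address is the last 4 bytes of ::ffff:a.b.c.d. */
		memcpy(&key->addr.v4, &a6->s6_addr[12], 4);
	} else {
		key->family = AF_INET6;
		key->addr.v6 = *a6;
	}
}
```

With this, ::ffff:10.1.1.2 and 10.1.1.2 produce the same key, so the two entries in the listing above would collapse into one.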
     

23 Jan, 2014

1 commit

  • In bbf852b96ebdc6d1 I introduced the tmlist, which allows deleting
    multiple entries from the cache that match a specified destination if no
    source-IP is specified.

    However, as the cache is an RCU-list, we should not create this tmlist, as
    it will change the tcpm_next pointer of the element that will be deleted
    and so a thread iterating over the cache's entries while holding the
    RCU-lock might get "redirected" to this tmlist.

    This patch fixes this, by reverting back to the old behavior prior to
    bbf852b96ebdc6d1, which means that we simply change the tcpm_next
    pointer of the previous element (pp) to jump over the one we are
    deleting.
    The difference is that we call kfree_rcu() directly on the cache entry,
    which allows us to delete multiple entries from the list.

    Fixes: bbf852b96ebdc6d1 (tcp: metrics: Delete all entries matching a certain destination)
    Signed-off-by: Christoph Paasch
    Signed-off-by: David S. Miller

    Christoph Paasch
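
The unlink-through-the-previous-pointer approach described above can be sketched in userspace C as follows. There is no RCU here: `free()` stands in for the deferred `kfree_rcu()`, and the struct is a hypothetical reduction of a cache entry.

```c
#include <stdlib.h>

/* Hypothetical cache entry; `next` plays the role of tcpm_next. */
struct entry {
	int daddr;
	struct entry *next;
};

/* Unlink and free every entry whose destination matches; return the count. */
static int delete_matching(struct entry **head, int daddr)
{
	struct entry **pp = head;
	struct entry *e;
	int found = 0;

	while ((e = *pp) != NULL) {
		if (e->daddr == daddr) {
			*pp = e->next; /* predecessor skips e; e->next is untouched */
			free(e);       /* the kernel would defer this with kfree_rcu() */
			found++;
		} else {
			pp = &e->next;
		}
	}
	return found;
}
```

The point of the design is that the deleted entry's own next pointer is never rewritten, so a reader that is already standing on it still walks into the live list rather than a private one.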
     

18 Jan, 2014

2 commits

  • Conflicts:
    drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
    net/ipv4/tcp_metrics.c

    Overlapping changes between the "don't create two tcp metrics objects
    with the same key" race fix in net and the addition of the destination
    address in the lookup key in net-next.

    Minor overlapping changes in bnx2x driver.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Because the tcp-metrics is an RCU-list, it may be that two
    soft-interrupts are inside __tcp_get_metrics() for the same
    destination-IP at the same time. If this destination-IP is not yet part of
    the tcp-metrics, both soft-interrupts will end up in tcpm_new and create
    a new entry for this IP.
    So, we will have two tcp-metrics with the same destination-IP in the list.

    This patch calls __tcp_get_metrics() twice. First without holding the
    lock, then while holding the lock. The second one is there to confirm
    that the entry has not been added by another soft-irq while waiting for
    the spin-lock.

    Fixes: 51c5d0c4b169b (tcp: Maintain dynamic metrics in local cache.)
    Signed-off-by: Christoph Paasch
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Christoph Paasch
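
The check, lock, check-again pattern described above can be sketched as below. This is a single-threaded userspace illustration of the control flow only: the lock is a stub, the table is a hypothetical single list, and the unlocked first lookup is safe in the kernel only because of RCU.

```c
#include <stdlib.h>

struct entry {
	int daddr;
	struct entry *next;
};

static struct entry *table; /* hypothetical single-bucket cache */
static int lock_held;       /* stub standing in for the spin lock */

static void table_lock(void)   { lock_held = 1; }
static void table_unlock(void) { lock_held = 0; }

static struct entry *lookup(int daddr)
{
	struct entry *e;

	for (e = table; e; e = e->next)
		if (e->daddr == daddr)
			return e;
	return NULL;
}

static struct entry *get_or_create(int daddr)
{
	struct entry *e = lookup(daddr); /* first check, no lock held */

	if (e)
		return e;

	table_lock();
	e = lookup(daddr);               /* second check, lock held */
	if (!e) {
		e = malloc(sizeof(*e));
		if (e) {
			e->daddr = daddr;
			e->next = table;
			table = e;
		}
	}
	table_unlock();
	return e;
}
```

The second lookup is what prevents the duplicate: whoever acquires the lock second finds the entry the first one inserted.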
     

11 Jan, 2014

5 commits


20 Nov, 2013

1 commit

  • As suggested by David Miller, make genl_register_family_with_ops()
    a macro and pass only the array, evaluating ARRAY_SIZE() in the
    macro, this is a little safer.

    The openvswitch has some indirection, assign ops/n_ops directly in
    that code. This might ultimately just assign the pointers in the
    family initializations, saving the struct genl_family_and_ops and
    code (once mcast groups are handled differently.)

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
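
The macro idea above, sketched with hypothetical `demo_` names rather than the real genetlink types: the caller passes only the array, and the macro supplies the element count.

```c
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

struct demo_ops { int cmd; }; /* hypothetical stand-in for an ops entry */

struct demo_family {
	const struct demo_ops *ops;
	size_t n_ops;
};

/* Underlying function still takes an explicit count. */
static int demo_register_with_ops(struct demo_family *f,
				  const struct demo_ops *ops, size_t n_ops)
{
	f->ops = ops;
	f->n_ops = n_ops;
	return 0;
}

/* The caller passes only the array; the macro computes its length. */
#define demo_register_family_with_ops(family, ops) \
	demo_register_with_ops((family), (ops), ARRAY_SIZE(ops))
```

This removes the chance of passing a count that does not match the array, though it only works when the argument is a true array and not a pointer.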
     

15 Nov, 2013

2 commits

  • Now that genl_ops are no longer modified in place when
    registering, they can be made const. This patch was done
    mostly with spatch:

    @@
    identifier ops;
    @@
    +const
    struct genl_ops ops[] = {
    ...
    };

    (except the struct thing in net/openvswitch/datapath.c)

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     
  • We had some reports of crashes using TCP fastopen, and Dave Jones
    gave a nice stack trace pointing to the error.

    Issue is that tcp_get_metrics() should not be called with a NULL dst

    Fixes: 1fe4c481ba637 ("net-tcp: Fast Open client - cookie cache")
    Signed-off-by: Eric Dumazet
    Reported-by: Dave Jones
    Cc: Yuchung Cheng
    Acked-by: Yuchung Cheng
    Tested-by: Dave Jones
    Signed-off-by: David S. Miller

    Eric Dumazet
     

30 Oct, 2013

1 commit

  • Fast Open currently has a fall back feature to address SYN-data being
    dropped but it requires the middle-box to pass on regular SYN retry
    after SYN-data. This is implemented in commit aab487435 ("net-tcp:
    Fast Open client - detecting SYN-data drops")

    However, some NAT boxes will drop all subsequent packets after the first
    SYN-data and blackhole the entire connection. An example is in
    commit 356d7d8 "netfilter: nf_conntrack: fix tcp_in_window for Fast
    Open".

    The sender should note such incidents and temporarily fall back to the
    regular TCP handshake on subsequent attempts as well: after the
    second SYN times out, the original Fast Open SYN is most likely lost.
    When such an event recurs, Fast Open is disabled for a period that grows
    exponentially with the number of recurrences.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
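
A sketch of an exponentially growing disable period, under stated assumptions: the 60 second base, the doubling per recurrence, the cap, and all names are hypothetical choices for illustration, not values taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define TFO_BASE_DISABLE_SECS 60 /* assumed base period for this sketch */
#define TFO_MAX_SHIFT 16         /* assumed cap so the shift stays bounded */

/* Hypothetical per-destination record of SYN-data losses. */
struct tfo_loss {
	unsigned int syn_loss; /* number of recorded recurrences */
	uint64_t last_loss;    /* time of the last one, in seconds */
};

/* May Fast Open be attempted to this destination at time `now`? */
static bool tfo_allowed(const struct tfo_loss *l, uint64_t now)
{
	unsigned int shift;

	if (l->syn_loss == 0)
		return true;
	shift = l->syn_loss - 1;
	if (shift > TFO_MAX_SHIFT)
		shift = TFO_MAX_SHIFT;
	/* The disable period doubles with every recurrence. */
	return now >= l->last_loss + ((uint64_t)TFO_BASE_DISABLE_SECS << shift);
}
```

Under these assumptions one loss disables Fast Open for 60 seconds, a second for 120, a third for 240, and so on.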
     

10 Oct, 2013

1 commit

  • TCP listener refactoring, part 5 :

    We want to be able to insert request sockets (SYN_RECV) into main
    ehash table instead of the per listener hash table to allow RCU
    lookups and remove listener lock contention.

    This patch includes the needed struct sock_common in front
    of struct request_sock

    This means there is no more inet6_request_sock IPv6 specific
    structure.

    Following inet_request_sock fields were renamed as they became
    macros to reference fields from struct sock_common.
    Prefix ir_ was chosen to avoid name collisions.

    loc_port -> ir_loc_port
    loc_addr -> ir_loc_addr
    rmt_addr -> ir_rmt_addr
    rmt_port -> ir_rmt_port
    iif -> ir_iif

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
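
To illustrate why the renamed fields carry a prefix, here is a heavily reduced sketch of fields that become macros pointing into an embedded common structure. The struct layouts and member names are hypothetical; only the `ir_` names come from the list above.

```c
#include <stdint.h>

/* Hypothetical, reduced stand-in for the shared common header. */
struct common_hdr {
	uint32_t daddr;     /* remote address */
	uint32_t rcv_saddr; /* local address */
	uint16_t dport;     /* remote port */
};

struct req_sock {
	struct common_hdr req_common; /* placed in front, as described above */
};

struct inet_req_sock {
	struct req_sock req;
};

/* Former fields, now macros that resolve to members of the common header. */
#define ir_loc_addr req.req_common.rcv_saddr
#define ir_rmt_addr req.req_common.daddr
#define ir_rmt_port req.req_common.dport
```

Because the preprocessor rewrites every occurrence of a macro name, an unprefixed macro such as `rmt_port` would also rewrite unrelated variables or struct members with that name; the `ir_` prefix keeps the substitution confined to request socket accesses like `ireq->ir_rmt_port`.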