13 Dec, 2011

3 commits

  • This patch allows each namespace to independently set up
    its levels for tcp memory pressure thresholds. This patch
    alone does not buy much: we need to make these values
    per group of processes somehow. This is achieved in the
    patches that follow in this patchset.

    Signed-off-by: Glauber Costa
    Reviewed-by: KAMEZAWA Hiroyuki
    CC: David S. Miller
    CC: Eric W. Biederman
    Signed-off-by: David S. Miller

    Glauber Costa
     
  • This patch introduces memory pressure controls for the tcp
    protocol. It uses the generic socket memory pressure code
    introduced in earlier patches, and fills in the
    necessary data in cg_proto struct.

    Signed-off-by: Glauber Costa
    Reviewed-by: KAMEZAWA Hiroyuki
    CC: Eric W. Biederman
    Signed-off-by: David S. Miller

    Glauber Costa
     
  • This patch replaces all uses of the struct sock fields memory_pressure,
    memory_allocated, sockets_allocated, and sysctl_mem with accessor
    macros. Those macros can receive either a socket argument or a mem_cgroup
    argument, depending on the context they live in.

    Since we're only doing a macro wrapping here, no performance impact at all is
    expected in the case where cgroups are disabled.

    Signed-off-by: Glauber Costa
    Reviewed-by: Hiroyouki Kamezawa
    CC: David S. Miller
    CC: Eric W. Biederman
    CC: Eric Dumazet
    Signed-off-by: David S. Miller

    Glauber Costa
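
The accessor idea described above can be sketched in a few lines. This is a minimal illustration, not the kernel's actual definitions: every type and function name below (`proto_sketch`, `cg_proto_sketch`, `sock_acct_sketch`, `sk_memory_allocated_ptr`, `sk_memory_allocated_add`) is hypothetical, and the real patch uses macros rather than inline functions.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct proto_sketch {
    long memory_allocated;      /* global, per-protocol counter */
};

struct cg_proto_sketch {
    long memory_allocated;      /* per-cgroup counter */
};

struct sock_acct_sketch {
    struct proto_sketch    *prot;   /* always set */
    struct cg_proto_sketch *cgrp;   /* NULL when no cgroup applies */
};

/* Accessor: callers no longer touch the counter field directly, so the
 * choice between the global and the per-cgroup counter lives in one place. */
static inline long *sk_memory_allocated_ptr(struct sock_acct_sketch *sk)
{
    if (sk->cgrp != NULL)
        return &sk->cgrp->memory_allocated;
    return &sk->prot->memory_allocated;
}

/* Add to whichever counter applies and return its new value. */
static inline long sk_memory_allocated_add(struct sock_acct_sketch *sk, long amt)
{
    long *counter = sk_memory_allocated_ptr(sk);

    *counter += amt;
    return *counter;
}
```

With this shape, a socket outside any cgroup keeps updating the protocol-wide counter, while a socket inside one updates only its group's counter.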
     

01 Dec, 2011

1 commit

  • Rick Jones reported that a TCP_CONGESTION sockopt performed on a listener
    was ignored for its child sockets: right after accept(), the
    congestion control for the new socket is the system default.

    This seems to be an oversight in the initial design (quoted from Stephen).

    Based on prior investigation and patch from Rick.

    Reported-by: Rick Jones
    Signed-off-by: Eric Dumazet
    CC: Stephen Hemminger
    CC: Yuchung Cheng
    Tested-by: Rick Jones
    Signed-off-by: David S. Miller

    Eric Dumazet
     

17 Nov, 2011

1 commit

  • Simon Kirby reported divide-by-zero errors in __tcp_select_window().

    This happens when inet_csk_route_child_sock() returns a NULL pointer:

    We free the new socket while a keepalive timer may already have been
    armed in tcp_create_openreq_child().

    Fix this with a call to tcp_clear_xmit_timers().

    [ This is a followup to commit 918eb39962dff (net: add missing
    bh_unlock_sock() calls) ]

    Reported-by: Simon Kirby
    Signed-off-by: Eric Dumazet
    Tested-by: Simon Kirby
    Signed-off-by: David S. Miller

    Eric Dumazet
     

04 Nov, 2011

1 commit

  • Simon Kirby reported lockdep warnings and the following messages:

    [104661.897577] huh, entered softirq 3 NET_RX ffffffff81613740
    preempt_count 00000101, exited with 00000102?

    [104661.923653] huh, entered softirq 3 NET_RX ffffffff81613740
    preempt_count 00000101, exited with 00000102?

    Problem comes from commit 0e734419
    (ipv4: Use inet_csk_route_child_sock() in DCCP and TCP.)

    If inet_csk_route_child_sock() returns NULL, we should release the socket
    lock before freeing the socket.

    Another lock imbalance has existed since commit 093d282321da (tproxy: fix
    hash locking issue when using port redirection in __inet_inherit_port())
    when __inet_inherit_port() returns an error; a backport is also needed for
    >= 2.6.37 kernels.

    Reported-by: Simon Kirby
    Signed-off-by: Eric Dumazet
    Tested-by: Eric Dumazet
    CC: Balazs Scheidler
    CC: KOVACS Krisztian
    Reviewed-by: Thomas Gleixner
    Tested-by: Simon Kirby
    Signed-off-by: David S. Miller

    Eric Dumazet
     

02 Nov, 2011

1 commit

  • The tcp and udp code creates a set of struct file_operations at runtime,
    although this can also be done at compile time, with the added benefit of
    then having these file operations be const.

    The trickiest part was getting the "THIS_MODULE" reference right; the naive
    method of declaring a struct in the place of registration would not work
    for this reason.

    Signed-off-by: Arjan van de Ven
    Signed-off-by: David S. Miller

    Arjan van de Ven
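
A rough sketch of the compile-time approach follows. All names here (`module_sketch`, `file_ops_sketch`, `afinfo_sketch`, `this_module_sketch`, `tcp4_seq_fops_sketch`) are hypothetical stand-ins, and the real kernel structures have many more members. The point it illustrates is that each protocol file defines its own const table, so the owner reference resolves in the file that defines the table rather than in the shared registration helper.

```c
#include <stddef.h>

/* Hypothetical stand-ins for struct module and struct file_operations. */
struct module_sketch {
    const char *name;
};

struct file_ops_sketch {
    const struct module_sketch *owner;
    int (*open)(void);
};

/* Per-protocol registration info now only points at a const table. */
struct afinfo_sketch {
    const char *name;
    const struct file_ops_sketch *seq_fops;
};

/* Stand-in for what THIS_MODULE would refer to in this file. */
static const struct module_sketch this_module_sketch = { "ipv4" };

static int seq_open_sketch(void)
{
    return 0;
}

/* Built at compile time and const, instead of being filled in at runtime. */
static const struct file_ops_sketch tcp4_seq_fops_sketch = {
    .owner = &this_module_sketch,
    .open  = seq_open_sketch,
};

static const struct afinfo_sketch tcp4_afinfo_sketch = {
    .name     = "tcp",
    .seq_fops = &tcp4_seq_fops_sketch,
};
```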
     

24 Oct, 2011

2 commits

  • There is a long-standing bug in the Linux tcp stack concerning ACK
    messages sent on behalf of TIME_WAIT sockets.

    In the IP header of the ACK message, we choose to reflect the TOS field of
    the incoming message, and this might break some setups.

    Example of things that were broken :
    - Routing using TOS as a selector
    - Firewalls
    - Traffic classification / shaping

    We now remember in timewait structure the inet tos field and use it in
    ACK generation, and route lookup.

    Notes :
    - We still reflect incoming TOS in RST messages.
    - We could extend MuraliRaja Muniraju patch to report TOS value in
    netlink messages for TIME_WAIT sockets.
    - A patch is needed for IPv6

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Now that tcp_md5_hash_header() has a const tcphdr argument, we can add
    more const attributes to callers.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 Oct, 2011

1 commit

  • Adding const qualifiers to pointers can ease code review and spot some
    bugs. It might also allow the compiler to optimize code further.

    For example, is it legal to temporarily write a null cksum into tcphdr
    in tcp_md5_hash_header() ? I am afraid a sniffer could catch the
    temporary null value...

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
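
One way to avoid the temporary write questioned above is to zero the checksum in a private copy of the header, which a const-qualified argument then enforces at compile time. The sketch below assumes a cut-down header (`hdr_sketch`) and a toy byte sum (`toy_digest`) standing in for the real MD5 update; neither is the kernel's code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, cut-down header; the real struct tcphdr has more fields. */
struct hdr_sketch {
    uint16_t source;
    uint16_t dest;
    uint16_t check;
};

/* Toy byte sum standing in for the real MD5 update step. */
static uint32_t toy_digest(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    uint32_t sum = 0;

    while (len--)
        sum += *p++;
    return sum;
}

/* The const qualifier documents that the caller's header is never
 * modified: the checksum is zeroed in a local copy instead, so nothing
 * observing the original packet can see a transient null checksum. */
static uint32_t hash_header_sketch(const struct hdr_sketch *th)
{
    struct hdr_sketch copy;

    memcpy(&copy, th, sizeof(copy));
    copy.check = 0;
    return toy_digest(&copy, sizeof(copy));
}
```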
     

08 Oct, 2011

1 commit


05 Oct, 2011

1 commit

  • tcp_v4_clear_md5_list() assumes that multiple tcp md5sig peers
    only hold one reference to md5sig_pool, but tcp_v4_md5_do_add()
    increases the use count of md5sig_pool for each peer. This patch
    makes tcp_v4_md5_do_add() only increase the use count for the first
    tcp md5sig peer.

    Signed-off-by: Zheng Yan
    Signed-off-by: David S. Miller

    Yan, Zheng
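
The reference-count imbalance is easy to see in a toy model. The names below (`pool_users`, `md5_list_sketch`, `md5_add_peer`, `md5_clear_list`) are hypothetical; the sketch only mirrors the counting rule described above, not the kernel's data structures.

```c
/* Stand-in for the md5sig_pool use count. */
static int pool_users;

/* Hypothetical per-socket list of md5sig peers. */
struct md5_list_sketch {
    int entries;    /* number of configured peers */
};

/* Only the first peer takes a pool reference; before the fix, every
 * added peer would have incremented the count. */
static void md5_add_peer(struct md5_list_sketch *list)
{
    if (list->entries == 0)
        pool_users++;
    list->entries++;
}

/* Clearing drops exactly one reference, matching the single one taken. */
static void md5_clear_list(struct md5_list_sketch *list)
{
    if (list->entries > 0) {
        list->entries = 0;
        pool_users--;
    }
}
```

Adding three peers and then clearing the list leaves the count balanced at zero; with one increment per peer and a single decrement on clear, two references would leak.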
     

27 Sep, 2011

1 commit

  • struct tcp_skb_cb contains a "flags" field containing either tcp flags
    or IP dsfield depending on context (input or output path)

    Introduce ip_dsfield to make the difference clear and ease maintenance.
    If later we want to save space, we can union flags/ip_dsfield

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

22 Sep, 2011

1 commit

  • Conflicts:
    MAINTAINERS
    drivers/net/Kconfig
    drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c
    drivers/net/ethernet/broadcom/tg3.c
    drivers/net/wireless/iwlwifi/iwl-pci.c
    drivers/net/wireless/iwlwifi/iwl-trans-tx-pcie.c
    drivers/net/wireless/rt2x00/rt2800usb.c
    drivers/net/wireless/wl12xx/main.c

    David S. Miller
     

16 Sep, 2011

1 commit

  • "Possible SYN flooding on port xxxx " messages can fill logs on servers.

    Change logic to log the message only once per listener, and add two new
    SNMP counters to track :

    TCPReqQFullDoCookies : number of times a SYNCOOKIE was sent in reply
    to a client

    TCPReqQFullDrop : number of times a SYN request was dropped because
    syncookies were not enabled.

    Based on a prior patch from Tom Herbert, and suggestions from David.

    Signed-off-by: Eric Dumazet
    CC: Tom Herbert
    Signed-off-by: David S. Miller

    Eric Dumazet
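
The "log once per listener, count every time" logic can be sketched as below. The structures and the function `syn_queue_full_sketch` are hypothetical; only the two counter names come from the commit message, and the log text is abbreviated.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-listener state: one flag remembers that we warned. */
struct listener_sketch {
    unsigned short port;
    bool synflood_warned;
};

/* Stand-ins for the two new SNMP counters. */
struct counters_sketch {
    unsigned long req_q_full_do_cookies;    /* TCPReqQFullDoCookies */
    unsigned long req_q_full_drop;          /* TCPReqQFullDrop */
};

/* Called when the request queue is full. Counters are bumped on every
 * call; the message is printed only the first time for this listener.
 * Returns true if this call emitted the log message. */
static bool syn_queue_full_sketch(struct listener_sketch *lsk,
                                  struct counters_sketch *cnt,
                                  bool syncookies_enabled)
{
    if (syncookies_enabled)
        cnt->req_q_full_do_cookies++;
    else
        cnt->req_q_full_drop++;

    if (lsk->synflood_warned)
        return false;

    lsk->synflood_warned = true;
    printf("Possible SYN flooding on port %u\n", (unsigned int)lsk->port);
    return true;
}
```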
     

18 Aug, 2011

1 commit

  • The l4_rxhash flag was added to the skb structure to indicate
    that the rxhash value was computed over the 4 tuple for the
    packet which includes the port information in the encapsulated
    transport packet. This is used by the stack to preserve the
    rxhash value in __skb_rx_tunnel.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     

07 Aug, 2011

1 commit

  • Computers have become a lot faster since we compromised on the
    partial MD4 hash which we use currently for performance reasons.

    MD5 is a much safer choice, and is in line with both RFC1948 and
    other ISS generators (OpenBSD, Solaris, etc.)

    Furthermore, only having 24-bits of the sequence number be truly
    unpredictable is a very serious limitation. So the periodic
    regeneration and 8-bit counter have been removed. We compute and
    use a full 32-bit sequence number.

    For ipv6, DCCP was found to use a 32-bit truncated initial sequence
    number (it needs 48-bits) and that is fixed here as well.

    Reported-by: Dan Kaminsky
    Tested-by: Willy Tarreau
    Signed-off-by: David S. Miller

    David S. Miller
     

21 Jun, 2011

1 commit


18 Jun, 2011

1 commit

  • On Thursday 16 June 2011 at 23:38 -0400, David Miller wrote:
    > From: Ben Hutchings
    > Date: Fri, 17 Jun 2011 00:50:46 +0100
    >
    > > On Wed, 2011-06-15 at 04:15 +0200, Eric Dumazet wrote:
    > >> @@ -1594,6 +1594,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
    > >> goto discard;
    > >>
    > >> if (nsk != sk) {
    > >> + sock_rps_save_rxhash(nsk, skb->rxhash);
    > >> if (tcp_child_process(sk, nsk, skb)) {
    > >> rsk = nsk;
    > >> goto reset;
    > >>
    > >
    > > I haven't tried this, but it looks reasonable to me.
    > >
    > > What about IPv6? The logic in tcp_v6_do_rcv() looks very similar.
    >
    > Indeed ipv6 side needs the same fix.
    >
    > Eric please add that part and resubmit. And in fact I might stick
    > this into net-2.6 instead of net-next-2.6
    >

    OK, here is the net-2.6 based one then, thanks !

    [PATCH v2] net: rfs: enable RFS before first data packet is received

    First packet received on a passive tcp flow is not correctly RFS
    steered.

    One sock_rps_record_flow() call is missing in inet_accept()

    But before that, we also must record rxhash when child socket is setup.

    Signed-off-by: Eric Dumazet
    CC: Tom Herbert
    CC: Ben Hutchings
    CC: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Eric Dumazet
     

09 Jun, 2011

1 commit

  • This patch lowers the default initRTO from 3secs to 1sec per
    RFC2988bis. It falls back to 3secs if the SYN or SYN-ACK packet
    has been retransmitted, AND the TCP timestamp option is not on.

    It also adds support to take RTT sample during 3WHS on the passive
    open side, just like its active open counterpart, and uses it, if
    valid, to seed the initRTO for the data transmission phase.

    The patch also resets ssthresh to its initial default at the
    beginning of the data transmission phase, and reduces cwnd to 1 if
    there has been MORE THAN ONE retransmission during 3WHS per RFC5681.

    Signed-off-by: H.K. Jerry Chu
    Signed-off-by: David S. Miller

    Jerry Chu
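
The two threshold rules in this message can be written down as small pure functions. This is a sketch under stated assumptions: the constant and function names are hypothetical, times are in milliseconds for readability, and the RTT-sample seeding and ssthresh reset described above are not modeled.

```c
#include <stdbool.h>

#define INIT_RTO_MS_SKETCH      1000u   /* new default: 1 second */
#define FALLBACK_RTO_MS_SKETCH  3000u   /* old default: 3 seconds */

/* Initial RTO: fall back to 3 s only if the SYN or SYN-ACK was
 * retransmitted AND the TCP timestamp option is not in use. */
static unsigned int initial_rto_ms(unsigned int syn_retransmits,
                                   bool timestamps_on)
{
    if (syn_retransmits > 0 && !timestamps_on)
        return FALLBACK_RTO_MS_SKETCH;
    return INIT_RTO_MS_SKETCH;
}

/* cwnd at the start of data transmission: reduced to 1 only if there
 * was MORE THAN ONE retransmission during the three-way handshake. */
static unsigned int cwnd_after_handshake(unsigned int initial_cwnd,
                                         unsigned int syn_retransmits)
{
    return syn_retransmits > 1 ? 1u : initial_cwnd;
}
```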
     

24 May, 2011

1 commit

  • The %pK format specifier is designed to hide exposed kernel pointers,
    specifically via /proc interfaces. Exposing these pointers provides an
    easy target for kernel write vulnerabilities, since they reveal the
    locations of writable structures containing easily triggerable function
    pointers. The behavior of %pK depends on the kptr_restrict sysctl.

    If kptr_restrict is set to 0, no deviation from the standard %p behavior
    occurs. If kptr_restrict is set to 1, the default, if the current user
    (intended to be a reader via seq_printf(), etc.) does not have CAP_SYSLOG
    (currently in the LSM tree), kernel pointers using %pK are printed as 0's.
    If kptr_restrict is set to 2, kernel pointers using %pK are printed as
    0's regardless of privileges. Replacing with 0's was chosen over the
    default "(null)", which cannot be parsed by userland %p, which expects
    "(nil)".

    The supporting code for kptr_restrict and %pK is currently in the -mm
    tree. This patch converts users of %p in net/ to %pK. Cases of printing
    pointers to the syslog are not covered, since this would eliminate useful
    information for postmortem debugging and the reading of the syslog is
    already optionally protected by the dmesg_restrict sysctl.

    Signed-off-by: Dan Rosenberg
    Cc: James Morris
    Cc: Eric Dumazet
    Cc: Thomas Graf
    Cc: Eugene Teo
    Cc: Kees Cook
    Cc: Ingo Molnar
    Cc: David S. Miller
    Cc: Peter Zijlstra
    Cc: Eric Paris
    Signed-off-by: Andrew Morton
    Signed-off-by: David S. Miller

    Dan Rosenberg
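
The three kptr_restrict cases described above reduce to a small decision function. The sketch below is a userspace model with hypothetical names (`pK_hides_pointer`, `format_pK_sketch`); it is not the kernel's vsnprintf() code, and the capability check is reduced to a boolean parameter.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Decide whether a %pK-style pointer must be hidden:
 *   0: behave like plain %p (never hide)
 *   1: hide unless the reader has CAP_SYSLOG
 *   2: always hide, regardless of privileges */
static bool pK_hides_pointer(int kptr_restrict, bool has_cap_syslog)
{
    if (kptr_restrict == 2)
        return true;
    if (kptr_restrict == 1)
        return !has_cap_syslog;
    return false;
}

/* Format the pointer as fixed-width hex; a hidden pointer is printed
 * as all zeros rather than "(null)" so that parsers still see a number. */
static int format_pK_sketch(char *buf, size_t len, const void *ptr,
                            int kptr_restrict, bool has_cap_syslog)
{
    uintptr_t value = (uintptr_t)ptr;

    if (pK_hides_pointer(kptr_restrict, has_cap_syslog))
        value = 0;
    return snprintf(buf, len, "%0*llx", (int)(2 * sizeof(void *)),
                    (unsigned long long)value);
}
```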
     

19 May, 2011

3 commits


11 May, 2011

1 commit


09 May, 2011

3 commits


29 Apr, 2011

3 commits

  • Now that output route lookups update the flow with
    destination address selection, we can fetch it from
    fl4->daddr instead of rt->rt_dst

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Now that output route lookups update the flow with
    source address selection, we can fetch it from
    fl4->saddr instead of rt->rt_src

    Signed-off-by: David S. Miller

    David S. Miller
     
  • We lack proper synchronization to manipulate inet->opt ip_options

    Problem is ip_make_skb() calls ip_setup_cork() and
    ip_setup_cork() possibly makes a copy of ipc->opt (struct ip_options),
    without any protection against another thread manipulating inet->opt.

    Another thread can change inet->opt pointer and free old one under us.

    Use RCU to protect inet->opt (changed to inet->inet_opt).

    Instead of handling atomic refcounts, just copy ip_options when
    necessary, to avoid cache line dirtying.

    We can't insert an rcu_head in struct ip_options since it's included in
    skb->cb[], so this patch is large because I had to introduce a new
    ip_options_rcu structure.

    Signed-off-by: Eric Dumazet
    Cc: Herbert Xu
    Signed-off-by: David S. Miller

    Eric Dumazet
     

28 Apr, 2011

1 commit

  • These functions are used together as a unit for route resolution
    during connect(). They address the chicken-and-egg problem that
    exists when ports need to be allocated during connect() processing,
    yet such port allocations require addressing information from the
    routing code.

    It's currently more heavy handed than it needs to be, and in
    particular we allocate and initialize a flow object twice.

    Let the callers provide the on-stack flow object. That way we only
    need to initialize it once in the ip_route_connect() call.

    Later, if ip_route_newports() needs to do anything, it re-uses that
    flow object as-is except for the ports which it updates before the
    route re-lookup.

    Also, describe why this set of facilities is needed and how it works
    in a big comment.

    Signed-off-by: David S. Miller
    Reviewed-by: Eric Dumazet

    David S. Miller
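
The caller-provided flow object can be modeled as follows. Everything here is a hypothetical simplification (`flow4_sketch`, `route_connect_sketch`, `route_newports_sketch`, and a counter in place of a real route lookup); it only illustrates that the flow is initialized once and later reused with just the ports updated.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, minimal flow key kept on the caller's stack. */
struct flow4_sketch {
    uint32_t daddr;
    uint32_t saddr;
    uint16_t dport;
    uint16_t sport;
};

static int lookups_sketch;  /* counts simulated route lookups */

static void route_lookup_sketch(const struct flow4_sketch *fl)
{
    (void)fl;               /* a real lookup would consult the routing table */
    lookups_sketch++;
}

/* Step 1: initialize the caller's flow exactly once and resolve a route. */
static void route_connect_sketch(struct flow4_sketch *fl,
                                 uint32_t daddr, uint32_t saddr,
                                 uint16_t dport, uint16_t sport)
{
    fl->daddr = daddr;
    fl->saddr = saddr;
    fl->dport = dport;
    fl->sport = sport;
    route_lookup_sketch(fl);
}

/* Step 2: after port allocation, reuse the same flow as-is except for
 * the ports, and repeat the lookup only if a port actually changed.
 * Returns true when a re-lookup was performed. */
static bool route_newports_sketch(struct flow4_sketch *fl,
                                  uint16_t sport, uint16_t dport)
{
    if (fl->sport == sport && fl->dport == dport)
        return false;
    fl->sport = sport;
    fl->dport = dport;
    route_lookup_sketch(fl);
    return true;
}
```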
     

23 Apr, 2011

1 commit


03 Mar, 2011

1 commit


02 Mar, 2011

1 commit


25 Feb, 2011

1 commit

  • ip_route_newports() is the only place in the entire kernel that
    cares about the port members in the routing cache entry's lookup
    flow key.

    Therefore the only reason we store an entire flow inside of the
    struct rtentry is for this one special case.

    Rewrite ip_route_newports() such that:

    1) The caller passes in the original port values, so we don't need
    to use the rth->fl.fl_ip_{s,d}port values to remember them.

    2) The lookup flow is constructed by hand instead of being copied
    from the routing cache entry's flow.

    Signed-off-by: David S. Miller

    David S. Miller
     

21 Feb, 2011

1 commit


11 Feb, 2011

1 commit


25 Jan, 2011

1 commit

  • commit a8b690f98baf9fb19 (tcp: Fix slowness in read /proc/net/tcp)
    introduced a bug in handling of SYN_RECV sockets.

    st->offset represents number of sockets found since beginning of
    listening_hash[st->bucket].

    We should not reset st->offset when iterating through
    syn_table[st->sbucket]; otherwise, if more than ~25 sockets (if
    PAGE_SIZE=4096) are in SYN_RECV state, we exit from listening_get_next()
    with an st->offset that is too small.

    The next time we enter tcp_seek_last_pos(), we are not able to seek past
    the sockets already found.

    Reported-by: PK
    CC: Tom Herbert
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

27 Dec, 2010

1 commit