23 Apr, 2012

1 commit

  • Commit f5fff5d forgot to fix TCP_MAXSEG behavior for IPv6 sockets, so IPv6
    TCP server sockets that used TCP_MAXSEG would find that the advmss of
    child sockets would be incorrect. This commit mirrors the advmss logic
    from tcp_v4_syn_recv_sock in tcp_v6_syn_recv_sock. Eventually this
    logic should probably be shared between IPv4 and IPv6, but this at
    least fixes this issue.

    Signed-off-by: Neal Cardwell
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Neal Cardwell
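
The advmss rule this commit mirrors can be sketched in user space as follows. This is only an illustration, not the kernel code: `child_advmss`, `route_advmss`, and `user_mss` are hypothetical names, and the route-derived MSS is assumed to be non-zero.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: choose the MSS that a child socket advertises.
 * route_advmss: MSS derived from the route (assumed non-zero).
 * user_mss:     value set with TCP_MAXSEG on the listener, 0 if unset. */
uint16_t child_advmss(uint16_t route_advmss, uint16_t user_mss)
{
    assert(route_advmss != 0);

    /* A user-supplied MSS can only lower the advertised value. */
    if (user_mss != 0 && user_mss < route_advmss)
        return user_mss;
    return route_advmss;
}
```

Under these assumptions, a listener with TCP_MAXSEG set to 1200 on a route that allows 1440 would yield children advertising 1200, while an unset or larger value leaves the route-derived MSS unchanged.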
     

11 Apr, 2012

1 commit

  • Pull dmaengine fixes from Dan Williams:

    1/ regression fix for Xen as it now trips over a broken assumption
    about the dma address size on 32-bit builds

    2/ new quirk for netdma to ignore dma channels that cannot meet
    netdma alignment requirements

    3/ fixes for two long standing issues in ioatdma (ring size overflow)
    and iop-adma (potential stack corruption)

    * tag 'dmaengine-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/dmaengine:
    netdma: adding alignment check for NETDMA ops
    ioatdma: DMA copy alignment needed to address IOAT DMA silicon errata
    ioat: ring size variables need to be 32bit to avoid overflow
    iop-adma: Corrected array overflow in RAID6 Xscale(R) test.
    ioat: fix size of 'completion' for Xen

    Linus Torvalds
     

06 Apr, 2012

1 commit


13 Feb, 2012

1 commit

  • Currently, it is not easily possible to get TOS/DSCP value of packets from
    an incoming TCP stream. The mechanism is there, IP_PKTOPTIONS getsockopt
    with IP_RECVTOS set, the same way as incoming TTL can be queried. This is
    not actually implemented for TOS, though.

    This patch adds this functionality, both for IPv4 (IP_PKTOPTIONS) and IPv6
    (IPV6_2292PKTOPTIONS). For IPv4, like in the IP_RECVTTL case, the value of
    the TOS field is stored from the other party's ACK.

    This is needed for proxies which require DSCP transparency. One such example
    is at http://zph.bratcheda.org/.

    Signed-off-by: Jiri Benc
    Signed-off-by: David S. Miller

    Jiri Benc
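
For readers unfamiliar with the layout of the TOS byte, here is a small sketch of how a proxy might split the value it reads into DSCP and ECN parts and rebuild it for the outgoing side. The helper names are hypothetical; the bit layout (upper six bits DSCP, lower two bits ECN) is the standard DiffServ one.

```c
#include <assert.h>
#include <stdint.h>

/* The upper six bits of the TOS / traffic class byte carry the DSCP. */
uint8_t tos_to_dscp(uint8_t tos)
{
    return tos >> 2;
}

/* The lower two bits carry ECN. */
uint8_t tos_to_ecn(uint8_t tos)
{
    return tos & 0x3;
}

/* Rebuild a TOS byte from its two parts. */
uint8_t dscp_to_tos(uint8_t dscp, uint8_t ecn)
{
    assert(dscp < 64 && ecn < 4);
    return (uint8_t)((dscp << 2) | ecn);
}
```

A DSCP-transparent proxy would typically copy only the DSCP part to the other leg and leave ECN handling to the stack.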
     

02 Feb, 2012

1 commit

  • The TCP RST mechanism is broken for TCP md5 (RFC2385). When a
    connection is gone, its md5 key is lost, and an RST sent
    without an md5 hash is doomed to be ignored by the peer. This can
    be a problem, since RSTs help protocols like bgp recover
    quickly from a peer crash.

    In most cases, users of tcp md5, such as bgp and ldp,
    have a listener on both sides to accept connections from the peer.
    The md5 keys for peers are saved in the listening socket.

    There are two cases for finding the md5 key when a connection is
    lost:
    1. Passive side receives RST: The message is sent to a well-known port,
    so tcp will associate it with the listener. The md5 key is taken from
    the listener.

    2. Active side receives RST (no sock): The message is sent to the active
    side, and there is no socket associated with the message. In this
    case, find the listener from the source port, then find the md5 key from
    the listener.

    We are not losing security here:
    the packet is checked against the md5 hash. No RST is generated
    if the md5 hash doesn't match or no md5 key can be found.

    Signed-off-by: Shawn Lu
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Shawn Lu
     

01 Feb, 2012

3 commits

  • This patch makes sure we use appropriate memory barriers before
    publishing tp->md5sig_info, allowing tcp_md5_do_lookup() to be used from
    tcp_v4_send_reset() without holding the socket lock (upcoming patch from
    Shawn Lu).

    Note we also need to respect an rcu grace period before freeing it, since
    we can free the socket without this grace period thanks to
    SLAB_DESTROY_BY_RCU.

    Signed-off-by: Eric Dumazet
    Cc: Shawn Lu
    Signed-off-by: David S. Miller

    Eric Dumazet
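
The publish-then-lookup ordering described here can be illustrated in user space with C11 atomics. This is a rough analogue under stated assumptions, not the kernel implementation: in the kernel the usual primitives are rcu_assign_pointer() and rcu_dereference(), `struct md5_info` and the function names below are hypothetical stand-ins, and the grace period before freeing is not modelled at all.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-socket MD5 key container. */
struct md5_info {
    int key_count;
};

/* Pointer that lockless readers may load at any time. */
static _Atomic(struct md5_info *) published_info;

/* Writer: fully initialise the object, then publish it with release
 * semantics so the initialisation is visible before the pointer is. */
void publish_info(struct md5_info *info, int key_count)
{
    assert(info != NULL);
    info->key_count = key_count;
    atomic_store_explicit(&published_info, info, memory_order_release);
}

/* Reader: an acquire load pairs with the release store above, so a
 * non-NULL result always points at initialised data. */
const struct md5_info *lookup_info(void)
{
    return atomic_load_explicit(&published_info, memory_order_acquire);
}
```

A reader that obtains a non-NULL pointer from `lookup_info()` can read `key_count` without taking any lock; safely freeing the object would additionally require waiting for all such readers, which is the role of the grace period mentioned in the commit message.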
     
  • In order to be able to support proper RST messages for TCP MD5 flows, we
    need to allow access to MD5 keys without locking listener socket.

    This conversion is a nice cleanup, and shrinks size of timewait sockets
    by 80 bytes.

    IPv6 code reuses generic code found in IPv4 instead of duplicating it.

    Control path uses GFP_KERNEL allocations instead of GFP_ATOMIC.

    Signed-off-by: Eric Dumazet
    Cc: Shawn Lu
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • We no longer use md5_add() method from struct tcp_sock_af_ops

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

23 Jan, 2012

1 commit


13 Dec, 2011

3 commits

  • This patch allows each namespace to independently set up
    its levels for tcp memory pressure thresholds. This patch
    alone does not buy much: we need to make these values
    per group of processes somehow. This is achieved in the
    patches that follow in this patchset.

    Signed-off-by: Glauber Costa
    Reviewed-by: KAMEZAWA Hiroyuki
    CC: David S. Miller
    CC: Eric W. Biederman
    Signed-off-by: David S. Miller

    Glauber Costa
     
  • This patch introduces memory pressure controls for the tcp
    protocol. It uses the generic socket memory pressure code
    introduced in earlier patches, and fills in the
    necessary data in cg_proto struct.

    Signed-off-by: Glauber Costa
    Reviewed-by: KAMEZAWA Hiroyuki
    CC: Eric W. Biederman
    Signed-off-by: David S. Miller

    Glauber Costa
     
  • This patch replaces all uses of struct sock fields' memory_pressure,
    memory_allocated, sockets_allocated, and sysctl_mem with accessor
    macros. Those macros can either receive a socket argument, or a mem_cgroup
    argument, depending on the context they live in.

    Since we're only doing a macro wrapping here, no performance impact at all is
    expected in the case where we don't have cgroups disabled.

    Signed-off-by: Glauber Costa
    Reviewed-by: Hiroyuki Kamezawa
    CC: David S. Miller
    CC: Eric W. Biederman
    CC: Eric Dumazet
    Signed-off-by: David S. Miller

    Glauber Costa
     

27 Nov, 2011

1 commit


24 Nov, 2011

1 commit

  • Since linux 2.6.26 (commit c6aefafb7ec6 : Add IPv6 support to TCP SYN
    cookies), we can drop a SYN packet reusing a TIME_WAIT socket.

    (As a matter of fact we fail to send the SYNACK answer)

    As the client resends its SYN packet after a one second timeout, we
    accept it, because first packet removed the TIME_WAIT socket before
    being dropped.

    This probably explains why nobody ever noticed or complained.

    Reported-by: Jesse Young
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

23 Nov, 2011

1 commit


02 Nov, 2011

1 commit

  • The tcp and udp code creates a set of struct file_operations at runtime,
    while this can also be done at compile time, with the added benefit of then
    having these file operations be const.

    The trickiest part was to get the "THIS_MODULE" reference right; the naive
    method of declaring a struct in the place of registration would not work
    for this reason.

    Signed-off-by: Arjan van de Ven
    Signed-off-by: David S. Miller

    Arjan van de Ven
     

27 Oct, 2011

1 commit

  • commit 66b13d99d96a (ipv4: tcp: fix TOS value in ACK messages sent from
    TIME_WAIT) fixed IPv4 only.

    This part is for the IPv6 side, adding a tclass param to ip6_xmit()

    We alias tw_tclass and tw_tos, if socket family is INET6.

    [ if the socket is ipv4-mapped, only the IP_TOS socket option is used to
    fill the TOS field; TCLASS is not taken into account ]

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

24 Oct, 2011

1 commit


21 Oct, 2011

1 commit

  • Adding const qualifiers to pointers can ease code review, and spot some
    bugs. It might allow compiler to optimize code further.

    For example, is it legal to temporarily write a null cksum into tcphdr
    in tcp_md5_hash_header() ? I am afraid a sniffer could catch the
    temporary null value...

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

08 Oct, 2011

1 commit


05 Oct, 2011

1 commit

  • tcp_v4_clear_md5_list() assumes that multiple tcp md5sig peers
    only hold one reference to md5sig_pool, but tcp_v4_md5_do_add()
    increases the use count of md5sig_pool for each peer. This patch
    makes tcp_v4_md5_do_add() increase the use count only for the first
    tcp md5sig peer.

    Signed-off-by: Zheng Yan
    Signed-off-by: David S. Miller

    Yan, Zheng
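
The reference-counting imbalance and its fix can be sketched as follows. All names here (`struct md5_peers`, `peer_add`, `peers_clear`, `pool_users`) are hypothetical stand-ins chosen for illustration; only the rule "take one pool reference for the first peer, drop one on clear" comes from the commit message.

```c
#include <assert.h>

/* Hypothetical per-socket list of md5sig peers (only the count matters). */
struct md5_peers {
    int npeers;
};

/* Stand-in for the shared md5sig_pool use count. */
static int pool_users;

/* Add a peer; only the first peer on a socket takes a pool reference. */
void peer_add(struct md5_peers *p)
{
    if (p->npeers == 0)
        pool_users++;
    p->npeers++;
}

/* Clear all peers; drops the single reference taken by the first add. */
void peers_clear(struct md5_peers *p)
{
    if (p->npeers > 0) {
        pool_users--;
        p->npeers = 0;
    }
    assert(pool_users >= 0);
}

int pool_user_count(void)
{
    return pool_users;
}
```

With the pre-fix behaviour (one increment per peer), three adds followed by one clear would leave two references behind; in this sketch the count returns to zero.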
     

29 Sep, 2011

1 commit


27 Sep, 2011

1 commit

  • struct tcp_skb_cb contains a "flags" field containing either tcp flags
    or IP dsfield depending on context (input or output path)

    Introduce ip_dsfield to make the difference clear and ease maintenance.
    If later we want to save space, we can union flags/ip_dsfield

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

22 Sep, 2011

1 commit

  • Conflicts:
    MAINTAINERS
    drivers/net/Kconfig
    drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c
    drivers/net/ethernet/broadcom/tg3.c
    drivers/net/wireless/iwlwifi/iwl-pci.c
    drivers/net/wireless/iwlwifi/iwl-trans-tx-pcie.c
    drivers/net/wireless/rt2x00/rt2800usb.c
    drivers/net/wireless/wl12xx/main.c

    David S. Miller
     

16 Sep, 2011

1 commit

  • "Possible SYN flooding on port xxxx " messages can fill logs on servers.

    Change logic to log the message only once per listener, and add two new
    SNMP counters to track :

    TCPReqQFullDoCookies : number of times a SYNCOOKIE was replied to client

    TCPReqQFullDrop : number of times a SYN request was dropped because
    syncookies were not enabled.

    Based on a prior patch from Tom Herbert, and suggestions from David.

    Signed-off-by: Eric Dumazet
    CC: Tom Herbert
    Signed-off-by: David S. Miller

    Eric Dumazet
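
A minimal sketch of the "warn once per listener, count every event" logic might look like this. The struct layouts, field names, and the function `on_request_queue_full` are assumptions made for illustration and do not reflect the kernel's data structures; the two counters correspond to TCPReqQFullDoCookies and TCPReqQFullDrop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical listener state: remembers whether it has warned already. */
struct listener {
    unsigned short port;
    bool synflood_warned;
};

/* Hypothetical counters mirroring the two new SNMP counters. */
struct counters {
    unsigned long req_q_full_do_cookies;
    unsigned long req_q_full_drop;
};

/* Called when the request queue is full. Returns true if a SYN cookie
 * should be sent, false if the SYN is dropped. */
bool on_request_queue_full(struct listener *l, struct counters *c,
                           bool syncookies_enabled)
{
    assert(l != NULL && c != NULL);

    /* Every event is counted, regardless of logging. */
    if (syncookies_enabled)
        c->req_q_full_do_cookies++;
    else
        c->req_q_full_drop++;

    /* The log message is emitted only once per listener. */
    if (!l->synflood_warned) {
        l->synflood_warned = true;
        fprintf(stderr, "Possible SYN flooding on port %u. %s.\n",
                (unsigned int)l->port,
                syncookies_enabled ? "Sending cookies" : "Dropping request");
    }
    return syncookies_enabled;
}
```

The counters keep the full picture available to monitoring tools even though the log stays quiet after the first message.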
     

18 Aug, 2011

1 commit

  • The l4_rxhash flag was added to the skb structure to indicate
    that the rxhash value was computed over the 4 tuple for the
    packet which includes the port information in the encapsulated
    transport packet. This is used by the stack to preserve the
    rxhash value in __skb_rx_tunnel.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     

07 Aug, 2011

1 commit

  • Computers have become a lot faster since we compromised on the
    partial MD4 hash which we use currently for performance reasons.

    MD5 is a much safer choice, and is in line with both RFC1948 and
    other ISS generators (OpenBSD, Solaris, etc.)

    Furthermore, only having 24-bits of the sequence number be truly
    unpredictable is a very serious limitation. So the periodic
    regeneration and 8-bit counter have been removed. We compute and
    use a full 32-bit sequence number.

    For ipv6, DCCP was found to use a 32-bit truncated initial sequence
    number (it needs 43-bits) and that is fixed here as well.

    Reported-by: Dan Kaminsky
    Tested-by: Willy Tarreau
    Signed-off-by: David S. Miller

    David S. Miller
     

21 Jun, 2011

1 commit


18 Jun, 2011

1 commit

  • On Thursday, 16 June 2011 at 23:38 -0400, David Miller wrote:
    > From: Ben Hutchings
    > Date: Fri, 17 Jun 2011 00:50:46 +0100
    >
    > > On Wed, 2011-06-15 at 04:15 +0200, Eric Dumazet wrote:
    > >> @@ -1594,6 +1594,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
    > >> goto discard;
    > >>
    > >> if (nsk != sk) {
    > >> + sock_rps_save_rxhash(nsk, skb->rxhash);
    > >> if (tcp_child_process(sk, nsk, skb)) {
    > >> rsk = nsk;
    > >> goto reset;
    > >>
    > >
    > > I haven't tried this, but it looks reasonable to me.
    > >
    > > What about IPv6? The logic in tcp_v6_do_rcv() looks very similar.
    >
    > Indeed ipv6 side needs the same fix.
    >
    > Eric please add that part and resubmit. And in fact I might stick
    > this into net-2.6 instead of net-next-2.6
    >

    OK, here is the net-2.6 based one then, thanks !

    [PATCH v2] net: rfs: enable RFS before first data packet is received

    First packet received on a passive tcp flow is not correctly RFS
    steered.

    One sock_rps_record_flow() call is missing in inet_accept()

    But before that, we also must record rxhash when child socket is setup.

    Signed-off-by: Eric Dumazet
    CC: Tom Herbert
    CC: Ben Hutchings
    CC: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Eric Dumazet
     

09 Jun, 2011

1 commit

  • This patch lowers the default initRTO from 3secs to 1sec per
    RFC2988bis. It falls back to 3secs if the SYN or SYN-ACK packet
    has been retransmitted, AND the TCP timestamp option is not on.

    It also adds support to take RTT sample during 3WHS on the passive
    open side, just like its active open counterpart, and uses it, if
    valid, to seed the initRTO for the data transmission phase.

    The patch also resets ssthresh to its initial default at the
    beginning of the data transmission phase, and reduces cwnd to 1 if
    there has been MORE THAN ONE retransmission during 3WHS per RFC5681.

    Signed-off-by: H.K. Jerry Chu
    Signed-off-by: David S. Miller

    Jerry Chu
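
The two decisions described above can be written down as small pure functions. This sketch covers only the fallback and cwnd rules stated in the commit message; it ignores the RTT sample taken during the handshake, and the function names, constants, and millisecond units are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define INIT_RTO_MS     1000u  /* new default initRTO */
#define FALLBACK_RTO_MS 3000u  /* previous default, used as fallback */

/* RTO used when no valid RTT sample is available: fall back to 3 s
 * only if the SYN or SYN-ACK was retransmitted AND timestamps are off. */
unsigned int initial_rto_ms(unsigned int handshake_retransmits,
                            bool timestamps_on)
{
    if (handshake_retransmits > 0 && !timestamps_on)
        return FALLBACK_RTO_MS;
    return INIT_RTO_MS;
}

/* cwnd at the start of the data phase: reduced to 1 only after MORE
 * THAN ONE retransmission during the three-way handshake. */
unsigned int initial_cwnd(unsigned int handshake_retransmits,
                          unsigned int default_cwnd)
{
    assert(default_cwnd >= 1);
    return handshake_retransmits > 1 ? 1u : default_cwnd;
}
```

Note that a single retransmission changes the RTO fallback (when timestamps are off) but leaves cwnd alone; only a second retransmission reduces cwnd.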
     

24 May, 2011

1 commit

  • The %pK format specifier is designed to hide exposed kernel pointers,
    specifically via /proc interfaces. Exposing these pointers provides an
    easy target for kernel write vulnerabilities, since they reveal the
    locations of writable structures containing easily triggerable function
    pointers. The behavior of %pK depends on the kptr_restrict sysctl.

    If kptr_restrict is set to 0, no deviation from the standard %p behavior
    occurs. If kptr_restrict is set to 1, the default, if the current user
    (intended to be a reader via seq_printf(), etc.) does not have CAP_SYSLOG
    (currently in the LSM tree), kernel pointers using %pK are printed as 0's.
    If kptr_restrict is set to 2, kernel pointers using %pK are printed as
    0's regardless of privileges. Replacing with 0's was chosen over the
    default "(null)", which cannot be parsed by userland %p, which expects
    "(nil)".

    The supporting code for kptr_restrict and %pK is currently in the -mm
    tree. This patch converts users of %p in net/ to %pK. Cases of printing
    pointers to the syslog are not covered, since this would eliminate useful
    information for postmortem debugging and the reading of the syslog is
    already optionally protected by the dmesg_restrict sysctl.

    Signed-off-by: Dan Rosenberg
    Cc: James Morris
    Cc: Eric Dumazet
    Cc: Thomas Graf
    Cc: Eugene Teo
    Cc: Kees Cook
    Cc: Ingo Molnar
    Cc: David S. Miller
    Cc: Peter Zijlstra
    Cc: Eric Paris
    Signed-off-by: Andrew Morton
    Signed-off-by: David S. Miller

    Dan Rosenberg
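
The three kptr_restrict settings described above reduce to a small decision table. The function below is a hypothetical user-space restatement of that table, not the kernel's vsprintf code; `has_cap_syslog` stands for whether the reader holds CAP_SYSLOG.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true if a pointer printed with %pK should be shown as 0's. */
bool pk_hides_pointer(int kptr_restrict, bool has_cap_syslog)
{
    assert(kptr_restrict >= 0 && kptr_restrict <= 2);

    switch (kptr_restrict) {
    case 0:
        return false;            /* same as plain %p */
    case 1:
        return !has_cap_syslog;  /* hidden from readers without CAP_SYSLOG */
    default:
        return true;             /* 2: hidden regardless of privileges */
    }
}
```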
     

29 Apr, 2011

1 commit

  • We lack proper synchronization to manipulate inet->opt ip_options

    Problem is ip_make_skb() calls ip_setup_cork() and
    ip_setup_cork() possibly makes a copy of ipc->opt (struct ip_options),
    without any protection against another thread manipulating inet->opt.

    Another thread can change inet->opt pointer and free old one under us.

    Use RCU to protect inet->opt (changed to inet->inet_opt).

    Instead of handling atomic refcounts, just copy ip_options when
    necessary, to avoid cache line dirtying.

    We can't insert an rcu_head in struct ip_options since it's included in
    skb->cb[], so this patch is large because I had to introduce a new
    ip_options_rcu structure.

    Signed-off-by: Eric Dumazet
    Cc: Herbert Xu
    Signed-off-by: David S. Miller

    Eric Dumazet
     

23 Apr, 2011

1 commit


07 Apr, 2011

1 commit

  • properly record sk_rxhash in ipv6 sockets (v2)

    Noticed while working on another project that flows to sockets which I had open
    on a test system weren't getting steered properly when I had RFS enabled.
    Looking more closely I found that:

    1) The affected sockets were all ipv6
    2) They weren't getting steered because sk->sk_rxhash was never set from the
    incoming skbs on that socket.

    This was occurring because there are several points in the IPv4 tcp and udp code
    which save the rxhash value when a new connection is established. Those calls
    to sock_rps_save_rxhash were never added to the corresponding ipv6 code paths.
    This patch adds those calls. Tested by myself to properly enable RFS
    functionality on ipv6.

    Change notes:
    v2:
    Filtered UDP to only arm RFS on bound sockets (Eric Dumazet)

    Signed-off-by: Neil Horman
    Signed-off-by: David S. Miller

    Neil Horman
     

05 Apr, 2011

1 commit


13 Mar, 2011

4 commits


02 Mar, 2011

1 commit

  • Route lookups follow a general pattern in the ipv6 code wherein
    we first find the non-IPSEC route, potentially override the
    flow destination address due to ipv6 options settings, and then
    finally make an IPSEC search using either xfrm_lookup() or
    __xfrm_lookup().

    __xfrm_lookup() is used when we want to generate a blackhole route
    if the key manager needs to resolve the IPSEC rules (in this case
    -EREMOTE is returned and the original 'dst' is left unchanged).

    Otherwise plain xfrm_lookup() is used and when asynchronous IPSEC
    resolution is necessary, we simply fail the lookup completely.

    All of these cases are encapsulated into two routines,
    ip6_dst_lookup_flow and ip6_sk_dst_lookup_flow. The latter of which
    handles unconnected UDP datagram sockets.

    Signed-off-by: David S. Miller

    David S. Miller