19 Mar, 2012

1 commit


13 Mar, 2012

1 commit

  • Add #define pr_fmt(fmt) as appropriate.

    Add "IPv4: ", "TCP: ", and "IPsec: " to appropriate files.
    Standardize on "UDPLite: " for appropriate uses.
    Some prefixes were previously "UDPLITE: " and "UDP-Lite: ".

    Add KBUILD_MODNAME ": " to icmp and gre.
    Remove embedded prefixes as appropriate.

    Add missing "\n" to pr_info in gre.c.

    Signed-off-by: Joe Perches
    Signed-off-by: David S. Miller

    Joe Perches
     

12 Mar, 2012

2 commits

  • Use a more current kernel messaging style.

    Convert a printk block to print_hex_dump.
    Coalesce formats, align arguments.
    Use %s, __func__ instead of embedding function names.

    Some messages that were prefixed with _close are
    now prefixed with _fini. Some ah4 and esp messages
    are now not prefixed with "ip ".

    The intent of this patch is to later add something like
    #define pr_fmt(fmt) "IPv4: " fmt.
    to standardize the output messages.

    Text size is trivially reduced. (x86-32 allyesconfig)

    $ size net/ipv4/built-in.o*
    text data bss dec hex filename
    887888 31558 249696 1169142 11d6f6 net/ipv4/built-in.o.new
    887934 31558 249800 1169292 11d78c net/ipv4/built-in.o.old

    Signed-off-by: Joe Perches
    Signed-off-by: David S. Miller

    Joe Perches
     
  • commit ea4fc0d619 (ipv4: Don't use rt->rt_{src,dst} in ip_queue_xmit())
    added a serious regression on synflood handling.

    Simon Kirby discovered a successful connection was delayed by 20 seconds
    before being responsive.

    In my tests, I discovered that xmit frames were lost, and needed ~4
    retransmits and a socket dst rebuild before being really sent.

    In case of syncookie initiated connection, we use a different path to
    initialize the socket dst, and inet->cork.fl.u.ip4 is left cleared.

    As ip_queue_xmit() now depends on inet flow being setup, fix this by
    copying the temp flowi4 we use in cookie_v4_check().

    Reported-by: Simon Kirby
    Bisected-by: Simon Kirby
    Signed-off-by: Eric Dumazet
    Tested-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

08 Mar, 2012

1 commit

  • commit a8afca0329 (tcp: md5: protects md5sig_info with RCU) added a
    lockdep splat in tcp_md5_do_lookup() in case a timer fires a tcp
    retransmit.

    At this point, socket lock is owned by the sofirq handler, not the user,
    so we should adjust a bit the lockdep condition, as we dont hold
    rcu_read_lock().

    Signed-off-by: Eric Dumazet
    Reported-by: Valdis Kletnieks
    Signed-off-by: David S. Miller

    Eric Dumazet
     

13 Feb, 2012

1 commit

  • Currently, it is not easily possible to get TOS/DSCP value of packets from
    an incoming TCP stream. The mechanism is there, IP_PKTOPTIONS getsockopt
    with IP_RECVTOS set, the same way as incoming TTL can be queried. This is
    not actually implemented for TOS, though.

    This patch adds this functionality, both for IPv4 (IP_PKTOPTIONS) and IPv6
    (IPV6_2292PKTOPTIONS). For IPv4, like in the IP_RECVTTL case, the value of
    the TOS field is stored from the other party's ACK.

    This is needed for proxies which require DSCP transparency. One such example
    is at http://zph.bratcheda.org/.

    Signed-off-by: Jiri Benc
    Signed-off-by: David S. Miller

    Jiri Benc
     

11 Feb, 2012

1 commit


05 Feb, 2012

1 commit

  • Binding RST packet outgoing interface to incoming interface
    for tcp v4 when there is no socket associate with it.
    when sk is not NULL, using sk->sk_bound_dev_if instead.
    (suggested by Eric Dumazet).

    This has few benefits:
    1. tcp_v6_send_reset already did that.
    2. This helps tcp connect with SO_BINDTODEVICE set. When
    connection is lost, we still able to sending out RST using
    same interface.
    3. we are sending reply, it is most likely to be succeed
    if iif is used

    Signed-off-by: Shawn Lu
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Shawn Lu
     

02 Feb, 2012

1 commit

  • TCP RST mechanism is broken in TCP md5(RFC2385). When
    connection is gone, md5 key is lost, sending RST
    without md5 hash is deem to ignored by peer. This can
    be a problem since RST help protocal like bgp to fast
    recove from peer crash.

    In most case, users of tcp md5, such as bgp and ldp,
    have listener on both sides to accept connection from peer.
    md5 keys for peers are saved in listening socket.

    There are two cases in finding md5 key when connection is
    lost:
    1.Passive receive RST: The message is send to well known port,
    tcp will associate it with listner. md5 key is gotten from
    listener.

    2.Active receive RST (no sock): The message is send to ative
    side, there is no socket associated with the message. In this
    case, finding listener from source port, then find md5 key from
    listener.

    we are not loosing sercuriy here:
    packet is checked with md5 hash. No RST is generated
    if md5 hash doesn't match or no md5 key can be found.

    Signed-off-by: Shawn Lu
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Shawn Lu
     

01 Feb, 2012

4 commits

  • This patch makes sure we use appropriate memory barriers before
    publishing tp->md5sig_info, allowing tcp_md5_do_lookup() being used from
    tcp_v4_send_reset() without holding socket lock (upcoming patch from
    Shawn Lu)

    Note we also need to respect rcu grace period before its freeing, since
    we can free socket without this grace period thanks to
    SLAB_DESTROY_BY_RCU

    Signed-off-by: Eric Dumazet
    Cc: Shawn Lu
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • There is no limit on number of MD5 keys an application can attach to a
    tcp socket.

    This patch adds a per tcp socket limit based
    on /proc/sys/net/core/optmem_max

    With current default optmem_max values, this allows about 150 keys on
    64bit arches, and 88 keys on 32bit arches.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • In order to be able to support proper RST messages for TCP MD5 flows, we
    need to allow access to MD5 keys without locking listener socket.

    This conversion is a nice cleanup, and shrinks size of timewait sockets
    by 80 bytes.

    IPv6 code reuses generic code found in IPv4 instead of duplicating it.

    Control path uses GFP_KERNEL allocations instead of GFP_ATOMIC.

    Signed-off-by: Eric Dumazet
    Cc: Shawn Lu
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • We no longer use md5_add() method from struct tcp_sock_af_ops

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

23 Jan, 2012

1 commit


13 Dec, 2011

3 commits

  • This patch allows each namespace to independently set up
    its levels for tcp memory pressure thresholds. This patch
    alone does not buy much: we need to make this values
    per group of process somehow. This is achieved in the
    patches that follows in this patchset.

    Signed-off-by: Glauber Costa
    Reviewed-by: KAMEZAWA Hiroyuki
    CC: David S. Miller
    CC: Eric W. Biederman
    Signed-off-by: David S. Miller

    Glauber Costa
     
  • This patch introduces memory pressure controls for the tcp
    protocol. It uses the generic socket memory pressure code
    introduced in earlier patches, and fills in the
    necessary data in cg_proto struct.

    Signed-off-by: Glauber Costa
    Reviewed-by: KAMEZAWA Hiroyuki
    CC: Eric W. Biederman
    Signed-off-by: David S. Miller

    Glauber Costa
     
  • This patch replaces all uses of struct sock fields' memory_pressure,
    memory_allocated, sockets_allocated, and sysctl_mem to acessor
    macros. Those macros can either receive a socket argument, or a mem_cgroup
    argument, depending on the context they live in.

    Since we're only doing a macro wrapping here, no performance impact at all is
    expected in the case where we don't have cgroups disabled.

    Signed-off-by: Glauber Costa
    Reviewed-by: Hiroyouki Kamezawa
    CC: David S. Miller
    CC: Eric W. Biederman
    CC: Eric Dumazet
    Signed-off-by: David S. Miller

    Glauber Costa
     

01 Dec, 2011

1 commit

  • Rick Jones reported that TCP_CONGESTION sockopt performed on a listener
    was ignored for its children sockets : right after accept() the
    congestion control for new socket is the system default one.

    This seems an oversight of the initial design (quoted from Stephen)

    Based on prior investigation and patch from Rick.

    Reported-by: Rick Jones
    Signed-off-by: Eric Dumazet
    CC: Stephen Hemminger
    CC: Yuchung Cheng
    Tested-by: Rick Jones
    Signed-off-by: David S. Miller

    Eric Dumazet
     

17 Nov, 2011

1 commit

  • Simon Kirby reported divides by zero errors in __tcp_select_window()

    This happens when inet_csk_route_child_sock() returns a NULL pointer :

    We free new socket while we eventually armed keepalive timer in
    tcp_create_openreq_child()

    Fix this by a call to tcp_clear_xmit_timers()

    [ This is a followup to commit 918eb39962dff (net: add missing
    bh_unlock_sock() calls) ]

    Reported-by: Simon Kirby
    Signed-off-by: Eric Dumazet
    Tested-by: Simon Kirby
    Signed-off-by: David S. Miller

    Eric Dumazet
     

04 Nov, 2011

1 commit

  • Simon Kirby reported lockdep warnings and following messages :

    [104661.897577] huh, entered softirq 3 NET_RX ffffffff81613740
    preempt_count 00000101, exited with 00000102?

    [104661.923653] huh, entered softirq 3 NET_RX ffffffff81613740
    preempt_count 00000101, exited with 00000102?

    Problem comes from commit 0e734419
    (ipv4: Use inet_csk_route_child_sock() in DCCP and TCP.)

    If inet_csk_route_child_sock() returns NULL, we should release socket
    lock before freeing it.

    Another lock imbalance exists if __inet_inherit_port() returns an error
    since commit 093d282321da ( tproxy: fix hash locking issue when using
    port redirection in __inet_inherit_port()) a backport is also needed for
    >= 2.6.37 kernels.

    Reported-by: Simon Kirby
    Signed-off-by: Eric Dumazet
    Tested-by: Eric Dumazet
    CC: Balazs Scheidler
    CC: KOVACS Krisztian
    Reviewed-by: Thomas Gleixner
    Tested-by: Simon Kirby
    Signed-off-by: David S. Miller

    Eric Dumazet
     

02 Nov, 2011

1 commit

  • the tcp and udp code creates a set of struct file_operations at runtime
    while it can also be done at compile time, with the added benefit of then
    having these file operations be const.

    the trickiest part was to get the "THIS_MODULE" reference right; the naive
    method of declaring a struct in the place of registration would not work
    for this reason.

    Signed-off-by: Arjan van de Ven
    Signed-off-by: David S. Miller

    Arjan van de Ven
     

24 Oct, 2011

2 commits

  • There is a long standing bug in linux tcp stack, about ACK messages sent
    on behalf of TIME_WAIT sockets.

    In the IP header of the ACK message, we choose to reflect TOS field of
    incoming message, and this might break some setups.

    Example of things that were broken :
    - Routing using TOS as a selector
    - Firewalls
    - Trafic classification / shaping

    We now remember in timewait structure the inet tos field and use it in
    ACK generation, and route lookup.

    Notes :
    - We still reflect incoming TOS in RST messages.
    - We could extend MuraliRaja Muniraju patch to report TOS value in
    netlink messages for TIME_WAIT sockets.
    - A patch is needed for IPv6

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Now tcp_md5_hash_header() has a const tcphdr argument, we can add more
    const attributes to callers.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 Oct, 2011

1 commit

  • Adding const qualifiers to pointers can ease code review, and spot some
    bugs. It might allow compiler to optimize code further.

    For example, is it legal to temporary write a null cksum into tcphdr
    in tcp_md5_hash_header() ? I am afraid a sniffer could catch the
    temporary null value...

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

08 Oct, 2011

1 commit


05 Oct, 2011

1 commit

  • tcp_v4_clear_md5_list() assumes that multiple tcp md5sig peers
    only hold one reference to md5sig_pool. but tcp_v4_md5_do_add()
    increases use count of md5sig_pool for each peer. This patch
    makes tcp_v4_md5_do_add() only increases use count for the first
    tcp md5sig peer.

    Signed-off-by: Zheng Yan
    Signed-off-by: David S. Miller

    Yan, Zheng
     

27 Sep, 2011

1 commit

  • struct tcp_skb_cb contains a "flags" field containing either tcp flags
    or IP dsfield depending on context (input or output path)

    Introduce ip_dsfield to make the difference clear and ease maintenance.
    If later we want to save space, we can union flags/ip_dsfield

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

22 Sep, 2011

1 commit

  • Conflicts:
    MAINTAINERS
    drivers/net/Kconfig
    drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c
    drivers/net/ethernet/broadcom/tg3.c
    drivers/net/wireless/iwlwifi/iwl-pci.c
    drivers/net/wireless/iwlwifi/iwl-trans-tx-pcie.c
    drivers/net/wireless/rt2x00/rt2800usb.c
    drivers/net/wireless/wl12xx/main.c

    David S. Miller
     

16 Sep, 2011

1 commit

  • "Possible SYN flooding on port xxxx " messages can fill logs on servers.

    Change logic to log the message only once per listener, and add two new
    SNMP counters to track :

    TCPReqQFullDoCookies : number of times a SYNCOOKIE was replied to client

    TCPReqQFullDrop : number of times a SYN request was dropped because
    syncookies were not enabled.

    Based on a prior patch from Tom Herbert, and suggestions from David.

    Signed-off-by: Eric Dumazet
    CC: Tom Herbert
    Signed-off-by: David S. Miller

    Eric Dumazet
     

18 Aug, 2011

1 commit

  • The l4_rxhash flag was added to the skb structure to indicate
    that the rxhash value was computed over the 4 tuple for the
    packet which includes the port information in the encapsulated
    transport packet. This is used by the stack to preserve the
    rxhash value in __skb_rx_tunnel.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     

07 Aug, 2011

1 commit

  • Computers have become a lot faster since we compromised on the
    partial MD4 hash which we use currently for performance reasons.

    MD5 is a much safer choice, and is inline with both RFC1948 and
    other ISS generators (OpenBSD, Solaris, etc.)

    Furthermore, only having 24-bits of the sequence number be truly
    unpredictable is a very serious limitation. So the periodic
    regeneration and 8-bit counter have been removed. We compute and
    use a full 32-bit sequence number.

    For ipv6, DCCP was found to use a 32-bit truncated initial sequence
    number (it needs 43-bits) and that is fixed here as well.

    Reported-by: Dan Kaminsky
    Tested-by: Willy Tarreau
    Signed-off-by: David S. Miller

    David S. Miller
     

21 Jun, 2011

1 commit


18 Jun, 2011

1 commit

  • Le jeudi 16 juin 2011 à 23:38 -0400, David Miller a écrit :
    > From: Ben Hutchings
    > Date: Fri, 17 Jun 2011 00:50:46 +0100
    >
    > > On Wed, 2011-06-15 at 04:15 +0200, Eric Dumazet wrote:
    > >> @@ -1594,6 +1594,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
    > >> goto discard;
    > >>
    > >> if (nsk != sk) {
    > >> + sock_rps_save_rxhash(nsk, skb->rxhash);
    > >> if (tcp_child_process(sk, nsk, skb)) {
    > >> rsk = nsk;
    > >> goto reset;
    > >>
    > >
    > > I haven't tried this, but it looks reasonable to me.
    > >
    > > What about IPv6? The logic in tcp_v6_do_rcv() looks very similar.
    >
    > Indeed ipv6 side needs the same fix.
    >
    > Eric please add that part and resubmit. And in fact I might stick
    > this into net-2.6 instead of net-next-2.6
    >

    OK, here is the net-2.6 based one then, thanks !

    [PATCH v2] net: rfs: enable RFS before first data packet is received

    First packet received on a passive tcp flow is not correctly RFS
    steered.

    One sock_rps_record_flow() call is missing in inet_accept()

    But before that, we also must record rxhash when child socket is setup.

    Signed-off-by: Eric Dumazet
    CC: Tom Herbert
    CC: Ben Hutchings
    CC: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Eric Dumazet
     

09 Jun, 2011

1 commit

  • This patch lowers the default initRTO from 3secs to 1sec per
    RFC2988bis. It falls back to 3secs if the SYN or SYN-ACK packet
    has been retransmitted, AND the TCP timestamp option is not on.

    It also adds support to take RTT sample during 3WHS on the passive
    open side, just like its active open counterpart, and uses it, if
    valid, to seed the initRTO for the data transmission phase.

    The patch also resets ssthresh to its initial default at the
    beginning of the data transmission phase, and reduces cwnd to 1 if
    there has been MORE THAN ONE retransmission during 3WHS per RFC5681.

    Signed-off-by: H.K. Jerry Chu
    Signed-off-by: David S. Miller

    Jerry Chu
     

24 May, 2011

1 commit

  • The %pK format specifier is designed to hide exposed kernel pointers,
    specifically via /proc interfaces. Exposing these pointers provides an
    easy target for kernel write vulnerabilities, since they reveal the
    locations of writable structures containing easily triggerable function
    pointers. The behavior of %pK depends on the kptr_restrict sysctl.

    If kptr_restrict is set to 0, no deviation from the standard %p behavior
    occurs. If kptr_restrict is set to 1, the default, if the current user
    (intended to be a reader via seq_printf(), etc.) does not have CAP_SYSLOG
    (currently in the LSM tree), kernel pointers using %pK are printed as 0's.
    If kptr_restrict is set to 2, kernel pointers using %pK are printed as
    0's regardless of privileges. Replacing with 0's was chosen over the
    default "(null)", which cannot be parsed by userland %p, which expects
    "(nil)".

    The supporting code for kptr_restrict and %pK are currently in the -mm
    tree. This patch converts users of %p in net/ to %pK. Cases of printing
    pointers to the syslog are not covered, since this would eliminate useful
    information for postmortem debugging and the reading of the syslog is
    already optionally protected by the dmesg_restrict sysctl.

    Signed-off-by: Dan Rosenberg
    Cc: James Morris
    Cc: Eric Dumazet
    Cc: Thomas Graf
    Cc: Eugene Teo
    Cc: Kees Cook
    Cc: Ingo Molnar
    Cc: David S. Miller
    Cc: Peter Zijlstra
    Cc: Eric Paris
    Signed-off-by: Andrew Morton
    Signed-off-by: David S. Miller

    Dan Rosenberg
     

19 May, 2011

3 commits


11 May, 2011

1 commit


09 May, 2011

1 commit