02 Jun, 2012

1 commit

  • Another problem during a SYNFLOOD/DDoS attack is the inetpeer cache
    getting larger and larger, using lots of memory and CPU time.

    tcp_v4_send_synack()
    ->inet_csk_route_req()
    ->ip_route_output_flow()
    ->rt_set_nexthop()
    ->rt_init_metrics()
    ->inet_getpeer( create = true)

    This is a side effect of commit a4daad6b09230 (net: Pre-COW metrics for
    TCP) added in 2.6.39

    Possible solution:

    Instruct inet_csk_route_req() to remove FLOWI_FLAG_PRECOW_METRICS.
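
    A minimal sketch of that idea (hypothetical helper, not the actual
    patch): mask the pre-COW metrics hint out of the flow flags used for
    the SYNACK route lookup, so rt_init_metrics() never has a reason to
    call inet_getpeer() with create = true for each (possibly spoofed)
    SYN source.

    #include <net/flow.h>
    #include <net/inet_sock.h>

    /* Hypothetical helper: flow flags for routing a SYNACK, with the
     * pre-COW metrics hint removed. */
    static inline __u8 synack_flowi_flags(const struct sock *sk)
    {
            return inet_sk_flowi_flags(sk) & ~FLOWI_FLAG_PRECOW_METRICS;
    }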

    Before patch :

    # grep peer /proc/slabinfo
    inet_peer_cache 4175430 4175430 192 42 2 : tunables 0 0 0 : slabdata 99415 99415 0

    Samples: 41K of event 'cycles', Event count (approx.): 30716565122
    + 20,24% ksoftirqd/0 [kernel.kallsyms] [k] inet_getpeer
    + 8,19% ksoftirqd/0 [kernel.kallsyms] [k] peer_avl_rebalance.isra.1
    + 4,81% ksoftirqd/0 [kernel.kallsyms] [k] sha_transform
    + 3,64% ksoftirqd/0 [kernel.kallsyms] [k] fib_table_lookup
    + 2,36% ksoftirqd/0 [ixgbe] [k] ixgbe_poll
    + 2,16% ksoftirqd/0 [kernel.kallsyms] [k] __ip_route_output_key
    + 2,11% ksoftirqd/0 [kernel.kallsyms] [k] kernel_map_pages
    + 2,11% ksoftirqd/0 [kernel.kallsyms] [k] ip_route_input_common
    + 2,01% ksoftirqd/0 [kernel.kallsyms] [k] __inet_lookup_established
    + 1,83% ksoftirqd/0 [kernel.kallsyms] [k] md5_transform
    + 1,75% ksoftirqd/0 [kernel.kallsyms] [k] check_leaf.isra.9
    + 1,49% ksoftirqd/0 [kernel.kallsyms] [k] ipt_do_table
    + 1,46% ksoftirqd/0 [kernel.kallsyms] [k] hrtimer_interrupt
    + 1,45% ksoftirqd/0 [kernel.kallsyms] [k] kmem_cache_alloc
    + 1,29% ksoftirqd/0 [kernel.kallsyms] [k] inet_csk_search_req
    + 1,29% ksoftirqd/0 [kernel.kallsyms] [k] __netif_receive_skb
    + 1,16% ksoftirqd/0 [kernel.kallsyms] [k] copy_user_generic_string
    + 1,15% ksoftirqd/0 [kernel.kallsyms] [k] kmem_cache_free
    + 1,02% ksoftirqd/0 [kernel.kallsyms] [k] tcp_make_synack
    + 0,93% ksoftirqd/0 [kernel.kallsyms] [k] _raw_spin_lock_bh
    + 0,87% ksoftirqd/0 [kernel.kallsyms] [k] __call_rcu
    + 0,84% ksoftirqd/0 [kernel.kallsyms] [k] rt_garbage_collect
    + 0,84% ksoftirqd/0 [kernel.kallsyms] [k] fib_rules_lookup

    Signed-off-by: Eric Dumazet
    Cc: Hans Schillstrom
    Cc: Jesper Dangaard Brouer
    Cc: Neal Cardwell
    Cc: Tom Herbert
    Signed-off-by: David S. Miller

    Eric Dumazet
     

22 Apr, 2012

1 commit


16 Apr, 2012

1 commit


15 Apr, 2012

3 commits

  • We must try harder to get unique (addr, port) pairs when doing port
    autoselection for sockets with the SO_REUSEADDR option set.

    We achieve this by adding a relaxation parameter to
    inet_csk_bind_conflict(). When the 'relax' parameter is off, we
    return a conflict whenever the currently searched (addr, port) pair
    is not unique.

    This tries to address the problems reported in patch:
    8d238b25b1ec22a73b1c2206f111df2faaff8285
    Revert "tcp: bind() fix when many ports are bound"
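
    A simplified sketch of the relaxation idea (stand-in types, not the
    kernel's inet_csk_bind_conflict()): with relax off, any socket
    already bound to the same (addr, port) counts as a conflict, so the
    autoselection loop keeps looking for a truly unique pair.

    #include <stdbool.h>

    /* Toy bucket entry standing in for a bound socket. */
    struct bound_sock {
            unsigned int addr;        /* bound address */
            bool reuse;               /* SO_REUSEADDR set */
            bool listening;
            struct bound_sock *next;
    };

    static bool bind_conflict(const struct bound_sock *bucket,
                              unsigned int addr, bool reuse, bool relax)
    {
            const struct bound_sock *other;

            for (other = bucket; other; other = other->next) {
                    if (other->addr != addr)
                            continue;
                    if (!relax)
                            return true;  /* demand a unique (addr, port) pair */
                    if (!reuse || !other->reuse || other->listening)
                            return true;  /* classic SO_REUSEADDR rules */
            }
            return false;
    }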

    Tests were run that create and bind (with port 0) many sockets
    on 100 IPs. The results are, on average:

    * 60000 sockets, 600 ports / IP:
        * 0.210 s, 620 (IP, port) duplicates without patch
        * 0.219 s, no duplicates with patch
    * 100000 sockets, 1000 ports / IP:
        * 0.371 s, 1720 duplicates without patch
        * 0.373 s, no duplicates with patch
    * 200000 sockets, 2000 ports / IP:
        * 0.766 s, 6900 duplicates without patch
        * 0.768 s, no duplicates with patch
    * 500000 sockets, 5000 ports / IP:
        * 2.227 s, 41500 duplicates without patch
        * 2.284 s, no duplicates with patch

    Signed-off-by: Alex Copot
    Signed-off-by: Daniel Baluta
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Alex Copot
     
  • There are two struct request_sock_ops providers, tcp and dccp.

    inet_csk_reqsk_queue_prune() can avoid testing syn_ack_timeout for
    NULL if both providers make it non-NULL.

    Signed-off-by: Eric Dumazet
    Cc: Gerrit Renker
    Cc: dccp@vger.kernel.org
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Updates some comments to track RFC6298

    Signed-off-by: Eric Dumazet
    Cc: H.K. Jerry Chu
    Cc: Tom Herbert
    Signed-off-by: David S. Miller

    Eric Dumazet
     

26 Jan, 2012

2 commits

  • Port autoselection finds a port and then drops the lock; right after
    that, it looks the hash bucket up again and locks it.

    Fix it to go there directly instead.

    Signed-off-by: Flavio Leitner
    Signed-off-by: Marcelo Ricardo Leitner
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Flavio Leitner
     
  • The current code checks for conflicts when the application
    requests a specific port. If there is no conflict, the request
    is granted.

    On the other hand, the port autoselection done by the kernel
    fails when all ports are bound, even when there is a port with
    no conflict available.

    The fix changes port autoselection to check for a conflict and
    use the port if there is none.
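
    A rough sketch of the change (hypothetical callbacks, not
    inet_csk_get_port()): instead of insisting on a port whose bucket is
    empty, autoselection can fall back to the first bound port whose
    existing users do not conflict with the new socket.

    #include <stdbool.h>

    static int pick_port(int low, int high,
                         bool (*bucket_empty)(int port),
                         bool (*bucket_conflicts)(int port))
    {
            int port, fallback = 0;

            for (port = low; port <= high; port++) {
                    if (bucket_empty(port))
                            return port;      /* preferred, as before */
                    if (!fallback && !bucket_conflicts(port))
                            fallback = port;  /* new: bound, but no conflict */
            }
            return fallback;                  /* 0: nothing usable at all */
    }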

    Signed-off-by: Flavio Leitner
    Signed-off-by: Marcelo Ricardo Leitner
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Flavio Leitner
     

12 Dec, 2011

1 commit


09 Nov, 2011

1 commit


24 May, 2011

1 commit

  • All static seqlocks should be initialized with the lockdep-friendly
    __SEQLOCK_UNLOCKED() macro.

    Remove the legacy SEQLOCK_UNLOCKED() macro.
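
    For reference, the lockdep-friendly form this leaves as the only way
    to declare a static seqlock:

    #include <linux/seqlock.h>

    /* example declaration; 'example_seqlock' is just an illustrative name */
    static seqlock_t example_seqlock = __SEQLOCK_UNLOCKED(example_seqlock);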

    Signed-off-by: Eric Dumazet
    Cc: David Miller
    Link: http://lkml.kernel.org/r/%3C1306238888.3026.31.camel%40edumazet-laptop%3E
    Signed-off-by: Thomas Gleixner

    Eric Dumazet
     

19 May, 2011

1 commit


09 May, 2011

1 commit

  • This is just like inet_csk_route_req() except that it operates after
    we've created the new child socket.

    In this way we can use the new socket's cork flow for proper route
    key storage.

    This will be used by DCCP and TCP child socket creation handling.

    Signed-off-by: David S. Miller

    David S. Miller
     

29 Apr, 2011

2 commits

  • Now that output route lookups update the flow with
    destination address selection, we can fetch it from
    fl4->daddr instead of rt->rt_dst

    Signed-off-by: David S. Miller

    David S. Miller
     
  • We lack proper synchronization when manipulating inet->opt ip_options.

    The problem is that ip_make_skb() calls ip_setup_cork(), and
    ip_setup_cork() possibly makes a copy of ipc->opt (struct ip_options)
    without any protection against another thread manipulating inet->opt.

    Another thread can change the inet->opt pointer and free the old one
    under us.

    Use RCU to protect inet->opt (changed to inet->inet_opt).

    Instead of handling atomic refcounts, just copy the ip_options when
    necessary, to avoid cache line dirtying.

    We can't insert an rcu_head in struct ip_options since it is included
    in skb->cb[], so this patch is large because I had to introduce a new
    ip_options_rcu structure.
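
    A compact sketch of the resulting pattern (toy structures, not the
    actual ip_options code): the shared payload cannot carry an rcu_head
    itself, so it is wrapped in a small RCU-managed container; readers
    take a private copy under rcu_read_lock() instead of a refcount.

    #include <linux/kernel.h>
    #include <linux/rcupdate.h>
    #include <linux/slab.h>
    #include <linux/types.h>

    struct opt_payload {                  /* stand-in for struct ip_options */
            int len;
            unsigned char data[40];
    };

    struct opt_rcu {                      /* stand-in for ip_options_rcu */
            struct rcu_head rcu;
            struct opt_payload opt;
    };

    static struct opt_rcu __rcu *current_opt;

    static void opt_free_rcu(struct rcu_head *head)
    {
            kfree(container_of(head, struct opt_rcu, rcu));
    }

    /* Writer: callers serialize updates (e.g. with the socket lock). */
    static void set_opt(struct opt_rcu *newopt)
    {
            struct opt_rcu *old;

            old = rcu_dereference_protected(current_opt, 1);
            rcu_assign_pointer(current_opt, newopt);
            if (old)
                    call_rcu(&old->rcu, opt_free_rcu);
    }

    /* Reader: copy the options instead of taking a reference. */
    static bool copy_opt(struct opt_payload *dst)
    {
            struct opt_rcu *cur;
            bool found = false;

            rcu_read_lock();
            cur = rcu_dereference(current_opt);
            if (cur) {
                    *dst = cur->opt;
                    found = true;
            }
            rcu_read_unlock();
            return found;
    }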

    Signed-off-by: Eric Dumazet
    Cc: Herbert Xu
    Signed-off-by: David S. Miller

    Eric Dumazet
     

19 Apr, 2011

1 commit


14 Apr, 2011

1 commit

  • This reverts commit c191a836a908d1dd6b40c503741f91b914de3348.

    It causes known regressions for programs that expect to be able to use
    SO_REUSEADDR to shut down a socket, then successfully rebind another
    socket to the same ID.

    Programs such as haproxy and amavisd expect this to work.

    This should fix kernel bugzilla 32832.

    Signed-off-by: David S. Miller

    David S. Miller
     

31 Mar, 2011

1 commit


13 Mar, 2011

4 commits


03 Mar, 2011

1 commit


02 Mar, 2011

2 commits


12 Jan, 2011

1 commit

  • inet_csk_bind_conflict() logic currently disallows a bind() if
    it finds a friend socket (a socket bound on the same address/port)
    satisfying a set of conditions:

    1) the current (to-be-bound) socket doesn't have sk_reuse set
    OR
    2) the other socket doesn't have sk_reuse set
    OR
    3) the other socket is in LISTEN state

    We should add the CLOSE state to condition 3), in order to avoid two
    REUSEADDR sockets in CLOSE state with the same local address/port,
    since this can deny further operations.
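
    The shape of the resulting check, on toy types rather than the real
    struct sock, purely as an illustration:

    #include <stdbool.h>

    enum toy_state { TOY_LISTEN, TOY_ESTABLISHED, TOY_CLOSE };

    struct toy_sock {
            bool reuse;              /* SO_REUSEADDR */
            enum toy_state state;
    };

    /* conflict if either side lacks SO_REUSEADDR, or the other socket is
     * listening, or -- the new part -- still bound while in CLOSE state */
    static bool reuse_conflict(const struct toy_sock *sk,
                               const struct toy_sock *other)
    {
            return !sk->reuse || !other->reuse ||
                   other->state == TOY_LISTEN || other->state == TOY_CLOSE;
    }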

    Note: a prior patch tried to address the problem in a different (and
    buggy) way (commit fda48a0d7a8412ced, "tcp: bind() fix when many
    ports are bound").

    Reported-by: Gaspar Chilingarov
    Reported-by: Daniel Baluta
    Tested-by: Daniel Baluta
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

10 Dec, 2010

1 commit

  • Followup of commit b178bb3dfc30 (net: reorder struct sock fields).

    Optimize the INET input path a bit further by:

    1) moving sk_refcnt close to sk_lock.

    This reduces the number of dirtied cache lines by one on 64-bit
    arches (with a 64-byte cache line size).

    2) moving inet_daddr & inet_rcv_saddr to the beginning of sk

    (same cache line as hash / family / bound_dev_if / nulls_node)

    This reduces the number of cache lines accessed in lookups by one,
    and doesn't increase the size of inet and timewait socks.
    inet and tw sockets now share the same place-holder for these fields.

    Before patch :

    offsetof(struct sock, sk_refcnt) = 0x10
    offsetof(struct sock, sk_lock) = 0x40
    offsetof(struct sock, sk_receive_queue) = 0x60
    offsetof(struct inet_sock, inet_daddr) = 0x270
    offsetof(struct inet_sock, inet_rcv_saddr) = 0x274

    After patch :

    offsetof(struct sock, sk_refcnt) = 0x44
    offsetof(struct sock, sk_lock) = 0x48
    offsetof(struct sock, sk_receive_queue) = 0x68
    offsetof(struct inet_sock, inet_daddr) = 0x0
    offsetof(struct inet_sock, inet_rcv_saddr) = 0x4

    compute_score() (udp or tcp) now uses a single cache line per ignored
    item, instead of two.
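
    Offsets like the ones above can be inspected with offsetof(); a toy
    version of the measurement (simplified struct, not the real
    struct sock) looks like this:

    #include <stddef.h>
    #include <stdio.h>

    struct toy_sock {
            unsigned int hash;          /* lookup keys grouped together */
            unsigned int daddr;
            unsigned int rcv_saddr;
            int bound_dev_if;
            int refcnt;                 /* kept near the lock it is dirtied with */
            int lock;
    };

    int main(void)
    {
            printf("offsetof(daddr)     = %zu\n",
                   offsetof(struct toy_sock, daddr));
            printf("offsetof(rcv_saddr) = %zu\n",
                   offsetof(struct toy_sock, rcv_saddr));
            printf("offsetof(refcnt)    = %zu\n",
                   offsetof(struct toy_sock, refcnt));
            return 0;
    }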

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

18 Nov, 2010

1 commit


13 Jul, 2010

1 commit


11 Jun, 2010

1 commit


16 May, 2010

1 commit

  • (Dropped the infiniband part, because Tetsuo modified the related
    code; I will send a separate patch for it once this is accepted.)

    This patch introduces /proc/sys/net/ipv4/ip_local_reserved_ports which
    allows users to reserve ports for third-party applications.

    The reserved ports will not be used by automatic port assignments
    (e.g. when calling connect() or bind() with port number 0). Explicit
    port allocation behavior is unchanged.
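
    For illustration, reserving ports from a small C helper (the
    comma-and-range syntax used here is an assumption to double-check
    against the ip-sysctl documentation):

    #include <stdio.h>

    int main(void)
    {
            FILE *f = fopen("/proc/sys/net/ipv4/ip_local_reserved_ports", "w");

            if (!f) {
                    perror("ip_local_reserved_ports");
                    return 1;
            }
            /* keep 8080 and 50000-50100 away from automatic assignment */
            fputs("8080,50000-50100\n", f);
            return fclose(f) ? 1 : 0;
    }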

    Signed-off-by: Octavian Purdila
    Signed-off-by: WANG Cong
    Cc: Neil Horman
    Cc: Eric Dumazet
    Cc: Eric W. Biederman
    Signed-off-by: David S. Miller

    Amerigo Wang
     

03 May, 2010

1 commit


29 Apr, 2010

1 commit

  • This reverts two commits:

    fda48a0d7a8412cedacda46a9c0bf8ef9cd13559
    tcp: bind() fix when many ports are bound

    and a follow-on fix for it:

    6443bb1fc2050ca2b6585a3fa77f7833b55329ed
    ipv6: Fix inet6_csk_bind_conflict()

    It causes problems with binding listening sockets when time-wait
    sockets from a previous instance still are alive.

    It's too late to keep fiddling with this so late in the -rc
    series, and we'll deal with it in net-next-2.6 instead.

    Signed-off-by: David S. Miller

    David S. Miller
     

28 Apr, 2010

1 commit


23 Apr, 2010

1 commit

  • Port autoselection done by the kernel only works when the number of
    bound sockets is under a threshold (typically 30000).

    When this threshold is exceeded, we must check whether there is a
    conflict before exiting the first loop in inet_csk_get_port().

    Change inet_csk_bind_conflict() to forbid two reuse-enabled sockets
    from binding to the same (address, port) tuple (with a non-ANY
    address).

    Same change for inet6_csk_bind_conflict().

    Reported-by: Gaspar Chilingarov
    Signed-off-by: Eric Dumazet
    Acked-by: Evgeniy Polyakov
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 Apr, 2010

1 commit

  • Define a new function to return the waitqueue of a "struct sock".

    static inline wait_queue_head_t *sk_sleep(struct sock *sk)
    {
            return sk->sk_sleep;
    }

    Change all read occurrences of sk_sleep to a call to this function.

    Needed for a future RCU conversion: sk_sleep won't be a directly
    available field.
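
    An illustrative call site after the conversion (hypothetical
    function, not a hunk from this patch):

    #include <linux/sched.h>
    #include <linux/wait.h>
    #include <net/sock.h>

    static void example_wait_setup(struct sock *sk, wait_queue_t *wait)
    {
            /* read sites go through the accessor, so a later RCU
             * conversion only has to change sk_sleep() itself */
            prepare_to_wait(sk_sleep(sk), wait, TASK_INTERRUPTIBLE);
    }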

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

18 Jan, 2010

1 commit

  • Currently we don't increment SYN-ACK timeouts & retransmissions
    although we do increment the same stats for SYN. We seem to have lost
    the SYN-ACK accounting with the introduction of tcp_syn_recv_timer
    (commit 2248761e in the netdev-vger-cvs tree).

    This patch fixes this issue. In the process we also rename the v4/v6
    syn/ack retransmit functions for clarity. We also add a new
    request_sock operation (syn_ack_timeout) so we can keep the code in
    inet_connection_sock.c protocol agnostic.
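
    A toy sketch of the idea (stand-in types, not the kernel's
    request_sock_ops): the protocol-agnostic prune/retransmit loop calls
    back through an ops table, so TCP and DCCP can each account SYN-ACK
    timeouts in their own way.

    struct toy_req;

    struct toy_req_ops {
            int  (*rtx_syn_ack)(struct toy_req *req);
            void (*syn_ack_timeout)(struct toy_req *req);
    };

    struct toy_req {
            const struct toy_req_ops *ops;
            int retries;
    };

    static void prune_one(struct toy_req *req, int max_retries)
    {
            if (req->retries < max_retries) {
                    req->ops->syn_ack_timeout(req);  /* protocol bumps its stats */
                    req->ops->rtx_syn_ack(req);      /* retransmit the SYN-ACK */
                    req->retries++;
            }
    }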

    Signed-off-by: Octavian Purdila
    Signed-off-by: David S. Miller

    Octavian Purdila
     

03 Dec, 2009

1 commit

  • Add optional function parameters associated with sending SYNACK.
    These parameters are not needed after sending SYNACK, and are not
    used for retransmission. Avoids extending struct tcp_request_sock,
    and avoids allocating kernel memory.

    Also affects DCCP as it uses common struct request_sock_ops,
    but this parameter is currently reserved for future use.

    Signed-off-by: William.Allen.Simpson@gmail.com
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    William Allen Simpson
     

26 Nov, 2009

1 commit

  • Generated with the following semantic patch

    @@
    struct net *n1;
    struct net *n2;
    @@
    - n1 == n2
    + net_eq(n1, n2)

    @@
    struct net *n1;
    struct net *n2;
    @@
    - n1 != n2
    + !net_eq(n1, n2)

    applied over {include,net,drivers/net}.
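
    For illustration, the effect on a hypothetical helper (stub type,
    not the kernel's struct net):

    #include <stdbool.h>

    struct net;

    /* toy stand-in for the real helper */
    static bool net_eq(const struct net *n1, const struct net *n2)
    {
            return n1 == n2;
    }

    static bool same_netns(const struct net *n1, const struct net *n2)
    {
            /* was: return n1 == n2; */
            return net_eq(n1, n2);
    }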

    Signed-off-by: Octavian Purdila
    Signed-off-by: David S. Miller

    Octavian Purdila
     

27 Oct, 2009

1 commit