01 Jun, 2010

1 commit

  • Correct sk_forward_alloc handling for error_queue would need to use a
    backlog of frames that the softirq handler could not deliver because the
    socket is owned by a user thread, or to extend backlog processing to be
    able to process both normal and error packets.

    Another possibility is to not charge memory for the error queue at all;
    that is what this patch implements.

    Note: this reverts commit 29030374
    (net: fix sk_forward_alloc corruptions), since we no longer need to
    lock the socket.
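
    A hedged sketch of the no-charge approach (close in spirit to such a
    patch: account the error-queue skb against sk_rmem_alloc atomically,
    with a destructor to undo it, so neither the socket lock nor
    sk_forward_alloc is involved; details are illustrative, not verbatim
    kernel code):

      static void sock_rmem_free(struct sk_buff *skb)
      {
              struct sock *sk = skb->sk;

              atomic_sub(skb->truesize, &sk->sk_rmem_alloc);
      }

      int sock_queue_err_skb(struct sock *sk, struct sk_buff *skb)
      {
              /* bound error-queue memory by the receive buffer size */
              if (atomic_read(&sk->sk_rmem_alloc) + skb->truesize >=
                  (unsigned int)sk->sk_rcvbuf)
                      return -ENOMEM;

              skb_orphan(skb);
              skb->sk = sk;
              skb->destructor = sock_rmem_free;
              atomic_add(skb->truesize, &sk->sk_rmem_alloc);

              skb_queue_tail(&sk->sk_error_queue, skb);
              if (!sock_flag(sk, SOCK_DEAD))
                      sk->sk_data_ready(sk, skb->len);
              return 0;
      }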

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     


29 May, 2010

2 commits

  • As David found out, sock_queue_err_skb() should be called with the
    socket lock held, or we risk sk_forward_alloc corruption, since we use
    non-atomic operations to update this field.

    This patch adds a bh_lock_sock()/bh_unlock_sock() pair to three spots
    (BH already disabled there); see the sketch after this list:

    1) skb_tstamp_tx()
    2) Before calling ip_icmp_error(), in __udp4_lib_err()
    3) Before calling ipv6_icmp_error(), in __udp6_lib_err()
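
    A minimal sketch of the pattern at spot (2), assuming the surrounding
    __udp4_lib_err() variables (uh, err, info); BHs are already disabled
    in this path, so the plain bh_lock_sock() variant suffices:

      bh_lock_sock(sk);         /* serialize sk_forward_alloc updates */
      ip_icmp_error(sk, skb, err, uh->dest, info, (u8 *)(uh + 1));
      bh_unlock_sock(sk);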

    Reported-by: Anton Blanchard
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (22 commits)
    netlink: bug fix: wrong size was calculated for vfinfo list blob
    netlink: bug fix: don't overrun skbs on vf_port dump
    xt_tee: use skb_dst_drop()
    netdev/fec: fix ifconfig eth0 down hang issue
    cnic: Fix context memory init. on 5709.
    drivers/net: Eliminate a NULL pointer dereference
    drivers/net/hamradio: Eliminate a NULL pointer dereference
    be2net: Patch removes redundant while statement in loop.
    ipv6: Add GSO support on forwarding path
    net: fix __neigh_event_send()
    vhost: fix the memory leak which will happen when memory_access_ok fails
    vhost-net: fix to check the return value of copy_to/from_user() correctly
    vhost: fix to check the return value of copy_to/from_user() correctly
    vhost: Fix host panic if ioctl called with wrong index
    net: fix lock_sock_bh/unlock_sock_bh
    net/iucv: Add missing spin_unlock
    net: ll_temac: fix checksum offload logic
    net: ll_temac: fix interrupt bug when interrupt 0 is used
    sctp: dubious bitfields in sctp_transport
    ipmr: off by one in __ipmr_fill_mroute()
    ...

    Linus Torvalds
     

27 May, 2010

1 commit

  • This new sock lock primitive was introduced to speed up some user-context
    socket manipulation. But it is unsafe when two threads must be protected
    against each other, one using regular lock_sock()/release_sock() and the
    other using lock_sock_bh()/unlock_sock_bh().

    This patch changes lock_sock_bh() to be careful about the 'owned' state:
    if owned is found to be set, we must take the slow path.
    lock_sock_bh() now returns a boolean saying whether the slow path was
    taken, and this boolean is used at unlock_sock_bh() time to call the
    appropriate unlock function.

    After this change, BH may be either disabled or enabled during the
    protected section, depending on which path was taken. Since that makes
    the old names misleading, the functions are renamed to
    lock_sock_fast()/unlock_sock_fast().
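
    A hedged caller-side sketch of the renamed primitives (the fast path
    keeps the socket spinlock with BHs disabled; the slow path falls back
    to the regular socket lock):

      bool slow = lock_sock_fast(sk);

      /* ... short critical section, e.g. freeing a datagram ... */

      unlock_sock_fast(sk, slow); /* spin_unlock_bh() or release_sock() */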

    Reported-by: Anton Blanchard
    Signed-off-by: Eric Dumazet
    Tested-by: Anton Blanchard
    Signed-off-by: David S. Miller

    Eric Dumazet
     


16 May, 2010

1 commit

  • (Dropped the infiniband part, because Tetsuo modified the related code;
    I will send a separate patch for it once this is accepted.)

    This patch introduces /proc/sys/net/ipv4/ip_local_reserved_ports which
    allows users to reserve ports for third-party applications.

    The reserved ports will not be used by automatic port assignments
    (e.g. when calling connect() or bind() with port number 0). Explicit
    port allocation behavior is unchanged.
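
    A hedged usage sketch: the sysctl accepts a comma-separated list of
    ports and port ranges (the values below are illustrative, not from
    the patch):

      #include <stdio.h>

      int main(void)
      {
              FILE *f = fopen("/proc/sys/net/ipv4/ip_local_reserved_ports",
                              "w");

              if (!f) {
                      perror("fopen");
                      return 1;
              }
              /* reserve port 8080 and range 9000-9100 from autobinding */
              fputs("8080,9000-9100\n", f);
              return fclose(f) ? 1 : 0;
      }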

    Signed-off-by: Octavian Purdila
    Signed-off-by: WANG Cong
    Cc: Neil Horman
    Cc: Eric Dumazet
    Cc: Eric W. Biederman
    Signed-off-by: David S. Miller

    Amerigo Wang
     


07 May, 2010

1 commit

  • Commit 2783ef23 moved the initialisation of saddr and daddr after
    pskb_may_pull() to avoid a potential data corruption. Unfortunately it
    also placed the initialisation after the short-packet and bad-checksum
    error paths, where these variables are used for logging. The result is
    bogus output like

    [92238.389505] UDP: short packet: From 2.0.0.0:65535 23715/178 to 0.0.0.0:65535

    Move the saddr and daddr initialisation above the error paths, while
    still keeping it after the pskb_may_pull() call to preserve the fix
    from commit 2783ef23.
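
    A hedged sketch of the corrected ordering, modeled on __udp4_lib_rcv()
    after such a fix:

      if (!pskb_may_pull(skb, sizeof(struct udphdr)))
              goto drop;                /* no space for the header */

      uh    = udp_hdr(skb);
      ulen  = ntohs(uh->len);
      saddr = ip_hdr(skb)->saddr;       /* initialised before ... */
      daddr = ip_hdr(skb)->daddr;

      if (ulen > skb->len)
              goto short_packet;        /* ... the paths that log them */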

    Signed-off-by: Bjørn Mork
    Cc: stable@kernel.org
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Bjørn Mork
     

29 Apr, 2010

2 commits

  • When queueing an skb to a socket, we can immediately release its dst if
    the target socket does not use IP_CMSG_PKTINFO.

    tcp_data_queue() can drop the dst too.

    This benefits from a hot cache line and avoids having the receiver,
    possibly on another cpu, dirty this cache line itself.
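
    A hedged sketch of the idea, modeled on an ip_queue_rcv_skb() style
    helper such a change would introduce (treat the helper name as
    illustrative):

      int ip_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
      {
              /* receiver only needs the dst to fill the PKTINFO cmsg */
              if (!(inet_sk(sk)->cmsg_flags & IP_CMSG_PKTINFO))
                      skb_dst_drop(skb);
              return sock_queue_rcv_skb(sk, skb);
      }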

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Since commit 95766fff ([UDP]: Add memory accounting.),
    each received packet needs one extra lock_sock()/release_sock() pair.

    This added latency because of possible backlog handling. Then later,
    ticket spinlocks added yet another latency source in case of DDOS.

    This patch introduces lock_sock_bh() and unlock_sock_bh()
    synchronization primitives, avoiding one atomic operation and backlog
    processing.

    skb_free_datagram_locked() uses them instead of the full-blown
    lock_sock()/release_sock(). The skb is orphaned inside the locked
    section for proper socket memory reclaim, and finally freed outside
    of it.

    The UDP receive path now takes the socket spinlock only once.
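
    A hedged sketch of skb_free_datagram_locked() after this change: only
    the memory-accounting step runs under the new lightweight lock.

      void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb)
      {
              lock_sock_bh(sk);
              skb_orphan(skb);    /* reclaim rx-queue memory under lock */
              unlock_sock_bh(sk);

              consume_skb(skb);   /* the actual free happens outside */
      }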

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

28 Apr, 2010

2 commits

  • The current socket backlog limit is not enough to really stop DDOS
    attacks, because the user thread spends a lot of time processing a
    full backlog each round, and may spin madly on the socket lock.

    We should add the backlog size and the receive-queue size (aka
    rmem_alloc) together to pace writers, and let the user thread run
    without being slowed down too much.

    Introduce a sk_rcvqueues_full() helper, to avoid taking the socket
    lock in stress situations.

    Under huge stress from a multiqueue/RPS-enabled NIC, a single-flow UDP
    receiver can now process ~200,000 pps (instead of ~100 pps before the
    patch) on an 8-core machine.
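
    A hedged sketch of the helper, close to the form such a patch would
    add: writers can bail out before touching the socket spinlock.

      static inline bool sk_rcvqueues_full(const struct sock *sk,
                                           const struct sk_buff *skb)
      {
              unsigned int qsize = sk->sk_backlog.len +
                                   atomic_read(&sk->sk_rmem_alloc);

              return qsize + skb->truesize > sk->sk_rcvbuf;
      }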

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Idea from Eric Dumazet.

    As for placement inside struct sock, I tried to choose a place
    that otherwise has a 32-bit hole on 64-bit systems.

    Signed-off-by: David S. Miller
    Acked-by: Eric Dumazet

    David S. Miller
     


17 Apr, 2010

1 commit

  • This patch implements receive flow steering (RFS). RFS steers
    received packets for layer 3 and 4 processing to the CPU where
    the application for the corresponding flow is running. RFS is an
    extension of Receive Packet Steering (RPS).

    The basic idea of RFS is that when an application calls recvmsg
    (or sendmsg) the application's running CPU is stored in a hash
    table that is indexed by the connection's rxhash which is stored in
    the socket structure. The rxhash is passed in skb's received on
    the connection from netif_receive_skb. For each received packet,
    the associated rxhash is used to look up the CPU in the hash table,
    if a valid CPU is set then the packet is steered to that CPU using
    the RPS mechanisms.

    The complication with this simple approach is that it would
    potentially allow OOO packets. If threads are thrashing around CPUs,
    or multiple threads are trying to read from the same sockets, a
    quickly changing CPU value in the hash table could cause rampant OOO
    packets -- we consider this a non-starter.

    To avoid OOO packets, this solution implements two types of hash
    tables: rps_sock_flow_table and rps_dev_flow_table.

    rps_sock_flow_table is a global hash table. Each entry is just a CPU
    number, and it is populated in recvmsg and sendmsg as described above.
    This table contains the "desired" CPUs for flows.

    rps_dev_flow_table is specific to each device queue. Each entry
    contains a CPU and a tail queue counter. The CPU is the "current"
    CPU for a matching flow. The tail queue counter holds the value
    of a tail queue counter for the associated CPU's backlog queue at
    the time of last enqueue for a flow matching the entry.

    Each backlog queue has a queue head counter which is incremented
    on dequeue, and so a queue tail counter is computed as queue head
    count + queue length. When a packet is enqueued on a backlog queue,
    the current value of the queue tail counter is saved in the hash
    entry of the rps_dev_flow_table.

    And now the trick: when selecting the CPU for RPS (get_rps_cpu)
    the rps_sock_flow table and the rps_dev_flow table for the RX queue
    are consulted. When the desired CPU for the flow (found in the
    rps_sock_flow table) does not match the current CPU (found in the
    rps_dev_flow table), the current CPU is changed to the desired CPU
    if one of the following is true (see the sketch after this list):

    - The current CPU is unset (equal to RPS_NO_CPU)
    - The current CPU is offline
    - The current CPU's queue head counter >= the queue tail counter in
      the rps_dev_flow table. This checks whether the queue tail has
      advanced beyond the last packet that was enqueued using this table
      entry, which guarantees that all packets queued using this entry
      have been dequeued, thus preserving in-order delivery.
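
    A hedged fragment of the switch logic, simplified from the
    get_rps_cpu() shape this patch describes (rflow points at the
    matching rps_dev_flow_table entry; names follow the commit text):

      u16 next_cpu = sock_flow_table->ents[hash & sock_flow_table->mask];
      u16 tcpu = rflow->cpu;

      if (next_cpu != tcpu &&
          (tcpu == RPS_NO_CPU || !cpu_online(tcpu) ||
           (int)(per_cpu(softnet_data, tcpu).input_queue_head -
                 rflow->last_qtail) >= 0)) {
              tcpu = rflow->cpu = next_cpu;  /* old queue fully drained */
              if (tcpu != RPS_NO_CPU)
                      rflow->last_qtail =
                              per_cpu(softnet_data, tcpu).input_queue_head;
      }
      if (tcpu != RPS_NO_CPU && cpu_online(tcpu))
              cpu = tcpu;                    /* steer the packet to tcpu */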

    Making each queue have its own rps_dev_flow table has two advantages:
    1) the tail queue counters will be written on each receive, so keeping
    the table local to the interrupting CPU is good for locality. 2) it
    allows lockless access to the table -- the CPU number and queue tail
    counter need to be accessed together; mutual exclusion is provided by
    netif_receive_skb, which we assume is only called from device
    napi_poll, which is non-reentrant.

    This patch implements RFS for TCP and connected UDP sockets.
    It should be usable for other flow oriented protocols.

    There are two configuration parameters for RFS. The
    "rps_sock_flow_entries" kernel parameter sets the number of entries
    in the rps_sock_flow_table; the per-rxqueue sysfs entry "rps_flow_cnt"
    contains the number of entries in the rps_dev_flow table for that
    rxqueue. Both are rounded up to a power of two.

    The obvious benefit of RFS (over just RPS) is that it achieves
    CPU locality between the receive processing for a flow and the
    applications processing; this can result in increased performance
    (higher pps, lower latency).

    The benefits of RFS are dependent on cache hierarchy, application
    load, and other factors. On simple benchmarks, we don't necessarily
    see improvement and sometimes see degradation. However, for more
    complex benchmarks and for applications where cache pressure is
    much higher this technique seems to perform very well.

    Below are some benchmark results which show the potential benefit of
    this patch. The netperf test has 500 instances of the netperf TCP_RR
    test with 1-byte requests and responses. The RPC test is a
    request/response test similar in structure to the netperf RR test,
    with 100 threads on each host, but doing more work in userspace than
    netperf.

    e1000e on 8-core Intel:
        No RFS or RPS             104K tps at 30% CPU
        No RFS (best RPS config)  290K tps at 63% CPU
        RFS                       303K tps at 61% CPU

    RPC test:
                    tps   CPU%  50/90/99% usec latency  latency StdDev
        No RFS/RPS  103K  48%   757/900/3185            4472.35
        RPS only    174K  73%   415/993/2468            491.66
        RFS         223K  73%   379/651/1382            315.61

    Signed-off-by: Tom Herbert
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Tom Herbert
     


09 Apr, 2010

1 commit

  • Commits 5051ebd275de672b807c28d93002c2fb0514a3c9 and
    5051ebd275de672b807c28d93002c2fb0514a3c9 ("ipv[46]: udp: optimize unicast RX
    path") broke some programs.

    After upgrading an L2TP server to 2.6.33 it started to fail, tunnels
    going up and down, after the 10th tunnel came up. My modified rp-l2tp
    uses a global unconnected socket bound to (INADDR_ANY, 1701) and one
    connected socket per tunnel after parameter negotiation.

    After ten sockets were open, and due to mixed parameters to
    udp[46]_lib_lookup2(), the kernel started to drop packets.

    Signed-off-by: Jorge Boncompte [DTI2]
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Jorge Boncompte [DTI2]
     

30 Mar, 2010

1 commit

  • …implicit slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    The percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include
    those headers directly instead of assuming availability. As this
    conversion needs to touch a large number of source files, the
    following script was used as the basis of the conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following:

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there, i.e. if only gfp is used,
    gfp.h; if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered --
    alphabetical, Christmas tree, rev-Xmas-tree, or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have a fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition, while for others adding it to an
    implementation .h or embedding .c file was more appropriate. This
    step added inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them, as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored, as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that they could be applied
    as a separate patch and serve as a bisection point.

    Given the fact that I had only a couple of failures from the tests on
    step 6, I'm fairly confident about the coverage of this conversion
    patch. If there is a breakage, it's likely to be something in one of
    the arch headers, which should be easily discoverable on most builds
    of the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

06 Mar, 2010

2 commits

  • sk_add_backlog -> __sk_add_backlog
    sk_add_backlog_limited -> sk_add_backlog

    Signed-off-by: Zhu Yi
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Zhu Yi
     
  • Make udp adapt to the limited socket backlog change.
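
    A hedged sketch of the receive-path pattern after this change,
    simplified from udp_queue_rcv_skb() callers of this era:

      bh_lock_sock(sk);
      if (!sock_owned_by_user(sk)) {
              rc = __udp_queue_rcv_skb(sk, skb);
      } else if (sk_add_backlog(sk, skb)) { /* limited: can now fail */
              bh_unlock_sock(sk);
              goto drop;                    /* backlog full, drop frame */
      }
      bh_unlock_sock(sk);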

    Cc: "David S. Miller"
    Cc: Alexey Kuznetsov
    Cc: "Pekka Savola (ipv6)"
    Cc: Patrick McHardy
    Signed-off-by: Zhu Yi
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Zhu Yi
     

13 Feb, 2010

1 commit

  • The variable 'copied' is used in udp_recvmsg() to emphasize that the passed
    'len' is adjusted to fit the actual datagram length. But the same can be
    done by adjusting 'len' directly. This patch thus removes the indirection.
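
    A hedged sketch of the resulting udp_recvmsg() logic: 'len' itself is
    clamped to the datagram length, with no separate 'copied' variable:

      ulen = skb->len - sizeof(struct udphdr);
      if (len > ulen)
              len = ulen;
      else if (len < ulen)
              msg->msg_flags |= MSG_TRUNC;

      err = skb_copy_datagram_iovec(skb, sizeof(struct udphdr),
                                    msg->msg_iov, len);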

    Signed-off-by: Gerrit Renker
    Signed-off-by: David S. Miller

    Gerrit Renker
     


14 Dec, 2009

1 commit

  • Now that we can have a large UDP hash table, the udp_lib_get_port()
    loop should be converted to a do {} while (cond) form; otherwise we
    don't enter it at all if the hash table size is exactly 65536.
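
    A hedged sketch of the loop-shape fix, assuming the 16-bit first/last
    variables of udp_lib_get_port(): with a 65536-slot table, 'last' wraps
    back to 'first', so a pre-tested for () loop would never execute,
    while the do {} while form always probes at least once:

      last = first + udptable->mask + 1; /* == first when mask is 0xffff */
      do {
              hslot = udp_hashslot(udptable, net, first);
              /* ... probe this slot for a usable port ... */
      } while (++first != last);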

    Reported-by: Yinghai Lu
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

24 Nov, 2009

1 commit

  • On Sun, 2009-11-22 at 16:31 -0800, David Miller wrote:
    > It should be of the form:
    >         if (x &&
    >             y)
    >
    > or:
    >         if (x && y)
    >
    > Fix patches, rather than complaints, for existing cases where things
    > do not follow this pattern are certainly welcome.

    Also collapsed some multiple tabs to single space.

    Signed-off-by: Joe Perches
    Signed-off-by: David S. Miller

    Joe Perches
     


09 Nov, 2009

6 commits

  • When skb_clone() fails, we should increment sk_drops and SNMP counters.
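
    A hedged sketch of the error path, modeled on the multicast delivery
    loop this series touches:

      skb1 = skb_clone(skb, GFP_ATOMIC);
      if (!skb1) {
              atomic_inc(&sk->sk_drops);
              UDP_INC_STATS_BH(sock_net(sk), UDP_MIB_RCVBUFERRORS,
                               IS_UDPLITE(sk));
              UDP_INC_STATS_BH(sock_net(sk), UDP_MIB_INERRORS,
                               IS_UDPLITE(sk));
      }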

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • The UDP multicast RX path is a bit complex and can hold a spinlock
    for a long time.

    Using a small (32 or 64 entries) stack of socket pointers can help
    to perform expensive operations (skb_clone(), udp_queue_rcv_skb())
    outside of the lock, in most cases.

    It's also a base for a future RCU conversion of multicast reception.
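
    A hedged sketch of the batching idea (sock_matches() is a hypothetical
    stand-in for the real multicast matching test):

      struct sock *stack[256 / sizeof(struct sock *)]; /* 32 or 64 slots */
      unsigned int i, count = 0;

      spin_lock(&hslot->lock);
      sk_nulls_for_each(sk, node, &hslot->head)
              if (sock_matches(sk) && count < ARRAY_SIZE(stack))
                      stack[count++] = sk;

      for (i = 0; i < count; i++)
              sock_hold(stack[i]);  /* keep sockets alive once unlocked */
      spin_unlock(&hslot->lock);

      flush_stack(stack, count, skb); /* clone + deliver, now unlocked */
      for (i = 0; i < count; i++)
              sock_put(stack[i]);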

    Signed-off-by: Eric Dumazet
    Signed-off-by: Lucian Adrian Grijincu
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • We first locate the (local port) hash chain head.
    If few sockets are in this chain, we proceed with the previous lookup
    algorithm.

    If too many sockets are listed, we take a look at the secondary
    (port, address) hash chain added in the previous patch.

    We choose the shorter chain and proceed with an RCU lookup on the
    elected chain.

    But if we chose the (port, address) chain and fail to find a socket on
    the given address, we must try another lookup on the (port, INADDR_ANY)
    chain to find sockets not bound to a particular IP.

    -> No extra cost for typical setups, where the first lookup will
    probably be the one performed.

    RCU lookups everywhere; we don't acquire the spinlock.
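
    A hedged sketch, simplified from the __udp4_lib_lookup() shape after
    this series (the threshold and helper names follow that code):

      if (hslot->count > 10) {
              hash2 = udp4_portaddr_hash(net, daddr, hnum);
              slot2 = hash2 & udptable->mask;
              hslot2 = &udptable->hash2[slot2];
              if (hslot->count < hslot2->count)
                      goto begin;       /* primary chain is shorter */

              result = udp4_lib_lookup2(net, saddr, sport,
                                        daddr, hnum, dif, hslot2, slot2);
              if (!result) {
                      /* nothing bound to daddr: retry (port, INADDR_ANY) */
                      hash2 = udp4_portaddr_hash(net, htonl(INADDR_ANY),
                                                 hnum);
                      slot2 = hash2 & udptable->mask;
                      hslot2 = &udptable->hash2[slot2];
                      if (hslot->count < hslot2->count)
                              goto begin;
                      result = udp4_lib_lookup2(net, saddr, sport,
                                                htonl(INADDR_ANY), hnum,
                                                dif, hslot2, slot2);
              }
              return result;
      }
      begin:
              /* regular RCU walk of the primary (port) chain ... */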

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Extends udp_table to contain a secondary hash table.

    The socket anchor for this second hash is free, because UDP doesn't
    use skc_bind_node: we define a union to hold both skc_bind_node and a
    new hlist_nulls_node, udp_portaddr_node.

    udp_lib_get_port() inserts sockets into second hash chain
    (additional cost of one atomic op)

    udp_lib_unhash() deletes socket from second hash chain
    (additional cost of one atomic op)

    Note: no spinlock lockdep annotation is needed, because the lock for
    the secondary hash chain is always taken after the lock for the
    primary hash chain.
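
    A hedged sketch of the anchor union (field names follow the commit
    message; exact placement in the socket structures may differ):

      union {
              struct hlist_node       skc_bind_node;     /* unused by UDP */
              struct hlist_nulls_node udp_portaddr_node; /* 2nd hash chain */
      };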

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Union sk_hash with two u16 hashes for UDP (no extra memory taken).

    One 16-bit hash on the (local port) value (the previous UDP 'hash').

    One 16-bit hash on the (local address, local port) values, initialized
    but not yet used. This second hash uses the Jenkins hash for better
    distribution.

    Because the 'port' is xored in later, a partial hash is performed
    on the local address + net_hash_mix(net).
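
    A hedged sketch of the shared storage and the partial hash (the
    helper matches udp4_portaddr_hash() as introduced by this series;
    the union field names are illustrative):

      union {
              unsigned int    sk_hash;
              __u16           sk_u16hashes[2]; /* [0] port, [1] addr+port */
      };

      /* partial: the port is xored in later, at insert/lookup time */
      static inline unsigned int udp4_portaddr_hash(struct net *net,
                                                    __be32 saddr,
                                                    unsigned int port)
      {
              return jhash_1word((__force u32)saddr,
                                 net_hash_mix(net)) ^ port;
      }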

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Adds a counter to udp_hslot to keep an accurate count
    of the sockets present in the chain.

    This will permit the upcoming UDP lookup algorithm to choose
    the shortest chain when the secondary hash is added.
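
    A hedged sketch of the resulting slot layout:

      struct udp_hslot {
              struct hlist_nulls_head head;
              int                     count;  /* sockets in this chain */
              spinlock_t              lock;
      };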

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     


31 Oct, 2009

1 commit

  • On UDP sockets, we must call skb_free_datagram() with the socket
    locked, or risk sk_forward_alloc corruption. This requirement is not
    respected in SUNRPC.

    Add a convenient helper, skb_free_datagram_locked(), and use it in
    SUNRPC.
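
    A hedged sketch of the helper as first introduced here (the fast-lock
    variant shown in the later entries above came afterwards):

      void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb)
      {
              lock_sock(sk);
              skb_free_datagram(sk, skb);
              release_sock(sk);
      }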

    Reported-by: Francis Moreau
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

19 Oct, 2009

2 commits

  • - skb_kill_datagram() can increment sk->sk_drops itself; callers no
      longer need to.

    - UDP on IPv4 & IPv6: dropped frames (because of bad checksum or
      policy checks) increment sk_drops.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • In order to have better cache layouts of struct sock (separate zones
    for rx/tx paths), we need this preliminary patch.

    The goal is to transfer the fields used at lookup time into the first
    read-mostly cache line (inside struct sock_common) and to move
    sk_refcnt to a separate cache line (only written by the rx path).

    This patch adds the inet_ prefix to the daddr, rcv_saddr, dport, num,
    saddr, sport and id fields. This allows a future patch to define these
    fields as macros, like sk_refcnt, without name clashes.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

15 Oct, 2009

1 commit

  • sock_queue_rcv_skb() can update sk_drops itself, removing the need for
    callers to take care of it. This is more consistent, since
    sock_queue_rcv_skb() also reads sk_drops when queueing an skb.

    This adds sk_drops management to many protocols that did not care
    about it yet.
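
    A hedged before/after sketch for callers:

      rc = sock_queue_rcv_skb(sk, skb);
      if (rc < 0) {
              /* no atomic_inc(&sk->sk_drops) needed here any more:
               * sock_queue_rcv_skb() now updates sk_drops itself */
              kfree_skb(skb);
      }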

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     


13 Oct, 2009

2 commits

  • udp_poll() can in some circumstances drop frames with incorrect
    checksums.

    The problem is that we now have to lock the socket while dropping
    frames, or risk sk_forward_alloc corruption.

    This bug has been present since commit 95766fff6b9a78d1
    ([UDP]: Add memory accounting.)

    While we are at it, we can correct ioctl(SIOCINQ) to also drop bad
    frames.
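
    A hedged sketch of the approach, modeled on a first_packet_length()
    style helper: scan the receive queue under its spinlock, unlink the
    bad-checksum frames, then free them with the socket locked so that
    sk_forward_alloc stays consistent:

      struct sk_buff_head list_kill, *rcvq = &sk->sk_receive_queue;
      struct sk_buff *skb;

      __skb_queue_head_init(&list_kill);

      spin_lock_bh(&rcvq->lock);
      while ((skb = skb_peek(rcvq)) != NULL &&
             udp_lib_checksum_complete(skb)) {
              UDP_INC_STATS_BH(sock_net(sk), UDP_MIB_INERRORS,
                               IS_UDPLITE(sk));
              __skb_unlink(skb, rcvq);
              __skb_queue_tail(&list_kill, skb);
      }
      res = skb ? skb->len : 0;  /* first valid datagram length, or 0 */
      spin_unlock_bh(&rcvq->lock);

      if (!skb_queue_empty(&list_kill)) {
              lock_sock(sk);             /* needed for sk_forward_alloc */
              __skb_queue_purge(&list_kill);
              sk_mem_reclaim_partial(sk);
              release_sock(sk);
      }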

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Create a new socket-level option to report the number of queue
    overflows.

    Recently I augmented the AF_PACKET protocol to report the number of
    frames lost on the socket receive queue between any two enqueued
    frames. This value was exported via a SOL_PACKET-level cmsg. After I
    completed that work it was requested that this feature be generalized
    so that any datagram-oriented socket could make use of this option. As
    such I've created this patch. It creates a new SOL_SOCKET-level option
    called SO_RXQ_OVFL, which when enabled exports a SOL_SOCKET-level cmsg
    that reports the number of times the sk_receive_queue overflowed
    between any two given frames. It also augments the AF_PACKET protocol
    to take advantage of this new feature (as it previously did not touch
    sk->sk_drops, which this patch uses to record the overflow count).
    Tested successfully by me.

    Notes:

    1) Unlike my previous patch, this patch simply records the sk_drops value, which
    is not a number of drops between packets, but rather a total number of drops.
    Deltas must be computed in user space.

    2) While this patch currently works with datagram-oriented protocols,
    it will also be accepted by non-datagram-oriented protocols. I'm not
    sure if that's agreeable to everyone, but my argument in favor of
    doing so is that, for those protocols to which this option isn't
    applicable, sk_drops will always be zero, and reporting no drops on a
    receive queue that isn't used for those non-participating protocols
    seems reasonable to me. This also saves us having to code in a
    per-protocol opt-in mechanism.

    3) This applies cleanly to net-next assuming that commit
    977750076d98c7ff6cbda51858bb5a5894a9d9ab (my af packet cmsg patch) is reverted
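
    A hedged user-space usage sketch (the port and buffer sizes are
    illustrative; SO_RXQ_OVFL is defined here for older libc headers,
    using the asm-generic value):

      #include <stdio.h>
      #include <stdint.h>
      #include <string.h>
      #include <sys/socket.h>
      #include <sys/uio.h>
      #include <netinet/in.h>

      #ifndef SO_RXQ_OVFL
      #define SO_RXQ_OVFL 40
      #endif

      int main(void)
      {
              int one = 1;
              int fd = socket(AF_INET, SOCK_DGRAM, 0);
              struct sockaddr_in addr = { .sin_family = AF_INET };
              char buf[2048], cbuf[CMSG_SPACE(sizeof(uint32_t))];
              struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
              struct msghdr msg = {
                      .msg_iov = &iov, .msg_iovlen = 1,
                      .msg_control = cbuf, .msg_controllen = sizeof(cbuf),
              };
              struct cmsghdr *cmsg;

              addr.sin_port = htons(12345);          /* example port */
              addr.sin_addr.s_addr = htonl(INADDR_ANY);

              setsockopt(fd, SOL_SOCKET, SO_RXQ_OVFL, &one, sizeof(one));
              bind(fd, (struct sockaddr *)&addr, sizeof(addr));

              if (recvmsg(fd, &msg, 0) >= 0) {
                      for (cmsg = CMSG_FIRSTHDR(&msg); cmsg;
                           cmsg = CMSG_NXTHDR(&msg, cmsg)) {
                              if (cmsg->cmsg_level == SOL_SOCKET &&
                                  cmsg->cmsg_type == SO_RXQ_OVFL) {
                                      uint32_t drops;

                                      memcpy(&drops, CMSG_DATA(cmsg),
                                             sizeof(drops));
                                      /* total, not per-packet delta */
                                      printf("drops so far: %u\n", drops);
                              }
                      }
              }
              return 0;
      }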

    Signed-off-by: Neil Horman
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Neil Horman