23 Feb, 2010

1 commit

  • Convert AF_PACKET to use RCU, eliminating one more reader/writer lock.

    There is no need for a real sk_del_node_init_rcu(), because
    sk_del_node_init is already doing the equivalent of
    hlist_del_init_rcu(); comments were added to make that obvious.
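
    For reference, the rculist variant looks like the sketch below;
    sk_del_node_init() similarly clears only pprev (via sk_node_init()),
    leaving ->next intact so concurrent RCU readers can finish walking
    the chain:

        static inline void hlist_del_init_rcu(struct hlist_node *n)
        {
                if (!hlist_unhashed(n)) {
                        __hlist_del(n);  /* n->next still points into the old chain */
                        n->pprev = NULL; /* marks the node unhashed */
                }
        }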

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    stephen hemminger
     


09 Nov, 2009

2 commits

  • Extends udp_table to contain a secondary hash table.

    The socket anchor for this second hash is free, because UDP
    doesn't use skc_bind_node: we define a union to hold both
    skc_bind_node and a new hlist_nulls_node, udp_portaddr_node.

    udp_lib_get_port() inserts sockets into the second hash chain
    (additional cost of one atomic op).

    udp_lib_unhash() deletes sockets from the second hash chain
    (additional cost of one atomic op).

    Note: no spinlock lockdep annotation is needed, because the lock
    for the secondary hash chain is always taken after the lock for
    the primary hash chain.
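
    A sketch of the union described above (field names as given; exact
    placement paraphrased):

        union {
                struct hlist_node       skc_bind_node;     /* unused by UDP */
                struct hlist_nulls_node udp_portaddr_node; /* new (addr, port) chain */
        };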

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Union sk_hash with two u16 hashes for udp (no extra memory taken)

    One 16-bit hash on the (local port) value (the previous udp 'hash').

    One 16-bit hash on the (local address, local port) values, initialized
    but not yet used. This second hash uses the Jenkins hash for better
    distribution.

    Because the 'port' is XORed in later, a partial hash is performed
    on local address + net_hash_mix(net).
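
    A sketch of the layout and of the partial-hash idea (the jhash_1word()
    helper below illustrates 'Jenkins hash seeded by net_hash_mix()', it is
    not a verbatim quote of the patch):

        union {
                unsigned int    skc_hash;         /* single hash, e.g. TCP */
                __u16           skc_u16hashes[2]; /* two udp hashes, same 4 bytes */
        };

        /* partial hash on the local address; the local port is XORed in later */
        static u32 udp4_portaddr_hash(struct net *net, __be32 saddr, unsigned int port)
        {
                return jhash_1word((__force u32)saddr, net_hash_mix(net)) ^ port;
        }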

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 Oct, 2009

1 commit

  • Introduce sk_tx_queue_mapping, and functions that set, test and
    get this value. Reset sk_tx_queue_mapping to -1 whenever the dst
    cache is set/reset, and in socket alloc. Setting txq to -1 and
    using valid txq values (0 to n-1) allows the tx path to use the
    value of sk_tx_queue_mapping directly instead of subtracting 1 on
    every tx.
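
    The helpers as described might look like this sketch (paraphrased):

        static inline void sk_tx_queue_set(struct sock *sk, int tx_queue)
        {
                sk->sk_tx_queue_mapping = tx_queue;
        }

        static inline void sk_tx_queue_clear(struct sock *sk)
        {
                sk->sk_tx_queue_mapping = -1;   /* -1 == no queue recorded */
        }

        static inline int sk_tx_queue_get(const struct sock *sk)
        {
                return sk ? sk->sk_tx_queue_mapping : -1;
        }

        static inline bool sk_tx_queue_recorded(const struct sock *sk)
        {
                return sk && sk->sk_tx_queue_mapping >= 0;
        }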

    Signed-off-by: Krishna Kumar
    Signed-off-by: David S. Miller

    Krishna Kumar
     


13 Oct, 2009

1 commit

  • Create a new socket level option to report number of queue overflows

    Recently I augmented the AF_PACKET protocol to report the number of frames lost
    on the socket receive queue between any two enqueued frames. This value was
    exported via a SOL_PACKET level cmsg. After I completed that work it was
    requested that this feature be generalized so that any datagram oriented socket
    could make use of this option. As such I've created this patch. It creates a
    new SOL_SOCKET level option called SO_RXQ_OVFL, which when enabled exports a
    SOL_SOCKET level cmsg that reports the number of times the sk_receive_queue
    overflowed between any two given frames. It also augments the AF_PACKET
    protocol to take advantage of this new feature (as it previously did not touch
    sk->sk_drops, which this patch uses to record the overflow count). Tested
    successfully by me.

    Notes:

    1) Unlike my previous patch, this patch simply records the sk_drops value, which
    is not a number of drops between packets, but rather a total number of drops.
    Deltas must be computed in user space.

    2) While this patch currently works with datagram oriented protocols, it will
    also be accepted by non-datagram oriented protocols. I'm not sure if that's
    agreeable to everyone, but my argument in favor of doing so is that, for those
    protocols which aren't applicable to this option, sk_drops will always be zero,
    and reporting no drops on a receive queue that isn't used for those
    non-participating protocols seems reasonable to me. This also saves us having
    to code in a per-protocol opt-in mechanism.

    3) This applies cleanly to net-next assuming that commit
    977750076d98c7ff6cbda51858bb5a5894a9d9ab (my af packet cmsg patch) is reverted
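
    A minimal userspace sketch of consuming the new cmsg (error handling
    trimmed; the fallback SO_RXQ_OVFL value is an assumption for older
    libc headers):

        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/socket.h>
        #include <sys/uio.h>

        #ifndef SO_RXQ_OVFL
        #define SO_RXQ_OVFL 40                  /* asm-generic value */
        #endif

        static void read_with_dropcount(int fd)
        {
                char data[2048], cbuf[CMSG_SPACE(sizeof(uint32_t))];
                struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
                struct msghdr msg = {
                        .msg_iov = &iov, .msg_iovlen = 1,
                        .msg_control = cbuf, .msg_controllen = sizeof(cbuf),
                };
                struct cmsghdr *cmsg;
                int on = 1;

                setsockopt(fd, SOL_SOCKET, SO_RXQ_OVFL, &on, sizeof(on));
                if (recvmsg(fd, &msg, 0) < 0)
                        return;
                for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
                        if (cmsg->cmsg_level == SOL_SOCKET &&
                            cmsg->cmsg_type == SO_RXQ_OVFL) {
                                uint32_t drops;

                                memcpy(&drops, CMSG_DATA(cmsg), sizeof(drops));
                                /* cumulative counter: deltas are userspace's job */
                                printf("drops so far: %u\n", drops);
                        }
                }
        }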

    Signed-off-by: Neil Horman
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Neil Horman
     

12 Oct, 2009

1 commit

  • Since commit a98b65a3 (net: annotate struct sock bitfield), we lost
    8 bytes in struct sock on 64-bit arches because of the
    kmemcheck_bitfield_end(flags) misplacement.

    Fix this by grouping sk_shutdown, sk_no_check, sk_userlocks,
    sk_protocol and sk_type together in the 32-bit 'flags' bitfield.
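
    The regrouped field might look like this sketch (the five members fill
    exactly 32 bits):

        kmemcheck_bitfield_begin(flags);
        unsigned int            sk_shutdown  : 2,
                                sk_no_check  : 2,
                                sk_userlocks : 4,
                                sk_protocol  : 8,
                                sk_type      : 16;
        kmemcheck_bitfield_end(flags);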

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

07 Oct, 2009

1 commit

  • An incoming datagram must bring a *lot* of cache lines into the CPU
    cache, in particular (other parts omitted: hash chains, IP route cache...):

    On 32bit arches :

    offsetof(struct sock, sk_rcvbuf) =0x30 (read)
    offsetof(struct sock, sk_lock) =0x34 (rw)

    offsetof(struct sock, sk_sleep) =0x50 (read)
    offsetof(struct sock, sk_rmem_alloc) =0x64 (rw)
    offsetof(struct sock, sk_receive_queue)=0x74 (rw)

    offsetof(struct sock, sk_forward_alloc)=0x98 (rw)

    offsetof(struct sock, sk_callback_lock)=0xcc (rw)
    offsetof(struct sock, sk_drops) =0xd8 (read if we add dropcount support, rw if frame dropped)
    offsetof(struct sock, sk_filter) =0xf8 (read)

    offsetof(struct sock, sk_socket) =0x138 (read)

    offsetof(struct sock, sk_data_ready) =0x15c (read)

    We can avoid referencing sk->sk_socket and socket->fasync_list on sockets
    with no fasync() structures. (The socket->fasync_list pointer is probably
    already in cache because it shares a cache line with socket->wait, i.e. the
    location pointed to by sk->sk_sleep.)

    This avoids one cache line load per incoming packet in the common case
    (no fasync()).

    We can leave (or even move, in a future patch) sk->sk_socket in a cold
    location.
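
    A paraphrased sketch of the resulting test (a socket flag records whether
    fasync was ever requested, so the common case never touches sk_socket):

        static inline void sk_wake_async(struct sock *sk, int how, int band)
        {
                if (sock_flag(sk, SOCK_FASYNC)) /* common case: flag clear */
                        sock_wake_async(sk->sk_socket, how, band);
        }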

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

01 Oct, 2009

1 commit

  • This provides safety against negative optlen at the type
    level instead of depending upon (sometimes non-trivial)
    checks against this sprinkled all over the place, in
    each and every implementation.

    Based upon work done by Arjan van de Ven and feedback
    from Linus Torvalds.
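
    Concretely, the option-length parameter becomes unsigned in the hooks;
    a sketch of the proto_ops member:

        int (*setsockopt)(struct socket *sock, int level, int optname,
                          char __user *optval, unsigned int optlen);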

    Signed-off-by: David S. Miller

    David S. Miller
     

17 Jul, 2009

1 commit

  • Commit e912b1142be8f1e2c71c71001dc992c6e5eb2ec1
    (net: sk_prot_alloc() should not blindly overwrite memory)
    took care of not zeroing whole new socket at allocation time.

    sock_copy() is another spot where we should be very careful.
    We should not set refcnt to a non-null value until we are sure
    other fields are correctly set up, or a lockless reader could
    catch this socket by mistake while it is not fully (re)initialized.

    This patch puts sk_node & sk_refcnt at the very beginning of
    struct sock to ease sock_copy() & sk_prot_alloc()'s job.

    We add an appropriate smp_wmb() before each sk_refcnt initialization
    to match our RCU requirements (changes to sock keys should be
    committed to memory before sk_refcnt is set).
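
    A sketch of the ordering contract in sock_copy()/sk_clone() (layout and
    refcount value are illustrative):

        memcpy(nsk, osk, offsetof(struct sock, sk_refcnt));
        /* fields past sk_refcnt are copied separately, then: */
        smp_wmb();                      /* commit the copied fields ...     */
        atomic_set(&nsk->sk_refcnt, 2); /* ... before lockless readers look */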

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

10 Jul, 2009

2 commits

  • Adding an smp_mb__after_lock define to be used as an smp_mb() call
    after taking a lock.

    Making it a nop for x86, since {read|write|spin}_lock() on x86 are
    full memory barriers.
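
    The define, roughly as described (x86 override plus a generic fallback):

        /* x86: {read|write|spin}_lock() already imply a full barrier */
        #define smp_mb__after_lock()    do { } while (0)

        /* generic fallback for other architectures */
        #ifndef smp_mb__after_lock
        #define smp_mb__after_lock()    smp_mb()
        #endif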

    Signed-off-by: Jiri Olsa
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Jiri Olsa
     
  • Adding a memory barrier after the poll_wait function, paired with
    receive callbacks. Adding functions sock_poll_wait and sk_has_sleeper
    to wrap the memory barrier (both are sketched at the end of this entry).

    Without the memory barrier, the following race can happen.
    The race fires when the following code paths meet, and the tp->rcv_nxt
    and __add_wait_queue updates stay in CPU caches.

    CPU1                          CPU2

    sys_select                    receive packet
      ...                           ...
      __add_wait_queue              update tp->rcv_nxt
      ...                           ...
      tp->rcv_nxt check             sock_def_readable
      ...                           {
      schedule                        ...
                                      if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
                                              wake_up_interruptible(sk->sk_sleep)
                                      ...
                                    }

    If there were no caching the code would work ok, since the wait_queue
    and rcv_nxt updates are opposite to each other.

    Meaning that once tp->rcv_nxt is updated by CPU2, CPU1 either already
    passed the tp->rcv_nxt check and sleeps, or will get the new value for
    tp->rcv_nxt and will return with the new data mask.
    In both cases the process (CPU1) is being added to the wait queue, so the
    waitqueue_active (CPU2) call cannot miss and will wake up CPU1.

    The bad case is when the __add_wait_queue changes done by CPU1 stay in its
    cache, and so does the tp->rcv_nxt update on the CPU2 side. CPU1 will then
    end up calling schedule and sleeping forever if there is no more data on
    the socket.

    Calls to poll_wait in the following modules were omitted:
    net/bluetooth/af_bluetooth.c
    net/irda/af_irda.c
    net/irda/irnet/irnet_ppp.c
    net/mac80211/rc80211_pid_debugfs.c
    net/phonet/socket.c
    net/rds/af_rds.c
    net/rfkill/core.c
    net/sunrpc/cache.c
    net/sunrpc/rpc_pipe.c
    net/tipc/socket.c
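
    A paraphrased sketch of the two wrappers and their paired barriers:

        static inline int sk_has_sleeper(struct sock *sk)
        {
                smp_mb();       /* pairs with the barrier in sock_poll_wait() */
                return sk->sk_sleep && waitqueue_active(sk->sk_sleep);
        }

        static inline void sock_poll_wait(struct file *filp,
                                          wait_queue_head_t *wait_address,
                                          poll_table *p)
        {
                if (p && wait_address) {
                        poll_wait(filp, wait_address, p);
                        smp_mb(); /* pairs with the barrier in sk_has_sleeper() */
                }
        }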

    Signed-off-by: Jiri Olsa
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Jiri Olsa
     

25 Jun, 2009

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6:
    bnx2: Fix the behavior of ethtool when ONBOOT=no
    qla3xxx: Don't sleep while holding lock.
    qla3xxx: Give the PHY time to come out of reset.
    ipv4 routing: Ensure that route cache entries are usable and reclaimable with caching is off
    net: Move rx skb_orphan call to where needed
    ipv6: Use correct data types for ICMPv6 type and code
    net: let KS8842 driver depend on HAS_IOMEM
    can: let SJA1000 driver depend on HAS_IOMEM
    netxen: fix firmware init handshake
    netxen: fix build with without CONFIG_PM
    netfilter: xt_rateest: fix comparison with self
    netfilter: xt_quota: fix incomplete initialization
    netfilter: nf_log: fix direct userspace memory access in proc handler
    netfilter: fix some sparse endianess warnings
    netfilter: nf_conntrack: fix conntrack lookup race
    netfilter: nf_conntrack: fix confirmation race condition
    netfilter: nf_conntrack: death_by_timeout() fix

    Linus Torvalds
     

24 Jun, 2009

1 commit

  • In order to get the tun driver to account packets, we need to be
    able to receive packets with destructors set. To be on the safe
    side, I added an skb_orphan call for all protocols by default since
    some of them (IP in particular) cannot properly handle receiving
    packets with destructors set.

    Now it seems that at least one protocol (CAN) expects to be able
    to pass skb->sk through the rx path without getting clobbered.

    So this patch attempts to fix this properly by moving the skb_orphan
    call to where it's actually needed. In particular, I've added it
    to skb_set_owner_[rw] which is what most users of skb->destructor
    call.

    This is actually an improvement for tun too since it means that
    we only give back the amount charged to the socket when the skb
    is passed to another socket that will also be charged accordingly.
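
    A paraphrased sketch of skb_set_owner_r() after the change (orphan
    first, then charge the new owner):

        static inline void skb_set_owner_r(struct sk_buff *skb, struct sock *sk)
        {
                skb_orphan(skb);  /* release any previous owner's accounting */
                skb->sk = sk;
                skb->destructor = sock_rfree;
                atomic_add(skb->truesize, &sk->sk_rmem_alloc);
                sk_mem_charge(sk, skb->truesize);
        }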

    Signed-off-by: Herbert Xu
    Tested-by: Oliver Hartkopp
    Signed-off-by: David S. Miller

    Herbert Xu
     

19 Jun, 2009

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (55 commits)
    netxen: fix tx ring accounting
    netxen: fix detection of cut-thru firmware mode
    forcedeth: fix dma api mismatches
    atm: sk_wmem_alloc initial value is one
    net: correct off-by-one write allocations reports
    via-velocity : fix no link detection on boot
    Net / e100: Fix suspend of devices that cannot be power managed
    TI DaVinci EMAC : Fix rmmod error
    net: group address list and its count
    ipv4: Fix fib_trie rebalancing, part 2
    pkt_sched: Update drops stats in act_police
    sky2: version 1.23
    sky2: add GRO support
    sky2: skb recycling
    sky2: reduce default transmit ring
    sky2: receive counter update
    sky2: fix shutdown synchronization
    sky2: PCI irq issues
    sky2: more receive shutdown
    sky2: turn off pause during shutdown
    ...

    Manually fix trivial conflict in net/core/skbuff.c due to kmemcheck

    Linus Torvalds
     

17 Jun, 2009

2 commits

  • commit 2b85a34e911bf483c27cfdd124aeb1605145dc80
    (net: No more expensive sock_hold()/sock_put() on each tx)
    changed initial sk_wmem_alloc value.

    Some protocols check sk_wmem_alloc value to determine if a timer
    must delay socket deallocation. We must take care of the sk_wmem_alloc
    value being one instead of zero when no write allocations are pending.

    Reported by Ingo Molnar, with a full diagnostic from David Miller.

    This patch introduces three helpers to get read/write allocations,
    and a followup patch will use these helpers to report correct
    write allocations to userspace.
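
    A paraphrased sketch of the three helpers:

        static inline int sk_wmem_alloc_get(const struct sock *sk)
        {
                return atomic_read(&sk->sk_wmem_alloc) - 1; /* hide the offset */
        }

        static inline int sk_rmem_alloc_get(const struct sock *sk)
        {
                return atomic_read(&sk->sk_rmem_alloc);
        }

        static inline int sk_has_allocations(const struct sock *sk)
        {
                return sk_wmem_alloc_get(sk) || sk_rmem_alloc_get(sk);
        }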

    Reported-by: Ingo Molnar
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • * 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/vegard/kmemcheck: (39 commits)
    signal: fix __send_signal() false positive kmemcheck warning
    fs: fix do_mount_root() false positive kmemcheck warning
    fs: introduce __getname_gfp()
    trace: annotate bitfields in struct ring_buffer_event
    net: annotate struct sock bitfield
    c2port: annotate bitfield for kmemcheck
    net: annotate inet_timewait_sock bitfields
    ieee1394/csr1212: fix false positive kmemcheck report
    ieee1394: annotate bitfield
    net: annotate bitfields in struct inet_sock
    net: use kmemcheck bitfields API for skbuff
    kmemcheck: introduce bitfield API
    kmemcheck: add opcode self-testing at boot
    x86: unify pte_hidden
    x86: make _PAGE_HIDDEN conditional
    kmemcheck: make kconfig accessible for other architectures
    kmemcheck: enable in the x86 Kconfig
    kmemcheck: add hooks for the page allocator
    kmemcheck: add hooks for page- and sg-dma-mappings
    kmemcheck: don't track page tables
    ...

    Linus Torvalds
     

15 Jun, 2009

1 commit

  • 2009/2/24 Ingo Molnar :
    > ok, this is the last warning i have from today's overnight -tip
    > testruns - a 32-bit system warning in sock_init_data():
    >
    > [ 2.610389] NET: Registered protocol family 16
    > [ 2.616138] initcall netlink_proto_init+0x0/0x170 returned 0 after 7812 usecs
    > [ 2.620010] WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (f642c184)
    > [ 2.624002] 010000000200000000000000604990c000000000000000000000000000000000
    > [ 2.634076] i i i i i i u u i i i i i i i i i i i i i i i i i i i i i i i i
    > [ 2.641038] ^
    > [ 2.643376]
    > [ 2.644004] Pid: 1, comm: swapper Not tainted (2.6.29-rc6-tip-01751-g4d1c22c-dirty #885)
    > [ 2.648003] EIP: 0060:[] EFLAGS: 00010282 CPU: 0
    > [ 2.652008] EIP is at sock_init_data+0xa1/0x190
    > [ 2.656003] EAX: 0001a800 EBX: f6836c00 ECX: 00463000 EDX: c0e46fe0
    > [ 2.660003] ESI: f642c180 EDI: c0b83088 EBP: f6863ed8 ESP: c0c412ec
    > [ 2.664003] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
    > [ 2.668003] CR0: 8005003b CR2: f682c400 CR3: 00b91000 CR4: 000006f0
    > [ 2.672003] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
    > [ 2.676003] DR6: ffff4ff0 DR7: 00000400
    > [ 2.680002] [] __netlink_create+0x35/0xa0
    > [ 2.684002] [] netlink_kernel_create+0x4c/0x140
    > [ 2.688002] [] rtnetlink_net_init+0x1e/0x40
    > [ 2.696002] [] register_pernet_operations+0x11/0x30
    > [ 2.700002] [] register_pernet_subsys+0x1c/0x30
    > [ 2.704002] [] rtnetlink_init+0x4c/0x100
    > [ 2.708002] [] netlink_proto_init+0x159/0x170
    > [ 2.712002] [] do_one_initcall+0x24/0x150
    > [ 2.716002] [] do_initcalls+0x27/0x40
    > [ 2.723201] [] do_basic_setup+0x1c/0x20
    > [ 2.728002] [] kernel_init+0x5a/0xa0
    > [ 2.732002] [] kernel_thread_helper+0x7/0x10
    > [ 2.736002] [] 0xffffffff

    We fix this false positive by annotating the bitfield in struct
    sock.
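
    The annotation, roughly (kmemcheck then treats the whole bitfield as
    one initialized unit instead of flagging the untouched bits):

        kmemcheck_bitfield_begin(flags);
        unsigned char           sk_shutdown  : 2,
                                sk_no_check  : 2,
                                sk_userlocks : 4;
        kmemcheck_bitfield_end(flags);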

    Reported-by: Ingo Molnar
    Signed-off-by: Vegard Nossum

    Vegard Nossum
     

11 Jun, 2009

1 commit

  • One of the problems with sock memory accounting is that it uses
    a pair of sock_hold()/sock_put() calls for each transmitted packet.

    This slows down bidirectional flows because the receive path
    also needs to take a refcount on the socket, and might use a different
    cpu than the transmit path or transmit completion path. So these
    two atomic operations also trigger cache line bounces.

    We can see this in tx or tx/rx workloads (media gateways for example),
    where sock_wfree() can be in top five functions in profiles.

    We use this sock_hold()/sock_put() so that sock freeing
    is delayed until all tx packets are completed.

    As we also update sk_wmem_alloc, we can offset sk_wmem_alloc
    by one unit at init time, until sk_free() is called.
    Once sk_free() is called, we atomic_dec_and_test(sk_wmem_alloc)
    to decrement the initial offset and atomically check if any packets
    are in flight.

    skb_set_owner_w() doesn't call sock_hold() anymore.

    sock_wfree() doesn't call sock_put() anymore, but checks whether
    sk_wmem_alloc has reached 0 to perform the final freeing.

    The drawback is that an skb->truesize error could lead to unfreeable
    sockets, or even worse, to prematurely calling __sk_free() on a live
    socket.

    Nice speedups on SMP: tbench, for example, goes from 2691 MB/s to 2711 MB/s
    on my 8 cpu dev machine, even though tbench was not really hitting the
    sk_refcnt contention point. 5% speedup on a UDP transmit workload (depends
    on the number of flows), lowering TX completion cpu usage.
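
    A paraphrased sketch of the resulting lifetime rule (the real
    sock_wfree() also handles sk_write_space()):

        void sk_free(struct sock *sk)
        {
                /* drop the initial offset; free now only if no tx in flight */
                if (atomic_dec_and_test(&sk->sk_wmem_alloc))
                        __sk_free(sk);
        }

        void sock_wfree(struct sk_buff *skb)
        {
                struct sock *sk = skb->sk;

                /* the last tx completion observes zero and frees the sock */
                if (atomic_sub_and_test(skb->truesize, &sk->sk_wmem_alloc))
                        __sk_free(sk);
        }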

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     


18 Feb, 2009

1 commit

  • A long time ago we had bugs, primarily in TCP, where we would modify
    skb->truesize (for TSO queue collapsing) in ways which would corrupt
    the socket memory accounting.

    skb_truesize_check() was added in order to try and catch this error
    more systematically.

    However this debugging check has morphed into a Frankenstein of sorts
    and these days it does nothing other than catch false-positives.

    Signed-off-by: David S. Miller

    David S. Miller
     

16 Feb, 2009

1 commit

  • The overlap with the old SO_TIMESTAMP[NS] options is handled so
    that time stamping in software (net_enable_timestamp()) is
    enabled when SO_TIMESTAMP[NS] and/or SO_TIMESTAMPING_RX_SOFTWARE
    is set. It's disabled if all of these are off.
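
    A sketch of the rule (flag names per that patch series; the real code
    sits in the setsockopt handlers):

        if (sock_flag(sk, SOCK_RCVTSTAMP) ||            /* SO_TIMESTAMP[NS] */
            sock_flag(sk, SOCK_TIMESTAMPING_RX_SOFTWARE))
                net_enable_timestamp();
        else
                net_disable_timestamp();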

    Signed-off-by: Patrick Ohly
    Signed-off-by: David S. Miller

    Patrick Ohly
     


13 Feb, 2009

1 commit

  • The problem is that in_atomic() will return false inside spinlocks if
    CONFIG_PREEMPT=n. This will lead to deadlockable GFP_KERNEL allocations
    from spinlocked regions.

    Secondly, if CONFIG_PREEMPT=y, this bug solves itself because networking
    will instead use GFP_ATOMIC from this callsite. Hence we won't get the
    might_sleep() debugging warnings which would have informed us of the buggy
    callsites.

    Solve both these problems by switching to in_interrupt(). Now, if someone
    runs a gfp_any() allocation from inside a spinlock we will get the warning
    if CONFIG_PREEMPT=y.

    I reviewed all callsites and most of them were too complex for my little
    brain and none of them documented their interface requirements. I have no
    idea what this patch will do.
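
    After the switch, the helper reads (sketch):

        static inline gfp_t gfp_any(void)
        {
                return in_interrupt() ? GFP_ATOMIC : GFP_KERNEL;
        }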

    Signed-off-by: Andrew Morton
    Signed-off-by: David S. Miller

    Andrew Morton
     

05 Feb, 2009

1 commit

  • The function sock_alloc_send_pskb is completely useless if not
    exported since most of the code in it won't be used as is. In
    fact, this code has already been duplicated in the tun driver.

    Now that we need accounting in the tun driver, we can in fact
    use this function as is. So this patch marks it for export again.
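
    The change itself is essentially the one-line export:

        EXPORT_SYMBOL(sock_alloc_send_pskb);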

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     

26 Nov, 2008

2 commits

  • Instead of using one atomic_t per protocol, use a percpu_counter
    for "orphan_count", to reduce cache line contention on
    heavy duty network servers.
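
    A sketch of the call-site change (orphan_count's type changes from
    atomic_t to struct percpu_counter):

        /* before: one shared atomic_t, a contended cache line on big SMP */
        atomic_inc(sk->sk_prot->orphan_count);

        /* after: per-cpu deltas, folded together only when actually read */
        percpu_counter_inc(sk->sk_prot->orphan_count);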

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Instead of using one atomic_t per protocol, use a percpu_counter
    for "sockets_allocated", to reduce cache line contention on
    heavy duty network servers.

    Note: We revert commit 248969ae31e1b3276fc4399d67ce29a5d81e6fd9
    (net: af_unix can make unix_nr_socks visbile in /proc), since it is
    no longer needed after the sock_prot_inuse_add() addition.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     


17 Nov, 2008

1 commit

  • This is a straightforward patch, using hlist_nulls infrastructure.

    The RCUification of UDP was already done two weeks ago.

    Using hlist_nulls permits us to avoid some memory barriers, both
    at lookup time and delete time.

    The patch is large because it adds new macros to include/net/sock.h.
    These macros will be used by TCP & DCCP in the next patch.
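
    The lookup idiom these macros enable looks roughly like this (sketch;
    match() is a placeholder and 'slot' is the bucket index):

        begin:
        result = NULL;
        sk_nulls_for_each_rcu(sk, node, &head->chain) {
                if (match(sk)) {
                        result = sk;
                        break;
                }
        }
        /* each chain ends in a 'nulls' value naming its slot; a different
         * value means the tail socket moved chains under us: restart */
        if (!result && get_nulls_value(node) != slot)
                goto begin;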

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

14 Nov, 2008

1 commit

  • fix this warning:

    net/bluetooth/af_bluetooth.c:60: warning: ‘bt_key_strings’ defined but not used
    net/bluetooth/af_bluetooth.c:71: warning: ‘bt_slock_key_strings’ defined but not used

    This is a lockdep macro problem in the !LOCKDEP case.

    We cannot convert it to an inline because the macro works on multiple types,
    but we can mark the parameter used.

    [ also clean up a misaligned tab in sock_lock_init_class_and_name() ]

    [ also remove #ifdefs from around af_family_clock_key strings - which
    were certainly added to get rid of the ugly build warnings. ]
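
    The '!LOCKDEP stub marks its arguments used' idiom, along these lines:

        /* illustrative: referencing key/name keeps the string tables 'used' */
        # define lockdep_set_class_and_name(lock, key, name) \
                do { (void)(key); (void)(name); } while (0)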

    Signed-off-by: Ingo Molnar
    Signed-off-by: David S. Miller

    Ingo Molnar
     


30 Oct, 2008

1 commit

  • Corey Minyard found a race added in commit 271b72c7fa82c2c7a795bc16896149933110672d
    (udp: RCU handling for Unicast packets.)

    "If the socket is moved from one list to another list in-between the
    time the hash is calculated and the next field is accessed, and the
    socket has moved to the end of the new list, the traversal will not
    complete properly on the list it should have, since the socket will
    be on the end of the new list and there's not a way to tell it's on a
    new list and restart the list traversal. I think that this can be
    solved by pre-fetching the "next" field (with proper barriers) before
    checking the hash."

    This patch corrects this problem, introducing a new
    sk_for_each_rcu_safenext() macro.
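
    A conceptual sketch of the 'safenext' walk: the next pointer is loaded
    *before* the body examines the entry, so a socket moved to another list
    cannot strand the traversal (macro details paraphrased):

        for (pos = rcu_dereference(head->first);
             pos && ({ next = rcu_dereference(pos->next); 1; });
             pos = next) {
                /* ... hash check on the entry goes here ... */
        }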

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

29 Oct, 2008

3 commits

  • Goals are :

    1) Optimizing handling of incoming Unicast UDP frames, so that no memory
    writes should happen in the fast path.

    Note: Multicasts and broadcasts still will need to take a lock,
    because doing a full lockless lookup in this case is difficult.

    2) No expensive operations in the socket bind/unhash phases :
    - No expensive synchronize_rcu() calls.

    - No rcu_head added to the socket structure, which would increase
    memory needs, but, more importantly, would force us to use call_rcu()
    calls, which have the bad property of making socket structures cold.
    (The rcu grace period between socket freeing and its potential reuse
    makes the socket cold in the CPU cache.)
    David did a previous patch using call_rcu() and noticed a 20%
    impact on TCP connection rates.
    Quoting Christoph Lameter:
    "Right. That results in cacheline cooldown. You'd want to recycle
    the object as they are cache hot on a per cpu basis. That is screwed
    up by the delayed regular rcu processing. We have seen multiple
    regressions due to cacheline cooldown.
    The only choice in cacheline hot sensitive areas is to deal with the
    complexity that comes with SLAB_DESTROY_BY_RCU or give up on RCU."

    - Because udp sockets are allocated from a dedicated kmem_cache,
    use of SLAB_DESTROY_BY_RCU can help here.

    Theory of operation :
    ---------------------

    As the lookup is lockfree (using rcu_read_lock()/rcu_read_unlock()),
    special attention must be taken by readers and writers.

    Use of SLAB_DESTROY_BY_RCU is tricky too, because a socket can be freed,
    reused, and inserted in a different chain or, in the worst case, in the
    same chain, while readers are doing lookups at the same time.

    In order to avoid loops, a reader must check that each socket found in a
    chain really belongs to the chain the reader was traversing. If it finds a
    mismatch, the lookup must start again at the beginning. This *restart* loop
    is the reason we had to use rdlock for the multicast case, because
    we don't want to send the same message several times to the same socket.

    We use RCU only for fast path.
    Thus, /proc/net/udp still takes spinlocks.
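
    A conceptual sketch of the reader-side rules under SLAB_DESTROY_BY_RCU
    (match() is a placeholder; the real lookup also rechecks its keys after
    taking the refcount):

        rcu_read_lock();
        begin:
        sk_for_each_rcu_safenext(sk, node, &chain[slot], next) {
                if (sk->sk_hash != hash || !match(sk))
                        continue;
                if (!atomic_inc_not_zero(&sk->sk_refcnt))
                        goto begin;             /* object is being freed */
                if (sk->sk_hash != hash) {      /* freed and reused elsewhere */
                        sock_put(sk);
                        goto begin;
                }
                break;                          /* genuine match, reference held */
        }
        rcu_read_unlock();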

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • UDP sockets are hashed in a 128 slots hash table.

    This hash table is protected by *one* rwlock.

    This rwlock is readlocked each time an incoming UDP message is handled.

    This rwlock is writelocked each time a socket must be inserted in the
    hash table (bind time) or deleted from this table (close time).

    This is not scalable on SMP machines :

    1) Even in read mode, lock() and unlock() are atomic operations and
    must dirty a contended cache line, shared by all cpus.

    2) A writer might be starved if many readers are 'in flight'. This can
    happen on a machine with some NIC receiving many UDP messages. User
    processes can then be delayed a long time at socket creation/dismantle
    time.

    This patch prepares the RCU migration by introducing 'struct udp_table'
    and 'struct udp_hslot', and using one spinlock per chain, to reduce
    contention on the central rwlock.

    Introducing one spinlock per chain reduces latencies for port
    randomization on heavily loaded UDP servers. This also speeds up
    bindings to specific ports.

    udp_lib_unhash() was uninlined, having become too big.

    Some cleanups were done to ease review of following patch
    (RCUification of UDP Unicast lookups)
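
    A sketch of the new structures described above:

        struct udp_hslot {
                struct hlist_head       head;
                spinlock_t              lock;   /* one lock per chain */
        };

        struct udp_table {
                struct udp_hslot        hash[UDP_HTABLE_SIZE];  /* 128 slots */
        };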

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • ifdef out
    * struct sk_buff::sp (pointer)
    * struct dst_entry::xfrm (pointer)
    * struct sock::sk_policy (2 pointers)
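
    A sketch of the shape of the change (one example; each field lives in
    its own struct, guarded the same way):

        #ifdef CONFIG_XFRM
                struct xfrm_policy      *sk_policy[2];  /* in struct sock */
        #endif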

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: David S. Miller

    Alexey Dobriyan