23 Feb, 2010

1 commit

  • Convert AF_PACKET to use RCU, eliminating one more reader/writer lock.

    There is no need for a real sk_del_node_init_rcu(), because
    sk_del_node_init is already doing the equivalent of
    hlist_del_init_rcu(); comments were added to make that obvious.
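
    For reference, the rculist variant looks like the sketch below;
    sk_del_node_init() similarly clears only pprev (via sk_node_init()),
    leaving ->next intact so concurrent RCU readers can finish walking
    the chain:

        static inline void hlist_del_init_rcu(struct hlist_node *n)
        {
                if (!hlist_unhashed(n)) {
                        __hlist_del(n);  /* n->next still points into the old chain */
                        n->pprev = NULL; /* marks the node unhashed */
                }
        }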

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    stephen hemminger
     


09 Nov, 2009

2 commits

  • Extends udp_table to contain a secondary hash table.

    The socket anchor for this second hash is free, because UDP
    doesn't use skc_bind_node: we define a union to hold both
    skc_bind_node and a new hlist_nulls_node, udp_portaddr_node.

    udp_lib_get_port() inserts sockets into the second hash chain
    (additional cost of one atomic op).

    udp_lib_unhash() deletes sockets from the second hash chain
    (additional cost of one atomic op).

    Note: no spinlock lockdep annotation is needed, because the lock
    for the secondary hash chain is always taken after the lock for
    the primary hash chain.
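
    A sketch of the union described above (field names as given; exact
    placement paraphrased):

        union {
                struct hlist_node       skc_bind_node;     /* unused by UDP */
                struct hlist_nulls_node udp_portaddr_node; /* new (addr, port) chain */
        };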

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Union sk_hash with two u16 hashes for udp (no extra memory taken)

    One 16-bit hash on the (local port) value (the previous udp 'hash').

    One 16-bit hash on the (local address, local port) values, initialized
    but not yet used. This second hash uses the Jenkins hash for better
    distribution.

    Because the 'port' is XORed in later, a partial hash is performed
    on local address + net_hash_mix(net).
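
    A sketch of the layout and of the partial-hash idea (the jhash_1word()
    helper below illustrates 'Jenkins hash seeded by net_hash_mix()', it is
    not a verbatim quote of the patch):

        union {
                unsigned int    skc_hash;         /* single hash, e.g. TCP */
                __u16           skc_u16hashes[2]; /* two udp hashes, same 4 bytes */
        };

        /* partial hash on the local address; the local port is XORed in later */
        static u32 udp4_portaddr_hash(struct net *net, __be32 saddr, unsigned int port)
        {
                return jhash_1word((__force u32)saddr, net_hash_mix(net)) ^ port;
        }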

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 Oct, 2009

1 commit

  • Introduce sk_tx_queue_mapping, and functions that set, test and
    get this value. Reset sk_tx_queue_mapping to -1 whenever the dst
    cache is set/reset, and in socket alloc. Setting txq to -1 and
    using valid txq values (0 to n-1) allows the tx path to use the
    value of sk_tx_queue_mapping directly instead of subtracting 1 on
    every tx.
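
    The helpers as described might look like this sketch (paraphrased):

        static inline void sk_tx_queue_set(struct sock *sk, int tx_queue)
        {
                sk->sk_tx_queue_mapping = tx_queue;
        }

        static inline void sk_tx_queue_clear(struct sock *sk)
        {
                sk->sk_tx_queue_mapping = -1;   /* -1 == no queue recorded */
        }

        static inline int sk_tx_queue_get(const struct sock *sk)
        {
                return sk ? sk->sk_tx_queue_mapping : -1;
        }

        static inline bool sk_tx_queue_recorded(const struct sock *sk)
        {
                return sk && sk->sk_tx_queue_mapping >= 0;
        }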

    Signed-off-by: Krishna Kumar
    Signed-off-by: David S. Miller

    Krishna Kumar
     


13 Oct, 2009

1 commit

  • Create a new socket level option to report number of queue overflows

    Recently I augmented the AF_PACKET protocol to report the number of frames lost
    on the socket receive queue between any two enqueued frames. This value was
    exported via a SOL_PACKET level cmsg. After I completed that work it was
    requested that this feature be generalized so that any datagram oriented socket
    could make use of this option. As such I've created this patch. It creates a
    new SOL_SOCKET level option called SO_RXQ_OVFL, which when enabled exports a
    SOL_SOCKET level cmsg that reports the number of times the sk_receive_queue
    overflowed between any two given frames. It also augments the AF_PACKET
    protocol to take advantage of this new feature (as it previously did not touch
    sk->sk_drops, which this patch uses to record the overflow count). Tested
    successfully by me.

    Notes:

    1) Unlike my previous patch, this patch simply records the sk_drops value, which
    is not a number of drops between packets, but rather a total number of drops.
    Deltas must be computed in user space.

    2) While this patch currently works with datagram oriented protocols, it will
    also be accepted by non-datagram oriented protocols. I'm not sure if that's
    agreeable to everyone, but my argument in favor of doing so is that, for those
    protocols which aren't applicable to this option, sk_drops will always be zero,
    and reporting no drops on a receive queue that isn't used for those
    non-participating protocols seems reasonable to me. This also saves us having
    to code in a per-protocol opt-in mechanism.

    3) This applies cleanly to net-next assuming that commit
    977750076d98c7ff6cbda51858bb5a5894a9d9ab (my af packet cmsg patch) is reverted
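
    A minimal userspace sketch of consuming the new cmsg (error handling
    trimmed; the fallback SO_RXQ_OVFL value is an assumption for older
    libc headers):

        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/socket.h>
        #include <sys/uio.h>

        #ifndef SO_RXQ_OVFL
        #define SO_RXQ_OVFL 40                  /* asm-generic value */
        #endif

        static void read_with_dropcount(int fd)
        {
                char data[2048], cbuf[CMSG_SPACE(sizeof(uint32_t))];
                struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
                struct msghdr msg = {
                        .msg_iov = &iov, .msg_iovlen = 1,
                        .msg_control = cbuf, .msg_controllen = sizeof(cbuf),
                };
                struct cmsghdr *cmsg;
                int on = 1;

                setsockopt(fd, SOL_SOCKET, SO_RXQ_OVFL, &on, sizeof(on));
                if (recvmsg(fd, &msg, 0) < 0)
                        return;
                for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
                        if (cmsg->cmsg_level == SOL_SOCKET &&
                            cmsg->cmsg_type == SO_RXQ_OVFL) {
                                uint32_t drops;

                                memcpy(&drops, CMSG_DATA(cmsg), sizeof(drops));
                                /* cumulative counter: deltas are userspace's job */
                                printf("drops so far: %u\n", drops);
                        }
                }
        }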

    Signed-off-by: Neil Horman
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Neil Horman
     

12 Oct, 2009

1 commit

  • Since commit a98b65a3 (net: annotate struct sock bitfield), we lost
    8 bytes in struct sock on 64-bit arches because of the
    kmemcheck_bitfield_end(flags) misplacement.

    Fix this by grouping sk_shutdown, sk_no_check, sk_userlocks,
    sk_protocol and sk_type together in the 32-bit 'flags' bitfield.
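
    The regrouped field might look like this sketch (the five members fill
    exactly 32 bits):

        kmemcheck_bitfield_begin(flags);
        unsigned int            sk_shutdown  : 2,
                                sk_no_check  : 2,
                                sk_userlocks : 4,
                                sk_protocol  : 8,
                                sk_type      : 16;
        kmemcheck_bitfield_end(flags);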

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

07 Oct, 2009

1 commit

  • An incoming datagram must bring a *lot* of cache lines into the CPU
    cache, in particular (other parts omitted: hash chains, IP route cache...):

    On 32bit arches :

    offsetof(struct sock, sk_rcvbuf) =0x30 (read)
    offsetof(struct sock, sk_lock) =0x34 (rw)

    offsetof(struct sock, sk_sleep) =0x50 (read)
    offsetof(struct sock, sk_rmem_alloc) =0x64 (rw)
    offsetof(struct sock, sk_receive_queue)=0x74 (rw)

    offsetof(struct sock, sk_forward_alloc)=0x98 (rw)

    offsetof(struct sock, sk_callback_lock)=0xcc (rw)
    offsetof(struct sock, sk_drops) =0xd8 (read if we add dropcount support, rw if frame dropped)
    offsetof(struct sock, sk_filter) =0xf8 (read)

    offsetof(struct sock, sk_socket) =0x138 (read)

    offsetof(struct sock, sk_data_ready) =0x15c (read)

    We can avoid referencing sk->sk_socket and socket->fasync_list on sockets
    with no fasync() structures. (The socket->fasync_list pointer is probably
    already in cache because it shares a cache line with socket->wait, i.e. the
    location pointed to by sk->sk_sleep.)

    This avoids one cache line load per incoming packet in the common case
    (no fasync()).

    We can leave (or even move, in a future patch) sk->sk_socket in a cold
    location.
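
    A paraphrased sketch of the resulting test (a socket flag records whether
    fasync was ever requested, so the common case never touches sk_socket):

        static inline void sk_wake_async(struct sock *sk, int how, int band)
        {
                if (sock_flag(sk, SOCK_FASYNC)) /* common case: flag clear */
                        sock_wake_async(sk->sk_socket, how, band);
        }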

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

01 Oct, 2009

1 commit

  • This provides safety against negative optlen at the type
    level instead of depending upon (sometimes non-trivial)
    checks against this sprinkled all over the place, in
    each and every implementation.

    Based upon work done by Arjan van de Ven and feedback
    from Linus Torvalds.
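
    Concretely, the option-length parameter becomes unsigned in the hooks;
    a sketch of the proto_ops member:

        int (*setsockopt)(struct socket *sock, int level, int optname,
                          char __user *optval, unsigned int optlen);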

    Signed-off-by: David S. Miller

    David S. Miller
     

17 Jul, 2009

1 commit

  • Commit e912b1142be8f1e2c71c71001dc992c6e5eb2ec1
    (net: sk_prot_alloc() should not blindly overwrite memory)
    took care of not zeroing whole new socket at allocation time.

    sock_copy() is another spot where we should be very careful.
    We should not set refcnt to a non-null value until we are sure
    other fields are correctly set up, or a lockless reader could
    catch this socket by mistake while it is not fully (re)initialized.

    This patch puts sk_node & sk_refcnt at the very beginning of
    struct sock to ease sock_copy() & sk_prot_alloc()'s job.

    We add an appropriate smp_wmb() before each sk_refcnt initialization
    to match our RCU requirements (changes to sock keys should be
    committed to memory before sk_refcnt is set).
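
    A sketch of the ordering contract in sock_copy()/sk_clone() (layout and
    refcount value are illustrative):

        memcpy(nsk, osk, offsetof(struct sock, sk_refcnt));
        /* fields past sk_refcnt are copied separately, then: */
        smp_wmb();                      /* commit the copied fields ...     */
        atomic_set(&nsk->sk_refcnt, 2); /* ... before lockless readers look */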

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

10 Jul, 2009

2 commits

  • Adding an smp_mb__after_lock define to be used as an smp_mb() call
    after taking a lock.

    Making it a nop for x86, since {read|write|spin}_lock() on x86 are
    full memory barriers.
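
    The define, roughly as described (x86 override plus a generic fallback):

        /* x86: {read|write|spin}_lock() already imply a full barrier */
        #define smp_mb__after_lock()    do { } while (0)

        /* generic fallback for other architectures */
        #ifndef smp_mb__after_lock
        #define smp_mb__after_lock()    smp_mb()
        #endif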

    Signed-off-by: Jiri Olsa
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Jiri Olsa
     
  • Adding a memory barrier after the poll_wait function, paired with
    receive callbacks. Adding functions sock_poll_wait and sk_has_sleeper
    to wrap the memory barrier (both are sketched at the end of this entry).

    Without the memory barrier, the following race can happen.
    The race fires when the following code paths meet, and the tp->rcv_nxt
    and __add_wait_queue updates stay in CPU caches.

    CPU1                          CPU2

    sys_select                    receive packet
      ...                           ...
      __add_wait_queue              update tp->rcv_nxt
      ...                           ...
      tp->rcv_nxt check             sock_def_readable
      ...                           {
      schedule                        ...
                                      if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
                                              wake_up_interruptible(sk->sk_sleep)
                                      ...
                                    }

    If there were no caching the code would work ok, since the wait_queue
    and rcv_nxt updates are opposite to each other.

    Meaning that once tp->rcv_nxt is updated by CPU2, CPU1 either already
    passed the tp->rcv_nxt check and sleeps, or will get the new value for
    tp->rcv_nxt and will return with the new data mask.
    In both cases the process (CPU1) is being added to the wait queue, so the
    waitqueue_active (CPU2) call cannot miss and will wake up CPU1.

    The bad case is when the __add_wait_queue changes done by CPU1 stay in its
    cache, and so does the tp->rcv_nxt update on the CPU2 side. CPU1 will then
    end up calling schedule and sleeping forever if there is no more data on
    the socket.

    Calls to poll_wait in the following modules were omitted:
    net/bluetooth/af_bluetooth.c
    net/irda/af_irda.c
    net/irda/irnet/irnet_ppp.c
    net/mac80211/rc80211_pid_debugfs.c
    net/phonet/socket.c
    net/rds/af_rds.c
    net/rfkill/core.c
    net/sunrpc/cache.c
    net/sunrpc/rpc_pipe.c
    net/tipc/socket.c
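
    A paraphrased sketch of the two wrappers and their paired barriers:

        static inline int sk_has_sleeper(struct sock *sk)
        {
                smp_mb();       /* pairs with the barrier in sock_poll_wait() */
                return sk->sk_sleep && waitqueue_active(sk->sk_sleep);
        }

        static inline void sock_poll_wait(struct file *filp,
                                          wait_queue_head_t *wait_address,
                                          poll_table *p)
        {
                if (p && wait_address) {
                        poll_wait(filp, wait_address, p);
                        smp_mb(); /* pairs with the barrier in sk_has_sleeper() */
                }
        }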

    Signed-off-by: Jiri Olsa
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Jiri Olsa
     

25 Jun, 2009

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6:
    bnx2: Fix the behavior of ethtool when ONBOOT=no
    qla3xxx: Don't sleep while holding lock.
    qla3xxx: Give the PHY time to come out of reset.
    ipv4 routing: Ensure that route cache entries are usable and reclaimable with caching is off
    net: Move rx skb_orphan call to where needed
    ipv6: Use correct data types for ICMPv6 type and code
    net: let KS8842 driver depend on HAS_IOMEM
    can: let SJA1000 driver depend on HAS_IOMEM
    netxen: fix firmware init handshake
    netxen: fix build with without CONFIG_PM
    netfilter: xt_rateest: fix comparison with self
    netfilter: xt_quota: fix incomplete initialization
    netfilter: nf_log: fix direct userspace memory access in proc handler
    netfilter: fix some sparse endianess warnings
    netfilter: nf_conntrack: fix conntrack lookup race
    netfilter: nf_conntrack: fix confirmation race condition
    netfilter: nf_conntrack: death_by_timeout() fix

    Linus Torvalds
     

24 Jun, 2009

1 commit

  • In order to get the tun driver to account packets, we need to be
    able to receive packets with destructors set. To be on the safe
    side, I added an skb_orphan call for all protocols by default since
    some of them (IP in particular) cannot properly handle receiving
    packets with destructors set.

    Now it seems that at least one protocol (CAN) expects to be able
    to pass skb->sk through the rx path without getting clobbered.

    So this patch attempts to fix this properly by moving the skb_orphan
    call to where it's actually needed. In particular, I've added it
    to skb_set_owner_[rw] which is what most users of skb->destructor
    call.

    This is actually an improvement for tun too since it means that
    we only give back the amount charged to the socket when the skb
    is passed to another socket that will also be charged accordingly.
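
    A paraphrased sketch of skb_set_owner_r() after the change (orphan
    first, then charge the new owner):

        static inline void skb_set_owner_r(struct sk_buff *skb, struct sock *sk)
        {
                skb_orphan(skb);  /* release any previous owner's accounting */
                skb->sk = sk;
                skb->destructor = sock_rfree;
                atomic_add(skb->truesize, &sk->sk_rmem_alloc);
                sk_mem_charge(sk, skb->truesize);
        }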

    Signed-off-by: Herbert Xu
    Tested-by: Oliver Hartkopp
    Signed-off-by: David S. Miller

    Herbert Xu
     

19 Jun, 2009

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (55 commits)
    netxen: fix tx ring accounting
    netxen: fix detection of cut-thru firmware mode
    forcedeth: fix dma api mismatches
    atm: sk_wmem_alloc initial value is one
    net: correct off-by-one write allocations reports
    via-velocity : fix no link detection on boot
    Net / e100: Fix suspend of devices that cannot be power managed
    TI DaVinci EMAC : Fix rmmod error
    net: group address list and its count
    ipv4: Fix fib_trie rebalancing, part 2
    pkt_sched: Update drops stats in act_police
    sky2: version 1.23
    sky2: add GRO support
    sky2: skb recycling
    sky2: reduce default transmit ring
    sky2: receive counter update
    sky2: fix shutdown synchronization
    sky2: PCI irq issues
    sky2: more receive shutdown
    sky2: turn off pause during shutdown
    ...

    Manually fix trivial conflict in net/core/skbuff.c due to kmemcheck

    Linus Torvalds
     

17 Jun, 2009

2 commits

  • commit 2b85a34e911bf483c27cfdd124aeb1605145dc80
    (net: No more expensive sock_hold()/sock_put() on each tx)
    changed initial sk_wmem_alloc value.

    Some protocols check sk_wmem_alloc value to determine if a timer
    must delay socket deallocation. We must take care of the sk_wmem_alloc
    value being one instead of zero when no write allocations are pending.

    Reported by Ingo Molnar, with a full diagnostic from David Miller.

    This patch introduces three helpers to get read/write allocations,
    and a followup patch will use these helpers to report correct
    write allocations to userspace.
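
    A paraphrased sketch of the three helpers:

        static inline int sk_wmem_alloc_get(const struct sock *sk)
        {
                return atomic_read(&sk->sk_wmem_alloc) - 1; /* hide the offset */
        }

        static inline int sk_rmem_alloc_get(const struct sock *sk)
        {
                return atomic_read(&sk->sk_rmem_alloc);
        }

        static inline int sk_has_allocations(const struct sock *sk)
        {
                return sk_wmem_alloc_get(sk) || sk_rmem_alloc_get(sk);
        }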

    Reported-by: Ingo Molnar
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • * 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/vegard/kmemcheck: (39 commits)
    signal: fix __send_signal() false positive kmemcheck warning
    fs: fix do_mount_root() false positive kmemcheck warning
    fs: introduce __getname_gfp()
    trace: annotate bitfields in struct ring_buffer_event
    net: annotate struct sock bitfield
    c2port: annotate bitfield for kmemcheck
    net: annotate inet_timewait_sock bitfields
    ieee1394/csr1212: fix false positive kmemcheck report
    ieee1394: annotate bitfield
    net: annotate bitfields in struct inet_sock
    net: use kmemcheck bitfields API for skbuff
    kmemcheck: introduce bitfield API
    kmemcheck: add opcode self-testing at boot
    x86: unify pte_hidden
    x86: make _PAGE_HIDDEN conditional
    kmemcheck: make kconfig accessible for other architectures
    kmemcheck: enable in the x86 Kconfig
    kmemcheck: add hooks for the page allocator
    kmemcheck: add hooks for page- and sg-dma-mappings
    kmemcheck: don't track page tables
    ...

    Linus Torvalds
     

15 Jun, 2009

1 commit

  • 2009/2/24 Ingo Molnar :
    > ok, this is the last warning i have from today's overnight -tip
    > testruns - a 32-bit system warning in sock_init_data():
    >
    > [ 2.610389] NET: Registered protocol family 16
    > [ 2.616138] initcall netlink_proto_init+0x0/0x170 returned 0 after 7812 usecs
    > [ 2.620010] WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (f642c184)
    > [ 2.624002] 010000000200000000000000604990c000000000000000000000000000000000
    > [ 2.634076] i i i i i i u u i i i i i i i i i i i i i i i i i i i i i i i i
    > [ 2.641038] ^
    > [ 2.643376]
    > [ 2.644004] Pid: 1, comm: swapper Not tainted (2.6.29-rc6-tip-01751-g4d1c22c-dirty #885)
    > [ 2.648003] EIP: 0060:[] EFLAGS: 00010282 CPU: 0
    > [ 2.652008] EIP is at sock_init_data+0xa1/0x190
    > [ 2.656003] EAX: 0001a800 EBX: f6836c00 ECX: 00463000 EDX: c0e46fe0
    > [ 2.660003] ESI: f642c180 EDI: c0b83088 EBP: f6863ed8 ESP: c0c412ec
    > [ 2.664003] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
    > [ 2.668003] CR0: 8005003b CR2: f682c400 CR3: 00b91000 CR4: 000006f0
    > [ 2.672003] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
    > [ 2.676003] DR6: ffff4ff0 DR7: 00000400
    > [ 2.680002] [] __netlink_create+0x35/0xa0
    > [ 2.684002] [] netlink_kernel_create+0x4c/0x140
    > [ 2.688002] [] rtnetlink_net_init+0x1e/0x40
    > [ 2.696002] [] register_pernet_operations+0x11/0x30
    > [ 2.700002] [] register_pernet_subsys+0x1c/0x30
    > [ 2.704002] [] rtnetlink_init+0x4c/0x100
    > [ 2.708002] [] netlink_proto_init+0x159/0x170
    > [ 2.712002] [] do_one_initcall+0x24/0x150
    > [ 2.716002] [] do_initcalls+0x27/0x40
    > [ 2.723201] [] do_basic_setup+0x1c/0x20
    > [ 2.728002] [] kernel_init+0x5a/0xa0
    > [ 2.732002] [] kernel_thread_helper+0x7/0x10
    > [ 2.736002] [] 0xffffffff

    We fix this false positive by annotating the bitfield in struct
    sock.
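
    The annotation, roughly (kmemcheck then treats the whole bitfield as
    one initialized unit instead of flagging the untouched bits):

        kmemcheck_bitfield_begin(flags);
        unsigned char           sk_shutdown  : 2,
                                sk_no_check  : 2,
                                sk_userlocks : 4;
        kmemcheck_bitfield_end(flags);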

    Reported-by: Ingo Molnar
    Signed-off-by: Vegard Nossum

    Vegard Nossum
     

11 Jun, 2009

1 commit

  • One of the problems with sock memory accounting is that it uses
    a pair of sock_hold()/sock_put() calls for each transmitted packet.

    This slows down bidirectional flows because the receive path
    also needs to take a refcount on the socket, and might use a different
    cpu than the transmit path or transmit completion path. So these
    two atomic operations also trigger cache line bounces.

    We can see this in tx or tx/rx workloads (media gateways for example),
    where sock_wfree() can be in top five functions in profiles.

    We use this sock_hold()/sock_put() so that sock freeing
    is delayed until all tx packets are completed.

    As we also update sk_wmem_alloc, we can offset sk_wmem_alloc
    by one unit at init time, until sk_free() is called.
    Once sk_free() is called, we atomic_dec_and_test(sk_wmem_alloc)
    to decrement the initial offset and atomically check if any packets
    are in flight.

    skb_set_owner_w() doesn't call sock_hold() anymore.

    sock_wfree() doesn't call sock_put() anymore, but checks whether
    sk_wmem_alloc has reached 0 to perform the final freeing.

    The drawback is that an skb->truesize error could lead to unfreeable
    sockets, or even worse, to prematurely calling __sk_free() on a live
    socket.

    Nice speedups on SMP: tbench, for example, goes from 2691 MB/s to 2711 MB/s
    on my 8 cpu dev machine, even though tbench was not really hitting the
    sk_refcnt contention point. 5% speedup on a UDP transmit workload (depends
    on the number of flows), lowering TX completion cpu usage.
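
    A paraphrased sketch of the resulting lifetime rule (the real
    sock_wfree() also handles sk_write_space()):

        void sk_free(struct sock *sk)
        {
                /* drop the initial offset; free now only if no tx in flight */
                if (atomic_dec_and_test(&sk->sk_wmem_alloc))
                        __sk_free(sk);
        }

        void sock_wfree(struct sk_buff *skb)
        {
                struct sock *sk = skb->sk;

                /* the last tx completion observes zero and frees the sock */
                if (atomic_sub_and_test(skb->truesize, &sk->sk_wmem_alloc))
                        __sk_free(sk);
        }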

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     


18 Feb, 2009

1 commit

  • A long time ago we had bugs, primarily in TCP, where we would modify
    skb->truesize (for TSO queue collapsing) in ways which would corrupt
    the socket memory accounting.

    skb_truesize_check() was added in order to try and catch this error
    more systematically.

    However this debugging check has morphed into a Frankenstein of sorts
    and these days it does nothing other than catch false-positives.

    Signed-off-by: David S. Miller

    David S. Miller
     

16 Feb, 2009

1 commit

  • The overlap with the old SO_TIMESTAMP[NS] options is handled so
    that time stamping in software (net_enable_timestamp()) is
    enabled when SO_TIMESTAMP[NS] and/or SO_TIMESTAMPING_RX_SOFTWARE
    is set. It's disabled if all of these are off.
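
    A sketch of the rule (flag names per that patch series; the real code
    sits in the setsockopt handlers):

        if (sock_flag(sk, SOCK_RCVTSTAMP) ||            /* SO_TIMESTAMP[NS] */
            sock_flag(sk, SOCK_TIMESTAMPING_RX_SOFTWARE))
                net_enable_timestamp();
        else
                net_disable_timestamp();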

    Signed-off-by: Patrick Ohly
    Signed-off-by: David S. Miller

    Patrick Ohly
     


13 Feb, 2009

1 commit

  • The problem is that in_atomic() will return false inside spinlocks if
    CONFIG_PREEMPT=n. This will lead to deadlockable GFP_KERNEL allocations
    from spinlocked regions.

    Secondly, if CONFIG_PREEMPT=y, this bug solves itself because networking
    will instead use GFP_ATOMIC from this callsite. Hence we won't get the
    might_sleep() debugging warnings which would have informed us of the buggy
    callsites.

    Solve both these problems by switching to in_interrupt(). Now, if someone
    runs a gfp_any() allocation from inside a spinlock we will get the warning
    if CONFIG_PREEMPT=y.

    I reviewed all callsites and most of them were too complex for my little
    brain and none of them documented their interface requirements. I have no
    idea what this patch will do.
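
    After the switch, the helper reads (sketch):

        static inline gfp_t gfp_any(void)
        {
                return in_interrupt() ? GFP_ATOMIC : GFP_KERNEL;
        }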

    Signed-off-by: Andrew Morton
    Signed-off-by: David S. Miller

    Andrew Morton
     

05 Feb, 2009

1 commit

  • The function sock_alloc_send_pskb is completely useless if not
    exported since most of the code in it won't be used as is. In
    fact, this code has already been duplicated in the tun driver.

    Now that we need accounting in the tun driver, we can in fact
    use this function as is. So this patch marks it for export again.
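
    The change itself is essentially the one-line export:

        EXPORT_SYMBOL(sock_alloc_send_pskb);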

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     

26 Nov, 2008

2 commits

  • Instead of using one atomic_t per protocol, use a percpu_counter
    for "orphan_count", to reduce cache line contention on
    heavy duty network servers.
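
    A sketch of the call-site change (orphan_count's type changes from
    atomic_t to struct percpu_counter):

        /* before: one shared atomic_t, a contended cache line on big SMP */
        atomic_inc(sk->sk_prot->orphan_count);

        /* after: per-cpu deltas, folded together only when actually read */
        percpu_counter_inc(sk->sk_prot->orphan_count);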

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Instead of using one atomic_t per protocol, use a percpu_counter
    for "sockets_allocated", to reduce cache line contention on
    heavy duty network servers.

    Note: We revert commit 248969ae31e1b3276fc4399d67ce29a5d81e6fd9
    (net: af_unix can make unix_nr_socks visbile in /proc), since it is
    no longer needed after the sock_prot_inuse_add() addition.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     


17 Nov, 2008

1 commit

  • This is a straightforward patch, using hlist_nulls infrastructure.

    The RCUification of UDP was already done two weeks ago.

    Using hlist_nulls permits us to avoid some memory barriers, both
    at lookup time and delete time.

    The patch is large because it adds new macros to include/net/sock.h.
    These macros will be used by TCP & DCCP in the next patch.
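
    The lookup idiom these macros enable looks roughly like this (sketch;
    match() is a placeholder and 'slot' is the bucket index):

        begin:
        result = NULL;
        sk_nulls_for_each_rcu(sk, node, &head->chain) {
                if (match(sk)) {
                        result = sk;
                        break;
                }
        }
        /* each chain ends in a 'nulls' value naming its slot; a different
         * value means the tail socket moved chains under us: restart */
        if (!result && get_nulls_value(node) != slot)
                goto begin;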

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

14 Nov, 2008

1 commit

  • fix this warning:

    net/bluetooth/af_bluetooth.c:60: warning: ‘bt_key_strings’ defined but not used
    net/bluetooth/af_bluetooth.c:71: warning: ‘bt_slock_key_strings’ defined but not used

    This is a lockdep macro problem in the !LOCKDEP case.

    We cannot convert it to an inline because the macro works on multiple types,
    but we can mark the parameter used.

    [ also clean up a misaligned tab in sock_lock_init_class_and_name() ]

    [ also remove #ifdefs from around af_family_clock_key strings - which
    were certainly added to get rid of the ugly build warnings. ]
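
    The '!LOCKDEP stub marks its arguments used' idiom, along these lines:

        /* illustrative: referencing key/name keeps the string tables 'used' */
        # define lockdep_set_class_and_name(lock, key, name) \
                do { (void)(key); (void)(name); } while (0)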

    Signed-off-by: Ingo Molnar
    Signed-off-by: David S. Miller

    Ingo Molnar
     


30 Oct, 2008

1 commit

  • Corey Minyard found a race added in commit 271b72c7fa82c2c7a795bc16896149933110672d
    (udp: RCU handling for Unicast packets.)

    "If the socket is moved from one list to another list in-between the
    time the hash is calculated and the next field is accessed, and the
    socket has moved to the end of the new list, the traversal will not
    complete properly on the list it should have, since the socket will
    be on the end of the new list and there's not a way to tell it's on a
    new list and restart the list traversal. I think that this can be
    solved by pre-fetching the "next" field (with proper barriers) before
    checking the hash."

    This patch corrects this problem, introducing a new
    sk_for_each_rcu_safenext() macro.
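
    A conceptual sketch of the 'safenext' walk: the next pointer is loaded
    *before* the body examines the entry, so a socket moved to another list
    cannot strand the traversal (macro details paraphrased):

        for (pos = rcu_dereference(head->first);
             pos && ({ next = rcu_dereference(pos->next); 1; });
             pos = next) {
                /* ... hash check on the entry goes here ... */
        }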

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

29 Oct, 2008

3 commits

  • Goals are :

    1) Optimizing handling of incoming Unicast UDP frames, so that no memory
    writes should happen in the fast path.

    Note: Multicasts and broadcasts still will need to take a lock,
    because doing a full lockless lookup in this case is difficult.

    2) No expensive operations in the socket bind/unhash phases :
    - No expensive synchronize_rcu() calls.

    - No rcu_head added to the socket structure, which would increase
    memory needs, but, more importantly, would force us to use call_rcu()
    calls, which have the bad property of making socket structures cold.
    (The rcu grace period between socket freeing and its potential reuse
    makes the socket cold in the CPU cache.)
    David did a previous patch using call_rcu() and noticed a 20%
    impact on TCP connection rates.
    Quoting Christoph Lameter:
    "Right. That results in cacheline cooldown. You'd want to recycle
    the object as they are cache hot on a per cpu basis. That is screwed
    up by the delayed regular rcu processing. We have seen multiple
    regressions due to cacheline cooldown.
    The only choice in cacheline hot sensitive areas is to deal with the
    complexity that comes with SLAB_DESTROY_BY_RCU or give up on RCU."

    - Because udp sockets are allocated from a dedicated kmem_cache,
    use of SLAB_DESTROY_BY_RCU can help here.

    Theory of operation :
    ---------------------

    As the lookup is lockfree (using rcu_read_lock()/rcu_read_unlock()),
    special attention must be taken by readers and writers.

    Use of SLAB_DESTROY_BY_RCU is tricky too, because a socket can be freed,
    reused, and inserted in a different chain or, in the worst case, in the
    same chain, while readers are doing lookups at the same time.

    In order to avoid loops, a reader must check that each socket found in a
    chain really belongs to the chain the reader was traversing. If it finds a
    mismatch, the lookup must start again at the beginning. This *restart* loop
    is the reason we had to use rdlock for the multicast case, because
    we don't want to send the same message several times to the same socket.

    We use RCU only for fast path.
    Thus, /proc/net/udp still takes spinlocks.
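
    A conceptual sketch of the reader-side rules under SLAB_DESTROY_BY_RCU
    (match() is a placeholder; the real lookup also rechecks its keys after
    taking the refcount):

        rcu_read_lock();
        begin:
        sk_for_each_rcu_safenext(sk, node, &chain[slot], next) {
                if (sk->sk_hash != hash || !match(sk))
                        continue;
                if (!atomic_inc_not_zero(&sk->sk_refcnt))
                        goto begin;             /* object is being freed */
                if (sk->sk_hash != hash) {      /* freed and reused elsewhere */
                        sock_put(sk);
                        goto begin;
                }
                break;                          /* genuine match, reference held */
        }
        rcu_read_unlock();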

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • UDP sockets are hashed in a 128 slots hash table.

    This hash table is protected by *one* rwlock.

    This rwlock is readlocked each time an incoming UDP message is handled.

    This rwlock is writelocked each time a socket must be inserted in the
    hash table (bind time) or deleted from this table (close time).

    This is not scalable on SMP machines :

    1) Even in read mode, lock() and unlock() are atomic operations and
    must dirty a contended cache line, shared by all cpus.

    2) A writer might be starved if many readers are 'in flight'. This can
    happen on a machine with some NIC receiving many UDP messages. User
    processes can then be delayed a long time at socket creation/dismantle
    time.

    This patch prepares the RCU migration by introducing 'struct udp_table'
    and 'struct udp_hslot', and using one spinlock per chain, to reduce
    contention on the central rwlock.

    Introducing one spinlock per chain reduces latencies for port
    randomization on heavily loaded UDP servers. This also speeds up
    bindings to specific ports.

    udp_lib_unhash() was uninlined, having become too big.

    Some cleanups were done to ease review of following patch
    (RCUification of UDP Unicast lookups)
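
    A sketch of the new structures described above:

        struct udp_hslot {
                struct hlist_head       head;
                spinlock_t              lock;   /* one lock per chain */
        };

        struct udp_table {
                struct udp_hslot        hash[UDP_HTABLE_SIZE];  /* 128 slots */
        };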

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • ifdef out
    * struct sk_buff::sp (pointer)
    * struct dst_entry::xfrm (pointer)
    * struct sock::sk_policy (2 pointers)
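
    A sketch of the shape of the change (one example; each field lives in
    its own struct, guarded the same way):

        #ifdef CONFIG_XFRM
                struct xfrm_policy      *sk_policy[2];  /* in struct sock */
        #endif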

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: David S. Miller

    Alexey Dobriyan