23 Dec, 2011

1 commit

  • skb->truesize might be big even for a small packet.

    It's even bigger after commit 87fb4b7b533 (net: more accurate skb
    truesize), especially with a big MTU.

    We should allow queueing at least one packet per receiver, even with a
    low RCVBUF setting.
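
    In sketch form, the relaxed check this implies (assuming the test
    lives in sock_queue_rcv_skb(); the cast is illustrative):

    /* Admit the skb as long as current rmem is below the limit, so an
     * empty queue can always accept one packet, even when skb->truesize
     * alone exceeds sk->sk_rcvbuf. */
    if (atomic_read(&sk->sk_rmem_alloc) >= (unsigned int)sk->sk_rcvbuf)
            return -ENOMEM;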

    Reported-by: Michal Simek
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

14 Oct, 2011

1 commit

  • skb truesize currently accounts for sk_buff struct and part of skb head.
    kmalloc() roundings are also ignored.

    Considering that skb_shared_info is larger than sk_buff, it's time to
    take it into account for better memory accounting.

    This patch introduces SKB_TRUESIZE(X) macro to centralize various
    assumptions into a single place.
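
    In sketch form, the macro bundles the rounding assumptions like this
    (treat as an outline, not the exact hunk):

    /* truesize = payload + aligned sk_buff + aligned skb_shared_info */
    #define SKB_TRUESIZE(X) ((X) +                                       \
                    SKB_DATA_ALIGN(sizeof(struct sk_buff)) +             \
                    SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))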

    At skb alloc phase, we put skb_shared_info struct at the exact end of
    skb head, to allow a better use of memory (lowering number of
    reallocations), since kmalloc() gives us power-of-two memory blocks.

    Unless SLUB/SLAB debugging is active, both skb->head and
    skb_shared_info are aligned to cache lines, as before.

    Note: This patch might trigger performance regressions because of
    misconfigured protocol stacks, hitting per-socket or global memory
    limits that were previously not reached. But it's a necessary step for
    more accurate memory accounting.

    Signed-off-by: Eric Dumazet
    CC: Andi Kleen
    CC: Ben Hutchings
    Signed-off-by: David S. Miller

    Eric Dumazet
     

02 Aug, 2011

1 commit

  • When assigning a NULL value to an RCU protected pointer, no barrier is
    needed. rcu_assign_pointer() used to handle that special case, but
    will soon change to no longer do so.

    Convert all rcu_assign_pointer of NULL value.

    // <smpl>
    @@ expression P; @@

    - rcu_assign_pointer(P, NULL)
    + RCU_INIT_POINTER(P, NULL)

    // </smpl>
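
    A plain C illustration of the conversion (names hypothetical):

    struct foo __rcu *global_ptr;

    static void clear_ptr(void)
    {
            /* Publishing NULL needs no barrier: a reader that sees NULL
             * has nothing to dereference, so there is no ordering to
             * enforce against the pointed-to object. */
            RCU_INIT_POINTER(global_ptr, NULL);
    }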

    Signed-off-by: Stephen Hemminger
    Acked-by: Paul E. McKenney
    Signed-off-by: David S. Miller

    Stephen Hemminger
     

22 Jun, 2011

1 commit

  • This patch adds 2 tracepoints to report the status of a socket's
    receive queue and related parameters.

    One tracepoint is added to sock_queue_rcv_skb. It records the rcvbuf
    size and its usage. The other tracepoint is added to __sk_mem_schedule;
    it records the socket memory limits and the current usage.

    By using these tracepoints, we can know in detail why the kernel
    dropped a packet.
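
    A sketch of where the first tracepoint fires (the placement and the
    trace_sock_rcvqueue_full name are my reading of the patch, not a
    verbatim hunk):

    /* in sock_queue_rcv_skb(): record rcvbuf size and usage on drop */
    if (atomic_read(&sk->sk_rmem_alloc) + skb->truesize >=
        (unsigned int)sk->sk_rcvbuf) {
            trace_sock_rcvqueue_full(sk, skb);
            return -ENOMEM;
    }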

    Signed-off-by: Satoru Moriya
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller

    Satoru Moriya
     

14 Jan, 2011

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (46 commits)
    hwrng: via_rng - Fix memory scribbling on some CPUs
    crypto: padlock - Move padlock.h into include/crypto
    hwrng: via_rng - Fix asm constraints
    crypto: n2 - use __devexit not __exit in n2_unregister_algs
    crypto: mark crypto workqueues CPU_INTENSIVE
    crypto: mv_cesa - dont return PTR_ERR() of wrong pointer
    crypto: ripemd - Set module author and update email address
    crypto: omap-sham - backlog handling fix
    crypto: gf128mul - Remove experimental tag
    crypto: af_alg - fix af_alg memory_allocated data type
    crypto: aesni-intel - Fixed build with binutils 2.16
    crypto: af_alg - Make sure sk_security is initialized on accept()ed sockets
    net: Add missing lockdep class names for af_alg
    include: Install linux/if_alg.h for user-space crypto API
    crypto: omap-aes - checkpatch --file warning fixes
    crypto: omap-aes - initialize aes module once per request
    crypto: omap-aes - unnecessary code removed
    crypto: omap-aes - error handling implementation improved
    crypto: omap-aes - redundant locking is removed
    crypto: omap-aes - DMA initialization fixes for OMAP off mode
    ...

    Linus Torvalds
     

07 Jan, 2011

1 commit

  • Leonardo Chiquitto found that poll() could block forever on TCP
    sockets when urgent data was received, if the event mask only contains
    POLLPRI.

    He did a bisection and found commit 4938d7e0233 (poll: avoid extra
    wakeups in select/poll) was the source of the problem.

    The problem is that TCP sockets use the standard sock_def_readable()
    function for their sk_data_ready() handler, and sock_def_readable()
    doesn't signal POLLPRI.

    Only TCP is affected by the problem. Adding POLLPRI to the list of
    flags might trigger unnecessary schedules, but URGENT handling is such
    a seldom-used feature that this seems a good compromise.
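
    The fix presumably amounts to adding POLLPRI to the wakeup mask in
    sock_def_readable(); a sketch (the wq->wait target is an assumption):

    /* wake poll()ers for urgent data as well as ordinary data */
    wake_up_interruptible_sync_poll(&wq->wait, POLLIN | POLLPRI |
                                    POLLRDNORM | POLLRDBAND);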

    Thanks a lot to Leonardo for providing the bisection result and a test
    program as well.

    Reference : http://www.spinics.net/lists/netdev/msg151793.html

    Reported-and-bisected-by: Leonardo Chiquitto
    Signed-off-by: Eric Dumazet
    Tested-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

17 Dec, 2010

1 commit

  • Special care is taken inside sk_prot_alloc to avoid overwriting
    skc_node/skc_nulls_node. We should also avoid overwriting
    skc_bind_node/skc_portaddr_node.

    The patch fixes the following crash:

    BUG: unable to handle kernel paging request at fffffffffffffff0
    IP: [] udp4_lib_lookup2+0xad/0x370
    [] __udp4_lib_lookup+0x282/0x360
    [] __udp4_lib_rcv+0x31e/0x700
    [] ? ip_local_deliver_finish+0x65/0x190
    [] ? ip_local_deliver+0x88/0xa0
    [] udp_rcv+0x15/0x20
    [] ip_local_deliver_finish+0x65/0x190
    [] ip_local_deliver+0x88/0xa0
    [] ip_rcv_finish+0x32d/0x6f0
    [] ? netif_receive_skb+0x99c/0x11c0
    [] ip_rcv+0x2bb/0x350
    [] netif_receive_skb+0x99c/0x11c0

    Signed-off-by: Leonard Crestez
    Signed-off-by: Octavian Purdila
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Octavian Purdila
     

10 Dec, 2010

1 commit

  • Followup of commit b178bb3dfc30 (net: reorder struct sock fields)

    Optimize INET input path a bit further, by :

    1) moving sk_refcnt close to sk_lock.

    This reduces the number of dirtied cache lines by one on 64-bit arches
    (with 64-byte cache lines).

    2) moving inet_daddr & inet_rcv_saddr at the beginning of sk

    (same cache line as hash / family / bound_dev_if / nulls_node)

    This reduces the number of cache lines accessed in lookups by one, and
    doesn't increase the size of inet and timewait socks.
    inet and tw sockets now share the same place-holder for these fields.

    Before patch :

    offsetof(struct sock, sk_refcnt) = 0x10
    offsetof(struct sock, sk_lock) = 0x40
    offsetof(struct sock, sk_receive_queue) = 0x60
    offsetof(struct inet_sock, inet_daddr) = 0x270
    offsetof(struct inet_sock, inet_rcv_saddr) = 0x274

    After patch :

    offsetof(struct sock, sk_refcnt) = 0x44
    offsetof(struct sock, sk_lock) = 0x48
    offsetof(struct sock, sk_receive_queue) = 0x68
    offsetof(struct inet_sock, inet_daddr) = 0x0
    offsetof(struct inet_sock, inet_rcv_saddr) = 0x4

    compute_score() (udp or tcp) now uses a single cache line per ignored
    item, instead of two.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

11 Nov, 2010

1 commit

  • Robin Holt tried to boot a 16TB machine and found some limits were
    reached : sysctl_tcp_mem[2], sysctl_udp_mem[2]

    We can switch the infrastructure to use "long" instead of "int", now
    that atomic_long_t primitives are available for free.
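
    A minimal sketch of the widened accounting (the field and helper names
    here are illustrative, not the exact patch):

    atomic_long_t memory_allocated;         /* was atomic_t */

    static long charge_pages(long pages)
    {
            /* a long total cannot overflow at 16TB scale on 64-bit */
            return atomic_long_add_return(pages, &memory_allocated);
    }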

    Signed-off-by: Eric Dumazet
    Reported-by: Robin Holt
    Reviewed-by: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: David S. Miller

    Eric Dumazet
     

24 Oct, 2010

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1699 commits)
    bnx2/bnx2x: Unsupported Ethtool operations should return -EINVAL.
    vlan: Calling vlan_hwaccel_do_receive() is always valid.
    tproxy: use the interface primary IP address as a default value for --on-ip
    tproxy: added IPv6 support to the socket match
    cxgb3: function namespace cleanup
    tproxy: added IPv6 support to the TPROXY target
    tproxy: added IPv6 socket lookup function to nf_tproxy_core
    be2net: Changes to use only priority codes allowed by f/w
    tproxy: allow non-local binds of IPv6 sockets if IP_TRANSPARENT is enabled
    tproxy: added tproxy sockopt interface in the IPV6 layer
    tproxy: added udp6_lib_lookup function
    tproxy: added const specifiers to udp lookup functions
    tproxy: split off ipv6 defragmentation to a separate module
    l2tp: small cleanup
    nf_nat: restrict ICMP translation for embedded header
    can: mcp251x: fix generation of error frames
    can: mcp251x: fix endless loop in interrupt handler if CANINTF_MERRF is set
    can-raw: add msg_flags to distinguish local traffic
    9p: client code cleanup
    rds: make local functions/variables static
    ...

    Fix up conflicts in net/core/dev.c, drivers/net/pcmcia/smc91c92_cs.c and
    drivers/net/wireless/ath/ath9k/debug.c as per David

    Linus Torvalds
     

08 Oct, 2010

1 commit

  • > ===================================================
    > [ INFO: suspicious rcu_dereference_check() usage. ]
    > ---------------------------------------------------
    > include/linux/cgroup.h:542 invoked rcu_dereference_check() without protection!
    >
    > other info that might help us debug this:
    >
    >
    > rcu_scheduler_active = 1, debug_locks = 0
    > 1 lock held by swapper/1:
    > #0: (net_mutex){+.+.+.}, at: []
    > register_pernet_subsys+0x1f/0x47
    >
    > stack backtrace:
    > Pid: 1, comm: swapper Not tainted 2.6.35.4-28.fc14.x86_64 #1
    > Call Trace:
    > [] lockdep_rcu_dereference+0xaa/0xb3
    > [] sock_update_classid+0x7c/0xa2
    > [] sk_alloc+0x6b/0x77
    > [] __netlink_create+0x37/0xab
    > [] ? rtnetlink_rcv+0x0/0x2d
    > [] netlink_kernel_create+0x74/0x19d
    > [] ? __mutex_lock_common+0x339/0x35b
    > [] rtnetlink_net_init+0x2e/0x48
    > [] ops_init+0xe9/0xff
    > [] register_pernet_operations+0xab/0x130
    > [] register_pernet_subsys+0x2e/0x47
    > [] rtnetlink_init+0x53/0x102
    > [] netlink_proto_init+0x126/0x143
    > [] ? netlink_proto_init+0x0/0x143
    > [] do_one_initcall+0x72/0x186
    > [] kernel_init+0x23b/0x2c9
    > [] kernel_thread_helper+0x4/0x10
    > [] ? restore_args+0x0/0x30
    > [] ? kernel_init+0x0/0x2c9
    > [] ? kernel_thread_helper+0x0/0x10

    The sock_update_classid() function calls task_cls_classid(current),
    but the calling task cannot go away, so there is no danger of
    the associated structures disappearing. Insert an RCU read-side
    critical section to suppress the false positive.
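
    In sketch form, the sock_update_classid() hunk becomes:

    rcu_read_lock();        /* satisfies the lockdep check; the calling
                             * task itself cannot go away under us */
    classid = task_cls_classid(current);
    rcu_read_unlock();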

    Reported-by: Subrata Modak
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

25 Sep, 2010

1 commit

  • We have for each socket :

    One spinlock (sk_slock.slock)
    One rwlock (sk_callback_lock)

    Possible scenarios are :

    (A) (this is used in net/sunrpc/xprtsock.c)
    read_lock(&sk->sk_callback_lock) (without blocking BH)
    <BH>
    spin_lock(&sk->sk_slock.slock);
    ...
    read_lock(&sk->sk_callback_lock);
    ...

    (B)
    write_lock_bh(&sk->sk_callback_lock)
    stuff
    write_unlock_bh(&sk->sk_callback_lock)

    (C)
    spin_lock_bh(&sk->sk_slock)
    ...
    write_lock_bh(&sk->sk_callback_lock)
    stuff
    write_unlock_bh(&sk->sk_callback_lock)
    spin_unlock_bh(&sk->sk_slock)

    This (C) case conflicts with (A) :

    CPU1 [A]                                CPU2 [C]
    read_lock(callback_lock)
    <BH>                                    spin_lock_bh(slock)
    <wait to spin_lock(slock)>
                                            <wait to write_lock_bh(callback_lock)>

    We have one problematic (C) use case in inet_csk_listen_stop() :

    local_bh_disable();
    bh_lock_sock(child); // spin_lock_bh(&sk->sk_slock)
    WARN_ON(sock_owned_by_user(child));
    ...
    sock_orphan(child); // write_lock_bh(&sk->sk_callback_lock)

    lockdep is not happy with this, as reported by Tetsuo Handa

    It seems the only way to deal with this is to use
    read_lock_bh(callback_lock) everywhere.
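
    In outline, readers of the callback lock now disable BHs (a sketch,
    not the full patch):

    read_lock_bh(&sk->sk_callback_lock);
    /* ... inspect or invoke sk callbacks ... */
    read_unlock_bh(&sk->sk_callback_lock);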

    Thanks to Jarek for pointing out a bug in my first attempt and
    suggesting this solution.

    Reported-by: Tetsuo Handa
    Tested-by: Tetsuo Handa
    Signed-off-by: Eric Dumazet
    CC: Jarek Poplawski
    Tested-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

10 Sep, 2010

1 commit

  • __lock_sock() and __release_sock() release and re-grab the socket lock
    but were missing proper annotations. Add them. This removes the
    following warning from sparse. (Currently __lock_sock() does not emit
    any warning about it, but I think it is better to annotate it as well.)

    net/core/sock.c:1580:17: warning: context imbalance in '__release_sock' - unexpected unlock
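
    The annotations in question look roughly like this (sketch):

    static void __release_sock(struct sock *sk)
            __releases(&sk->sk_lock.slock)
            __acquires(&sk->sk_lock.slock)
    {
            /* drops and re-takes sk_lock.slock internally */
            ...
    }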

    Signed-off-by: Namhyung Kim
    Signed-off-by: David S. Miller

    Namhyung Kim
     

17 Jun, 2010

3 commits

  • AF_UNIX references this, and can be built as a module,
    so...

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Use struct pid and struct cred to store the peer credentials on struct
    sock. This gives enough information to convert the peer credential
    information to a value relative to whatever namespace the socket is in
    at the time.

    This removes nasty surprises when using SO_PEERCRED on socket
    connections where the processes on either side are in different pid
    and user namespaces.
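
    For reference, the userspace view this affects (standard SO_PEERCRED
    usage, shown as a sketch):

    struct ucred cr;
    socklen_t len = sizeof(cr);

    /* peer pid/uid/gid, translated into the caller's namespaces */
    if (getsockopt(fd, SOL_SOCKET, SO_PEERCRED, &cr, &len) == 0)
            ; /* cr.pid, cr.uid, cr.gid are now meaningful here */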

    Signed-off-by: Eric W. Biederman
    Acked-by: Daniel Lezcano
    Acked-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • To keep the coming code clear, and to allow both the sock code and the
    scm code to share the logic, introduce a function to translate from
    struct cred to struct ucred.
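
    A hedged sketch of such a helper (the body, in particular the
    namespace handling, is an assumption rather than the exact patch):

    void cred_to_ucred(struct pid *pid, const struct cred *cred,
                       struct ucred *ucred)
    {
            ucred->pid = pid_vnr(pid);      /* pid in the viewer's ns */
            ucred->uid = ucred->gid = -1;
            if (cred) {
                    ucred->uid = cred->euid;
                    ucred->gid = cred->egid;
            }
    }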

    Signed-off-by: Eric W. Biederman
    Acked-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

27 May, 2010

1 commit

  • This new sock lock primitive was introduced to speed up some
    user-context socket manipulation. But it is unsafe if two threads mix
    them: one using regular lock_sock()/release_sock(), the other using
    lock_sock_bh()/unlock_sock_bh().

    This patch changes lock_sock_bh to be careful against 'owned' state.
    If owned is found to be set, we must take the slow path.
    lock_sock_bh() now returns a boolean to say if the slow path was taken,
    and this boolean is used at unlock_sock_bh time to call the appropriate
    unlock function.

    After this change, BHs may be either disabled or enabled during the
    lock_sock_bh/unlock_sock_bh protected section. This might be
    misleading, so we rename these functions to
    lock_sock_fast()/unlock_sock_fast().
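
    Typical caller pattern after the rename (a sketch):

    bool slow = lock_sock_fast(sk);     /* true if slow path was taken */
    /* ... short user-context manipulation of sk ... */
    unlock_sock_fast(sk, slow);         /* picks the matching unlock */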

    Reported-by: Anton Blanchard
    Signed-off-by: Eric Dumazet
    Tested-by: Anton Blanchard
    Signed-off-by: David S. Miller

    Eric Dumazet
     

24 May, 2010

2 commits

  • This patch makes tun update its socket classid every time we inject a
    packet into the network stack. This is so that any updates made by the
    admin to the process writing packets to tun take effect.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • Up until now cls_cgroup has relied on fetching the classid out of the
    currently executing thread. This runs into trouble when packet
    processing is delayed, in which case it may execute in another
    thread's context.

    Furthermore, even when a packet is not delayed we may fail to
    classify it if soft IRQs have been disabled, because this scenario
    is indistinguishable from one where a packet unrelated to the
    current thread is processed by a real soft IRQ.

    In fact, the current semantics are inherently broken, as a single
    skb may be constructed out of the writes of two different tasks.
    A different manifestation of this problem is when the TCP stack
    transmits in response to an incoming ACK. This is currently
    unclassified.

    As we already have a concept of packet ownership for accounting
    purposes in the skb->sk pointer, this is a natural place to store
    the classid in a persistent manner.

    This patch adds the cls_cgroup classid in struct sock, filling up
    an existing hole on 64-bit :)

    The value is set at socket creation time, so all sockets created via
    socket(2) automatically gain the ID of the thread creating them.
    Whenever another process touches the socket by either reading or
    writing to it, we will change the socket classid to that of the
    process if it has a valid (non-zero) classid.
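
    A sketch of the update helper this describes (close to, but not
    guaranteed to match, the exact patch):

    void sock_update_classid(struct sock *sk)
    {
            u32 classid = task_cls_classid(current);

            /* only overwrite with a valid (non-zero) classid */
            if (classid && classid != sk->sk_classid)
                    sk->sk_classid = classid;
    }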

    For sockets created on inbound connections through accept(2), we
    inherit the classid of the original listening socket through
    sk_clone, possibly preceding the actual accept(2) call.

    In order to minimise risks, I have not made this the authoritative
    classid. For now it is only used as a backup when we execute
    with soft IRQs disabled. Once we're completely happy with its
    semantics we can use it as the sole classid.

    Footnote: I have rearranged the error path on cls_cgroup module
    creation. If we didn't do this, then there is a window where someone
    could create a tc rule using cls_cgroup before the cgroup subsystem
    has been registered.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     

18 May, 2010

1 commit

  • Use the low-order bit of skb->_skb_dst to tell that a dst is not
    refcounted.

    Rename _skb_dst to _skb_refdst to make sure all uses are caught.

    skb_dst() returns the dst whether or not the noref bit is set, but
    with a lockdep check to make sure a noref dst is not handed out when
    the current user is not RCU protected.

    New skb_dst_set_noref() helper to set a non-refcounted dst on a skb
    (with lockdep check).

    skb_dst_drop() drops a reference only if skb dst was refcounted.

    skb_dst_force() helper is used to force a refcount on the dst, when
    the skb is queued and no longer RCU protected.

    Use skb_dst_force() in __sk_add_backlog(), __dev_xmit_skb() if
    !IFF_XMIT_DST_RELEASE or skb enqueued on qdisc queue, in
    sock_queue_rcv_skb(), in __nf_queue().

    Use skb_dst_force() in dev_requeue_skb().
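
    The tagging scheme in a nutshell (a sketch; the in-tree skb_dst()
    also carries the lockdep check described above):

    #define SKB_DST_NOREF   1UL
    #define SKB_DST_PTRMASK ~(SKB_DST_NOREF)

    /* returns the dst whether or not the noref bit is set */
    static inline struct dst_entry *skb_dst(const struct sk_buff *skb)
    {
            return (struct dst_entry *)(skb->_skb_refdst & SKB_DST_PTRMASK);
    }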

    Note: dst_use_noref() still dirties the dst; we might transform it
    later to do one dirtying per jiffy.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

16 May, 2010

1 commit

  • TCP-MD5 sessions have intermittent failures when the route cache is
    invalidated: ip_queue_xmit() has to find a new route and calls
    sk_setup_caps(sk, &rt->u.dst), destroying the

    sk->sk_route_caps &= ~NETIF_F_GSO_MASK

    that MD5 desperately tries to maintain everywhere along its path (from
    tcp_transmit_skb() for example).

    So we send a few bad packets, and everything is fine when
    tcp_transmit_skb() is called again for this socket.

    Since ip_queue_xmit() is at a lower level than TCP-MD5, I chose to use
    a socket field, sk_route_nocaps, containing bits to mask out of
    sk_route_caps.
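
    A sketch of how such a field composes with sk_setup_caps() (helper
    name and body are assumptions based on the description):

    static inline void sk_nocaps_add(struct sock *sk, int flags)
    {
            sk->sk_route_nocaps |= flags;   /* survives rerouting */
            sk->sk_route_caps &= ~flags;    /* applied immediately */
    }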

    Reported-by: Bhaskar Dutta
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

02 May, 2010

1 commit

  • sk_callback_lock rwlock actually protects sk->sk_sleep pointer, so we
    need two atomic operations (and associated dirtying) per incoming
    packet.

    RCU conversion is pretty much needed :

    1) Add a new structure, called "struct socket_wq" to hold all fields
    that will need rcu_read_lock() protection (currently: a
    wait_queue_head_t and a struct fasync_struct pointer).

    [Future patch will add a list anchor for wakeup coalescing]

    2) Attach one of such structure to each "struct socket" created in
    sock_alloc_inode().

    3) Respect RCU grace period when freeing a "struct socket_wq"

    4) Replace the sk_sleep pointer in "struct sock" with sk_wq, a pointer
    to "struct socket_wq"

    5) Change sk_sleep() function to use new sk->sk_wq instead of
    sk->sk_sleep

    6) Change sk_has_sleeper() to wq_has_sleeper(), which must be used
    inside an rcu_read_lock() section.

    7) Change all sk_has_sleeper() callers to (see the sketch after this
    list) :
    - Use rcu_read_lock() instead of read_lock(&sk->sk_callback_lock)
    - Use wq_has_sleeper() to wake up tasks when needed.
    - Use rcu_read_unlock() instead of read_unlock(&sk->sk_callback_lock)

    8) sock_wake_async() is modified to use rcu protection as well.

    9) Exceptions :
    macvtap, drivers/net/tun.c and af_unix use an integrated "struct
    socket_wq" instead of dynamically allocated ones. They don't need RCU
    freeing.
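
    The reader-side pattern from items 6) and 7), sketched:

    struct socket_wq *wq;

    rcu_read_lock();
    wq = rcu_dereference(sk->sk_wq);
    if (wq_has_sleeper(wq))
            wake_up_interruptible(&wq->wait);
    rcu_read_unlock();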

    Some cleanups or followups are probably needed (a possible
    sk_callback_lock conversion to a spinlock, for example...).

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

28 Apr, 2010

1 commit

  • The current socket backlog limit is not enough to really stop DDOS
    attacks, because the user thread spends a long time processing a full
    backlog each round, and a user might spin madly on the socket lock.

    We should add the backlog size and the receive_queue size (aka
    rmem_alloc) to pace writers, and let the user run without being slowed
    down too much.

    Introduce a sk_rcvqueues_full() helper, to avoid taking socket lock in
    stress situations.
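
    A sketch of the helper, per its description (details may differ from
    the merged patch):

    static inline bool sk_rcvqueues_full(const struct sock *sk,
                                         const struct sk_buff *skb)
    {
            unsigned int qsize = sk->sk_backlog.len +
                                 atomic_read(&sk->sk_rmem_alloc);

            /* lets us drop early, without taking the socket lock */
            return qsize + skb->truesize > sk->sk_rcvbuf;
    }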

    Under huge stress from a multiqueue/RPS enabled NIC, a single-flow udp
    receiver can now process ~200,000 pps (instead of ~100 pps before the
    patch) on an 8-core machine.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 Apr, 2010

1 commit

  • Define a new function to return the waitqueue of a "struct sock".

    static inline wait_queue_head_t *sk_sleep(struct sock *sk)
    {
            return sk->sk_sleep;
    }

    Replace all read occurrences of sk_sleep with a call to this function.

    Needed for a future RCU conversion: sk_sleep won't be a directly
    available field.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

13 Apr, 2010

1 commit

  • With the latest CONFIG_PROVE_RCU stuff, I felt comfortable enough to
    make this work.

    sk->sk_dst_cache is currently protected by a rwlock (sk_dst_lock)

    This rwlock is read-locked for a very short time, and dst entries are
    already freed after an RCU grace period. This calls for RCU again :)

    This patch converts sk_dst_lock to a spinlock, and uses RCU for
    readers.

    __sk_dst_get() is supposed to be called with rcu_read_lock() held, or
    with the socket locked by the user, so use the appropriate
    rcu_dereference_check() condition
    (rcu_read_lock_held() || sock_owned_by_user(sk))
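
    Which gives, roughly (a sketch matching the condition above):

    static inline struct dst_entry *__sk_dst_get(struct sock *sk)
    {
            return rcu_dereference_check(sk->sk_dst_cache,
                                         rcu_read_lock_held() ||
                                         sock_owned_by_user(sk));
    }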

    This patch avoids two atomic ops per tx packet on UDP connected sockets,
    for example, and permits sk_dst_lock to be much less dirtied.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
