10 May, 2007

40 commits

  • This finally renames the thread_info field in task structure to stack, so that
    the assumptions about this field are gone and archs have more freedom about
    placing the thread_info structure.

    Non-broken archs which have a proper thread pointer can access both the
    current thread and the task structure via a single pointer.

    It'll allow for a few more cleanups of the fork code, from which e.g. ia64
    could benefit.
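
    Roughly, the change boils down to the following (a hedged sketch, not the
    full patch; the accessor names are the ones generic code is expected to use
    after the rename):

        struct task_struct {
                /* ... */
                void *stack;            /* was: struct thread_info *thread_info */
                /* ... */
        };

        /* Generic code now reaches thread_info only through accessors, so an
         * arch is free to place thread_info wherever it likes. */
        #define task_thread_info(task)  ((struct thread_info *)(task)->stack)
        #define task_stack_page(task)   ((task)->stack)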

    Signed-off-by: Roman Zippel
    [akpm@linux-foundation.org: build fix]
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Russell King
    Cc: Ian Molton
    Cc: Haavard Skinnemoen
    Cc: Mikael Starvik
    Cc: David Howells
    Cc: Yoshinori Sato
    Cc: "Luck, Tony"
    Cc: Hirokazu Takata
    Cc: Geert Uytterhoeven
    Cc: Roman Zippel
    Cc: Greg Ungerer
    Cc: Ralf Baechle
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Paul Mundt
    Cc: Kazumoto Kojima
    Cc: Richard Curnow
    Cc: William Lee Irwin III
    Cc: "David S. Miller"
    Cc: Jeff Dike
    Cc: Paolo 'Blaisorblade' Giarrusso
    Cc: Miles Bader
    Cc: Andi Kleen
    Cc: Chris Zankel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Zippel
     
  • Recently a few direct accesses to the thread_info in the task structure snuck
    back, so this wraps them with the appropriate wrapper.

    Signed-off-by: Roman Zippel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Zippel
     
  • This will later allow an arch to add module-specific information via
    linker-generated tables instead of poking directly into the module object
    structure.

    Signed-off-by: Roman Zippel
    Signed-off-by: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Zippel
     
  • We need to make sure that the clocksources are resumed, when timekeeping is
    resumed. The current resume logic does not guarantee this.

    Add a resume function pointer to the clocksource struct, so clocksource
    drivers which need to reinitialize the clocksource can provide a resume
    function.

    Add a resume function, which calls the clocksource resume functions where
    available and resets the clocksource watchdog, so a stable TSC can be used
    across suspend/resume.
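
    A hedged sketch of how a clocksource driver might hook into this, assuming
    the resume callback introduced here takes no arguments; the driver name and
    callbacks below are hypothetical:

        static cycle_t foo_cs_read(void)
        {
                return 0;                       /* read the hardware counter */
        }

        static void foo_cs_resume(void)
        {
                /* reprogram the hardware counter after suspend, if needed */
        }

        static struct clocksource foo_clocksource = {
                .name   = "foo",
                .read   = foo_cs_read,          /* the usual read callback */
                .resume = foo_cs_resume,        /* new: called on timekeeping resume */
                /* ... rating, mask, mult, shift ... */
        };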

    Signed-off-by: Thomas Gleixner
    Cc: john stultz
    Cc: Andi Kleen
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     
  • Currently the slab allocators contain callbacks into the page allocator to
    perform the draining of pagesets on remote nodes. This requires SLUB to have
    a whole subsystem in order to be compatible with SLAB. Moving node draining
    out of the slab allocators avoids a section of code in SLUB.

    Move the node draining so that it is done when the vm statistics are updated.
    At that point we are already touching all the cachelines with the pagesets of
    a processor.

    Add an expire counter there. If we have to update per-zone or global vm
    statistics then assume that the pageset will require subsequent draining.

    The expire counter will be decremented on each vm stats update pass until it
    reaches zero. Then we will drain one batch from the pageset. The draining
    will cause vm counter updates which will then cause another expiration until
    the pcp is empty. So we will drain a batch every 3 seconds.

    Note that remote node draining is a somewhat esoteric feature that is required
    on large NUMA systems because otherwise significant portions of system memory
    can become trapped in pcp queues. The number of pcps is determined by the
    number of processors and nodes in a system. A system with 4 processors and 2
    nodes has 8 pcps, which is okay. But a system with 1024 processors and 512
    nodes has 512k pcps, with a high potential for a large amount of memory being
    caught in them.
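
    A simplified model of the expire logic described above (a hedged sketch in
    plain C, not the kernel code; names and the constant 3 are illustrative):

        struct pcp_model {
                int count;      /* pages sitting in this remote pageset */
                int batch;      /* how many pages one drain pass removes */
                int expire;     /* vm-stat passes left before we drain */
        };

        /* Called from each vm statistics update pass. */
        static void vmstat_pass(struct pcp_model *p, int stats_were_updated)
        {
                if (stats_were_updated)
                        p->expire = 3;          /* assume more activity follows */
                else if (p->expire && --p->expire == 0 && p->count)
                        p->count -= (p->count < p->batch) ? p->count : p->batch;
                /* draining itself updates counters, re-arming expire until empty */
        }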

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Make the vm statistics update interval configurable. The code in mm now
    makes the vm statistics interval independent from the cache reaper; use that
    opportunity to make it configurable.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • vmstat is currently using the cache reaper to periodically bring the
    statistics up to date. The cache reaper only exists in SLUB as a way to
    provide compatibility with SLAB. This patch removes the vmstat calls from the
    slab allocators and has vmstat provide its own handling.

    The advantage is also that we can use a different frequency for the updates.
    Refreshing vm stats is a pretty fast job so we can run this every second and
    stagger this by only one tick. This will lead to some overlap in large
    systems. E.g. a system running at 250 HZ with 1024 processors will have 4 vm
    updates occurring at once.

    However, the vm stats update only accesses per node information. It is only
    necessary to stagger the vm statistics updates per processor in each node. Vm
    counter updates occurring on distant nodes will not cause cacheline
    contention.

    We could implement an alternate approach that runs the first processor on each
    node at the second and then each of the other processors on a node on a
    subsequent tick. That may be useful to keep a large amount of the second free
    of timer activity. Maybe the timer folks will have some feedback on this one?
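
    The staggering described above is essentially a per-cpu delayed work whose
    initial delay is offset by the cpu number; a hedged sketch, close to but not
    necessarily identical to the patch:

        static DEFINE_PER_CPU(struct delayed_work, vmstat_work);

        static void vmstat_update(struct work_struct *w)
        {
                refresh_cpu_vm_stats(smp_processor_id());
                /* re-arm for the next pass (interval shown as HZ here) */
                schedule_delayed_work(&__get_cpu_var(vmstat_work), HZ);
        }

        static void start_cpu_timer(int cpu)
        {
                struct delayed_work *work = &per_cpu(vmstat_work, cpu);

                INIT_DELAYED_WORK(work, vmstat_update);
                /* stagger: each cpu starts one tick after the previous one */
                schedule_delayed_work_on(cpu, work, HZ + cpu);
        }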

    [jirislaby@gmail.com: add missing break]
    Cc: Arjan van de Ven
    Signed-off-by: Christoph Lameter
    Signed-off-by: Jiri Slaby
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Make the microcode driver use the suspend-related CPU hotplug notifications
    to handle the CPU hotplug events occurring during system-wide suspend and
    resume transitions. Remove the global variable suspend_cpu_hotplug
    previously used for this purpose.

    Signed-off-by: Rafael J. Wysocki
    Cc: Gautham R Shenoy
    Cc: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • Since nonboot CPUs are now disabled after tasks and devices have been
    frozen and the CPU hotplug infrastructure is used for this purpose, we need
    special CPU hotplug notifications that will help the CPU-hotplug-aware
    subsystems distinguish normal CPU hotplug events from CPU hotplug events
    related to a system-wide suspend or resume operation in progress. This
    patch introduces such notifications and causes them to be used during
    suspend and resume transitions. It also changes all of the
    CPU-hotplug-aware subsystems to take these notifications into consideration
    (for now they are handled in the same way as the corresponding "normal"
    ones).
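
    For a subsystem that treats suspend-time events like normal ones, the usual
    pattern is just extra switch cases; a hedged sketch using the *_FROZEN
    constants this patch introduces (the foo_* helpers are hypothetical):

        static int foo_cpu_callback(struct notifier_block *nfb,
                                    unsigned long action, void *hcpu)
        {
                long cpu = (long)hcpu;

                switch (action) {
                case CPU_ONLINE:
                case CPU_ONLINE_FROZEN:         /* coming back during resume */
                        foo_online_cpu(cpu);    /* hypothetical helper */
                        break;
                case CPU_DEAD:
                case CPU_DEAD_FROZEN:           /* taken down during suspend */
                        foo_offline_cpu(cpu);   /* hypothetical helper */
                        break;
                }
                return NOTIFY_OK;
        }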

    [oleg@tv-sign.ru: cleanups]
    Signed-off-by: Rafael J. Wysocki
    Cc: Gautham R Shenoy
    Cc: Pavel Machek
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • Now that all the in-tree users are converted over to zero_user_page(),
    deprecate the old memclear_highpage_flush() call.

    Signed-off-by: Nate Diller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nate Diller
     
  • Use zero_user_page() instead of open-coding it.

    Signed-off-by: Nate Diller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nate Diller
     
  • Use zero_user_page() instead of open-coding it.

    Signed-off-by: Nate Diller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nate Diller
     
  • Use zero_user_page() instead of open-coding it.

    Signed-off-by: Nate Diller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nate Diller
     
  • It's very common for file systems to need to zero part or all of a page;
    the simplest way is just to use kmap_atomic() and memset(). There's
    actually a library function in include/linux/highmem.h that does exactly
    that, but it's confusingly named memclear_highpage_flush(), which is
    descriptive of *how* it does the work rather than what the *purpose* is.
    So this patchset renames the function to zero_user_page(), and calls it
    from the various places that currently open code it.

    This first patch introduces the new function call, and converts all the
    core kernel callsites, both the open-coded ones and the old
    memclear_highpage_flush() ones. Following this patch is a series of
    conversions for each file system individually, per AKPM, and finally a
    patch deprecating the old call. The diffstat below shows the entire
    patchset.
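
    The helper itself is small; roughly the following (a hedged sketch of the
    new call, which at this point still takes a kmap slot argument):

        static inline void zero_user_page(struct page *page, unsigned int offset,
                                          unsigned int size, enum km_type km)
        {
                void *kaddr = kmap_atomic(page, km);

                memset(kaddr + offset, 0, size);
                flush_dcache_page(page);
                kunmap_atomic(kaddr, km);
        }

        /* typical call site, replacing an open-coded kmap/memset/kunmap:
         *      zero_user_page(page, from, to - from, KM_USER0);
         */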

    [akpm@linux-foundation.org: fix a few things]
    Signed-off-by: Nate Diller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nate Diller
     
  • Signed-off-by: Jarek Poplawski
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jarek Poplawski
     
  • Analysis of current linux futex code :
    --------------------------------------

    A central hash table futex_queues[] holds all contexts (futex_q) of waiting
    threads.

    Each futex_wait()/futex_wake() has to obtain a spinlock on a hash slot to
    perform lookups or insert/deletion of a futex_q.

    When a futex_wait() is done, the calling thread has to :

    1) - Obtain a read lock on mmap_sem to be able to validate the user pointer
    (calling find_vma()). This validation tells us if the futex uses
    an inode based store (mapped file), or mm based store (anonymous mem)

    2) - compute a hash key

    3) - Atomic increment of reference counter on an inode or a mm_struct

    4) - lock part of futex_queues[] hash table

    5) - perform the test on value of futex.
    (rollback if value != expected_value, returns EWOULDBLOCK)
    (various loops if test triggers mm faults)

    6) queue the context into hash table, release the lock got in 4)

    7) - release the read_lock on mmap_sem

    8) Eventually unqueue the context (but rarely, as this part may be done
    by the futex_wake())

    Futexes were designed to improve scalability but the current implementation
    has various problems :

    - Central hashtable :

    This means scalability problems if many processes/threads want to use
    futexes at the same time.
    This means NUMA imbalance because this hashtable is located on one node.

    - Using mmap_sem on every futex() syscall :

    Even if mmap_sem is a rw_semaphore, up_read()/down_read() do atomic
    ops on mmap_sem, dirtying its cache line :
    - lots of cache line ping pongs on SMP configurations.

    mmap_sem is also extensively used by mm code (page faults, mmap()/munmap())
    Highly threaded processes might suffer from mmap_sem contention.

    mmap_sem is also used by oprofile code. Enabling oprofile hurts threaded
    programs because of contention on the mmap_sem cache line.

    - Using an atomic_inc()/atomic_dec() on inode ref counter or mm ref counter:
    It's also a cache line ping pong on SMP. It also increases mmap_sem hold time
    because of cache misses.

    Most of these scalability problems come from the fact that futexes are in
    one global namespace. As we use a central hash table, we must make sure
    they are all using the same reference (given by the mm subsystem). We
    chose to force all futexes to be 'shared'. This has a cost.

    But the fact is that POSIX defined PRIVATE and SHARED, allowing clear
    separation, and optimal performance if carefully implemented. The time has
    come for Linux to have better threading performance.

    The goal is to permit new futex commands to avoid :
    - Taking the mmap_sem semaphore, conflicting with other subsystems.
    - Modifying a ref_count on mm or an inode, still conflicting with mm or fs.

    This is possible because, for one process using PTHREAD_PROCESS_PRIVATE
    futexes, we only need to distinguish futexes by their virtual address, no
    matter what the underlying mm storage is.

    If glibc wants to exploit this new infrastructure, it should use new
    _PRIVATE futex subcommands for PTHREAD_PROCESS_PRIVATE futexes. And be
    prepared to fall back on the old subcommands for old kernels. Using one global
    variable with the FUTEX_PRIVATE_FLAG or 0 value should be OK.

    PTHREAD_PROCESS_SHARED futexes should still use the old subcommands.

    Compatibility with old applications is preserved, they still hit the
    scalability problems, but new applications can fly :)

    Note : the same SHARED futex (mapped on a file) can be used by old binaries
    *and* new binaries, because both binaries will use the old subcommands.

    Note : The vast majority of futexes should be using the PROCESS_PRIVATE
    semantic, as this is the default. Almost all applications should benefit
    from these changes (new kernel and updated libc).

    Some bench results on a Pentium M 1.6 GHz (SMP kernel on a UP machine)

    /* calling futex_wait(addr, value) with value != *addr */
    433 cycles per futex(FUTEX_WAIT) call (mixing 2 futexes)
    424 cycles per futex(FUTEX_WAIT) call (using one futex)
    334 cycles per futex(FUTEX_WAIT_PRIVATE) call (mixing 2 futexes)
    334 cycles per futex(FUTEX_WAIT_PRIVATE) call (using one futex)
    For reference :
    187 cycles per getppid() call
    188 cycles per umask() call
    181 cycles per ni_syscall() call
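
    From userspace the new commands are just the old ones with FUTEX_PRIVATE_FLAG
    or'ed in; a minimal, runnable sketch of the fast-failure case timed above
    (the #define fallbacks are only for headers that predate this change):

        #include <stdio.h>
        #include <errno.h>
        #include <unistd.h>
        #include <sys/syscall.h>
        #include <linux/futex.h>

        #ifndef FUTEX_PRIVATE_FLAG
        #define FUTEX_PRIVATE_FLAG 128
        #endif
        #ifndef FUTEX_WAIT_PRIVATE
        #define FUTEX_WAIT_PRIVATE (FUTEX_WAIT | FUTEX_PRIVATE_FLAG)
        #endif

        static int futex_word = 1;

        int main(void)
        {
                /* value (0) != *addr (1), so the kernel returns EWOULDBLOCK
                 * immediately -- this is the path measured in the benchmark. */
                long r = syscall(SYS_futex, &futex_word, FUTEX_WAIT_PRIVATE,
                                 0, NULL, NULL, 0);
                if (r == -1 && errno == EWOULDBLOCK)
                        printf("FUTEX_WAIT_PRIVATE: EWOULDBLOCK, as expected\n");
                return 0;
        }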

    Signed-off-by: Eric Dumazet
    Pierre Peiffer
    Cc: "Ulrich Drepper"
    Cc: "Nick Piggin"
    Cc: "Ingo Molnar"
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     
  • This patch provides the futex_requeue_pi functionality, which allows some
    threads waiting on a normal futex to be requeued on the wait-queue of a
    PI-futex.

    This provides an optimization, already used for (normal) futexes, to be used
    with the PI-futexes.

    This optimization is currently used by the glibc in pthread_broadcast, when
    using "normal" mutexes. With futex_requeue_pi, it can be used with
    PRIO_INHERIT mutexes too.

    Signed-off-by: Pierre Peiffer
    Cc: Ingo Molnar
    Cc: Ulrich Drepper
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pierre Peiffer
     
  • This patch modifies futex_wait() to use an hrtimer + schedule() in place of
    schedule_timeout().

    schedule_timeout() is tick based, therefore the timeout granularity is the
    tick (1 ms, 4 ms or 10 ms depending on HZ). By using a high resolution timer
    for timeout wakeup, we can attain a much finer timeout granularity (in the
    microsecond range). This parallels what is already done for futex_lock_pi().

    The timeout passed to the syscall is no longer converted to jiffies; it is
    passed to do_futex() and futex_wait() as an absolute ktime_t, thereby
    keeping nanosecond resolution.

    Also this removes the need to pass the nanoseconds timeout part to
    futex_lock_pi() in val2.

    In futex_wait(), if there is no timeout then a regular schedule() is
    performed. Otherwise, an hrtimer is fired before schedule() is called.
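
    The shape of the change, as a hedged sketch (not the exact kernel diff, and
    assuming the 2.6.21-era hrtimer API):

        static void futex_wait_sleep(ktime_t *abs_time)
        {
                struct hrtimer_sleeper t;

                set_current_state(TASK_INTERRUPTIBLE);
                if (!abs_time) {
                        schedule();                     /* no timeout at all */
                } else {
                        hrtimer_init(&t.timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
                        hrtimer_init_sleeper(&t, current);
                        hrtimer_start(&t.timer, *abs_time, HRTIMER_MODE_ABS);

                        /* the sleeper callback clears t.task on expiry */
                        if (t.task)
                                schedule();
                        hrtimer_cancel(&t.timer);
                }
                __set_current_state(TASK_RUNNING);
        }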

    [akpm@linux-foundation.org: fix `make headers_check']
    Signed-off-by: Sebastien Dugue
    Signed-off-by: Pierre Peiffer
    Cc: Ingo Molnar
    Cc: Ulrich Drepper
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pierre Peiffer
     
  • Today, all threads waiting for a given futex are woken in FIFO order (first
    waiter woken first) instead of priority order.

    This patch makes use of a plist (priority-ordered list) instead of a simple
    list in futex_hash_bucket.

    All non-RT threads are stored with priority MAX_RT_PRIO, causing them to be
    woken last, in FIFO order (RT-threads are woken first, in priority order).
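
    The enqueueing then looks roughly like this (hedged sketch; q is the
    waiter's futex_q, hb the hash bucket):

        /* RT tasks sort by their priority; everything else collapses to
         * MAX_RT_PRIO, preserving FIFO order among SCHED_OTHER waiters. */
        int prio = min(current->normal_prio, MAX_RT_PRIO);

        plist_node_init(&q->list, prio);
        plist_add(&q->list, &hb->chain);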

    Signed-off-by: Sebastien Dugue
    Signed-off-by: Pierre Peiffer
    Cc: Ingo Molnar
    Cc: Ulrich Drepper
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pierre Peiffer
     
  • Some smarty went and inflicted ktime_t as a typedef upon us, so we cannot
    forward declare it.

    Create a new `union ktime', map ktime_t onto that. Now we need to kill off
    this ktime_t thing.
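
    The point of the union is that headers can now forward-declare it without
    pulling in the whole header (hedged sketch; the struct below is made up):

        union ktime;                            /* forward declaration now possible */

        struct foo_ops {
                /* was impossible to declare with only the 'ktime_t' typedef
                 * in scope, short of #including <linux/ktime.h> */
                long (*wait)(union ktime *abs_timeout);
        };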

    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: john stultz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Stick an unlikely() around is_aio(): I assert that most IO is synchronous.

    Cc: Suparna Bhattacharya
    Cc: Ingo Molnar
    Cc: Benjamin LaHaise
    Cc: Zach Brown
    Cc: Ulrich Drepper
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • When a lookup request arrives, nfsd uses information provided by userspace
    (mountd) to find the right filesystem.

    It then assumes that the same filehandle type as the incoming filehandle can
    be used to create an outgoing filehandle.

    However if mountd is buggy, or maybe just being creative, the filesystem may
    not support that filehandle type, and the kernel could oops, particularly if
    'ex_uuid' is NULL but an FSID_UUID* filehandle type is used.

    So add some proper checking that the fsid version/type from the incoming
    filehandle is actually supportable, and ignore that information if it isn't
    supportable.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • 1/ decode_sattr and decode_sattr3 never return NULL, so remove
    several checks for that. Ditto for xdr_decode_hyper.

    2/ replace some open-coded XDR_QUADLEN calculations with calls to
    XDR_QUADLEN

    3/ in decode_writeargs, simplify an 'if' to use a single
    calculation.
    .page_len is the length of that part of the packet that did
    not fit in the first page (the head).
    So the length of the data part is the remainder of the
    head, plus page_len.

    4/ other minor cleanups.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • kbuild directly interprets -y as objects to build into a module,
    no need to assign it to the old foo-objs variable.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • This while loop has an overly complex condition, which performs a couple of
    assignments. This hurts readability.

    We don't really need a loop at all. We can just return -EAGAIN and (provided
    we set SK_DATA) the function will be called again.

    So discard the loop, make the complex conditional become a few clear function
    calls, and hopefully improve readability.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • If I send an RPC_GSS_PROC_DESTROY message to an NFSv4 server, it replies
    with a bad RPC reply which lacks an authentication verifier. Maybe this
    patch is needed.

    The send/recv packets are as follows:

    send:

    RemoteProcedureCall
    xid
    rpcvers = 2
    prog = 100003
    vers = 4
    proc = 0
    cred = AUTH_GSS
    version = 1
    gss_proc = 3 (RPCSEC_GSS_DESTROY)
    service = 1 (RPC_GSS_SVC_NONE)
    verf = AUTH_GSS
    checksum

    reply:

    RemoteProcedureReply
    xid
    msg_type
    reply_stat
    accepted_reply

    Signed-off-by: Wei Yongjun
    Signed-off-by: "J. Bruce Fields"
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yongjun
     
  • I have been investigating a module reference count leak on the server for
    rpcsec_gss_krb5.ko. It turns out the problem is a reference count leak for
    the security context in net/sunrpc/auth_gss/svcauth_gss.c.

    The problem is that gss_write_init_verf() calls gss_svc_searchbyctx() which
    does a rsc_lookup() but never releases the reference to the context. There is
    another issue: rpc.svcgssd sets an "end of time" expiration for the
    context.

    By adding a cache_put() call in gss_svc_searchbyctx(), and setting an
    expiration timeout in the downcall, cache_clean() does clean up the context
    and the module reference count now goes to zero after unmount.

    I also verified that if the context expires and then the client makes a new
    request, a new context is established.

    Here is the patch to fix the kernel, I will start a separate thread to discuss
    what expiration time should be set by rpc.svcgssd.

    Acked-by: "J. Bruce Fields"
    Signed-off-by: Frank Filz
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Frank Filz
     
  • It's not necessarily correct to assume that the xdr_buf used to hold the
    server's reply must have page data whenever it has tail data.

    And there's no need for us to deal with that case separately anyway.

    Acked-by: "J. Bruce Fields"
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • We need to zero various parts of 'exp' before any 'goto out', otherwise when
    we go to free the contents... we die.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • register_rpc_pipefs() needs to clean up rpc_inode_cache
    by kmem_cache_destroy() on register_filesystem() failure.

    init_sunrpc() needs to unregister rpc_pipe_fs by unregister_rpc_pipefs()
    when rpc_init_mempool() returns error.

    Signed-off-by: Akinobu Mita
    Cc: Neil Brown
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • When the kernel calls svc_reserve to downsize the expected size of an RPC
    reply, it fails to account for the possibility of a checksum at the end of
    the packet. If a client mounts an NFSv2/3 filesystem with sec=krb5i/p and
    does I/O, then you'll generally see messages similar to this in the server's
    ring
    buffer:

    RPC request reserved 164 but used 208

    While I was never able to verify it, I suspect that this problem is also
    the root cause of some oopses I've seen under these conditions:

    https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=227726

    This is probably also a problem for other sec= types and for NFSv4. The
    large reserved size for NFSv4 compound packets seems to generally paper
    over the problem, however.

    This patch adds a wrapper for svc_reserve that accounts for the possibility
    of a checksum. It also fixes up the appropriate callers of svc_reserve to
    call the wrapper. For now, it just uses a hardcoded value that I
    determined via testing. That value may need to be revised upward as things
    change, or we may want to eventually add a new auth_op that attempts to
    calculate this somehow.

    Unfortunately, there doesn't seem to be a good way to reliably determine
    the expected checksum length prior to actually calculating it, particularly
    with schemes like spkm3.
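
    The wrapper amounts to padding the caller's estimate whenever an auth
    flavour that appends a verifier is in use; something like the following
    hedged sketch (field names and the slack constant are approximate):

        static inline void svc_reserve_auth(struct svc_rqst *rqstp, int space)
        {
                int added_space = 0;

                /* leave room for a trailing integrity/privacy checksum */
                if (rqstp->rq_authop->flavour)
                        added_space = RPC_MAX_AUTH_SIZE;
                svc_reserve(rqstp, space + added_space);
        }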

    Signed-off-by: Jeff Layton
    Acked-by: Neil Brown
    Cc: Trond Myklebust
    Acked-by: J. Bruce Fields
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Layton
     
  • Acked-by: Neil Brown
    Cc: Trond Myklebust
    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • Now that sk_defer_lock protects two different things, make the name more
    generic.

    Also don't bother with disabling _bh as the lock is only ever taken from
    process context.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • The NFSv2 and NFSv3 servers do not handle WRITE requests for 0 bytes
    correctly. The specifications indicate that the server should accept the
    request, but it should mostly turn into a no-op. Currently, the server
    will return an XDR decode error, which it should not.

    Attached is a patch which addresses this issue. It also adds some boundary
    checking to ensure that the request contains as much data as was requested
    to be written. It also correctly handles an NFSv3 request which requests
    to write more data than the server has stated that it is prepared to
    handle. Previously, there was some support which looked like it should
    work, but wasn't quite right.

    Signed-off-by: Peter Staubach
    Acked-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Staubach
     
  • nfs4_acl_add_ace() can now be removed.

    Signed-off-by: Adrian Bunk
    Acked-by: Neil Brown
    Acked-by: J. Bruce Fields
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • Thanks to Jarek Poplawski for the ideas and for spotting the bug in the
    initial draft patch.

    cancel_rearming_delayed_work() currently has many limitations, because it
    requires that dwork always re-arms itself via queue_delayed_work(). So it
    hangs forever if dwork doesn't do this, or if cancel_rearming_delayed_work()/
    cancel_delayed_work() was already called. It uses flush_workqueue() in a
    loop, so it can't be used if the workqueue was frozen, and it is potentially
    live-lockable on a busy system if the delay is small.

    With this patch cancel_rearming_delayed_work() doesn't make any assumptions
    about dwork, it can re-arm itself via queue_delayed_work(), or
    queue_work(), or do nothing.

    As a "side effect", cancel_work_sync() was changed to handle re-arming works
    as well.

    Disadvantages:

    - this patch adds wmb() to insert_work().

    - slows down the fast path (when del_timer() succeeds on entry) of
    cancel_rearming_delayed_work(), because wait_on_work() is called
    unconditionally. In that case, compared to the old version, we are
    doing "unneeded" lock/unlock for each online CPU.

    On the other hand, this means we don't need to use cancel_work_sync()
    after cancel_rearming_delayed_work().

    - complicates the code (.text grows by 130 bytes).
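
    A hedged usage sketch of what the relaxed contract buys (the work function
    and the busy predicate below are hypothetical):

        static struct workqueue_struct *foo_wq;         /* created elsewhere */
        static void foo_poll(struct work_struct *w);
        static DECLARE_DELAYED_WORK(foo_work, foo_poll);

        static void foo_poll(struct work_struct *w)
        {
                if (foo_device_busy())                  /* hypothetical */
                        queue_delayed_work(foo_wq, &foo_work, HZ / 10);
                /* ...or just return without re-arming; both are fine now. */
        }

        static void foo_stop(void)
        {
                /* Waits for a running callback and kills any pending re-arm,
                 * whether or not foo_poll() chose to re-queue itself. */
                cancel_rearming_delayed_work(&foo_work);
        }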

    [akpm@linux-foundation.org: fix speling]
    Signed-off-by: Oleg Nesterov
    Cc: David Chinner
    Cc: David Howells
    Cc: Gautham Shenoy
    Acked-by: Jarek Poplawski
    Cc: Srivatsa Vaddagiri
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • We are anyway kthread_stop()ping other per-cpu kernel threads after
    move_task_off_dead_cpu(), so we can do it with the stop_machine_run thread
    as well.

    I just checked with Vatsa if there was any subtle reason why they
    had put in the kthread_bind() in cpu.c. Vatsa cannot seem to recollect
    any and I can't see any. So let us just remove the kthread_bind.

    Signed-off-by: Gautham R Shenoy
    Cc: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gautham R Shenoy
     
  • Currently kernel threads use sigprocmask(SIG_BLOCK) to protect against
    signals. This doesn't prevent signal delivery; it only blocks
    signal_wake_up(). Every "killall -33 kthreadd" means a "struct siginfo"
    leak.

    Change kthreadd_setup() to set all handlers to SIG_IGN instead of blocking
    them (make a new helper ignore_signals() for that). If the kernel thread
    needs some signal, it should use allow_signal() anyway, and in that case it
    should not use CLONE_SIGHAND.

    Note that we can't change daemonize() (should die!) in the same way,
    because it can be used along with CLONE_SIGHAND. This means that
    allow_signal() still should unblock the signal to work correctly with
    daemonize()ed threads.

    However, disallow_signal() doesn't block the signal any longer but ignores
    it.

    NOTE: with or without this patch the kernel threads are not protected from
    handle_stop_signal(), this seems harmless, but not good.
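
    The new helper is roughly the following (a hedged sketch of ignore_signals();
    any locking of the sighand struct is omitted here):

        void ignore_signals(struct task_struct *t)
        {
                int i;

                /* every handler becomes SIG_IGN, so delivery drops the signal
                 * instead of queueing a siginfo nobody will ever dequeue */
                for (i = 0; i < _NSIG; ++i)
                        t->sighand->action[i].sa.sa_handler = SIG_IGN;

                flush_signals(t);
        }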

    Signed-off-by: Oleg Nesterov
    Acked-by: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • worker_thread() inherits ignored SIGCHLD and numa_default_policy() from its
    parent, kthreadd. No need to set this up again.

    Signed-off-by: Oleg Nesterov
    Acked-by: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • allow_signal(SIGCHLD) does all the necessary work; there is no need to call
    do_sigaction() beforehand.

    Signed-off-by: Oleg Nesterov
    Cc: Rusty Russell
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov