08 Sep, 2005

19 commits

  • Revert the hack introduced last week.

    Signed-off-by: John Hawkes
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    John Hawkes
     
  • For a NUMA system with multiple CPUs per node, declaring a cpu-exclusive
    cpuset that includes only some, but not all, of the CPUs in a node will mangle
    the sched domain structures.

    Signed-off-by: John Hawkes
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    John Hawkes
     
  • Signed-off-by: John Hawkes
    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    John Hawkes
     
    Now we come to the real motivation for this cpuset mem_exclusive patch
    series, and the patch itself turns out to be trivial.

    This patch keeps a task in or under one mem_exclusive cpuset from provoking an
    oom kill of a task under a non-overlapping mem_exclusive cpuset. Since only
    interrupt and GFP_ATOMIC allocations are allowed to escape mem_exclusive
    containment, there is little to gain from oom killing a task under a
    non-overlapping mem_exclusive cpuset, as almost all kernel and user memory
    allocation must come from disjoint memory nodes.

    This patch enables configuring a system so that a runaway job under one
    mem_exclusive cpuset cannot cause the killing of a job in another such cpuset
    that might be using very high compute and memory resources for a prolonged
    time.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • This patch makes use of the previously underutilized cpuset flag
    'mem_exclusive' to provide what amounts to another layer of memory placement
    resolution. With this patch, there are now the following four layers of
    memory placement available:

    1) The whole system (interrupt and GFP_ATOMIC allocations can use this),
    2) The nearest enclosing mem_exclusive cpuset (GFP_KERNEL allocations can use),
    3) The current task's cpuset (GFP_USER allocations constrained to here), and
    4) Specific node placement, using mbind and set_mempolicy.

    These nest - each layer is a subset (same or within) of the previous.

    Layer (2) above is new, with this patch. The call used to check whether a
    zone (its node, actually) is in a cpuset (in its mems_allowed, actually) is
    extended to take a gfp_mask argument, and its logic is extended, in the case
    that __GFP_HARDWALL is not set in the flag bits, to look up the cpuset
    hierarchy for the nearest enclosing mem_exclusive cpuset, to determine if
    placement is allowed. The definition of GFP_USER, which used to be identical
    to GFP_KERNEL, is changed to also set the __GFP_HARDWALL bit, in the previous
    cpuset_gfp_hardwall_flag patch.

    GFP_ATOMIC and GFP_KERNEL allocations will stay within the current task's
    cpuset, so long as any node therein is not too tight on memory, but will
    escape to the larger layer, if need be.
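
    A rough sketch of that layered check, in the spirit of the description
    above (helper names and details are assumptions and locking is omitted;
    this is not the exact kernel code):

    int cpuset_zone_allowed(struct zone *z, unsigned int gfp_mask)
    {
            int node = z->zone_pgdat->node_id;
            const struct cpuset *cs;

            if (in_interrupt())
                    return 1;               /* layer 1: whole system */
            if (node_isset(node, current->mems_allowed))
                    return 1;               /* layer 3: task's cpuset */
            if (gfp_mask & __GFP_HARDWALL)  /* GFP_USER: hardwalled */
                    return 0;

            /* layer 2: nearest enclosing mem_exclusive ancestor */
            for (cs = current->cpuset; cs && !is_mem_exclusive(cs); cs = cs->parent)
                    ;
            return cs ? node_isset(node, cs->mems_allowed) : 1;
    }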

    The intended use is to allow something like a batch manager to handle several
    jobs, each job in its own cpuset, but using common kernel memory for caches
    and such. Swapper and oom_kill activity is also constrained to Layer (2). A
    task in or below one mem_exclusive cpuset should not cause swapping on nodes
    in another non-overlapping mem_exclusive cpuset, nor provoke oom_killing of a
    task in another such cpuset. Heavy use of kernel memory for i/o caching and
    such by one job should not impact the memory available to jobs in other
    non-overlapping mem_exclusive cpusets.

    This patch enables providing hardwall, inescapable cpusets for memory
    allocations of each job, while sharing kernel memory allocations between
    several jobs, in an enclosing mem_exclusive cpuset.

    Like Dinakar's patch earlier to enable administering sched domains using the
    cpu_exclusive flag, this patch also provides a useful meaning to a cpuset flag
    that had previously done nothing much useful other than restrict what cpuset
    configurations were allowed.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • This patch cleans up the error path of futex_fd() by removing duplicate
    code.

    Signed-off-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     
  • posix_timer_event() first checks that the thread (SIGEV_THREAD_ID case)
    does not have PF_EXITING flag, then it calls send_sigqueue() which locks
    task list. But if the thread exits in between the kernel will oops
    (->sighand == NULL after __exit_sighand).

    This patch moves the PF_EXITING check into the send_sigqueue(), it must be
    done atomically under tasklist_lock. When send_sigqueue() detects an
    exiting thread it returns -1. In that case posix_timer_event() will send
    the signal to the whole thread group.
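
    A hedged sketch of the ordering this implies (simplified; the real
    kernel/signal.c code differs in detail):

    int send_sigqueue(int sig, struct sigqueue *q, struct task_struct *p)
    {
            int ret = -1;

            read_lock(&tasklist_lock);
            /* The exit test must sit inside the same tasklist_lock section
             * that delivers the signal; testing PF_EXITING before taking
             * the lock leaves a window in which the thread exits and
             * ->sighand becomes NULL. */
            if (unlikely(p->flags & PF_EXITING))
                    goto out;   /* caller falls back to the thread group */

            /* ... queue q on the thread and wake it up ... */
            ret = 0;
    out:
            read_unlock(&tasklist_lock);
            return ret;
    }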

    Also, this patch fixes task_struct use-after-free in posix_timer_event.

    Signed-off-by: Oleg Nesterov
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • The patch removes a redundant variable `sig' from sys_prctl().

    For some reason, when sys_prctl is called with option == PR_SET_PDEATHSIG
    then the value of arg2 is assigned to an int variable named sig. Then sig
    is tested with valid_signal() and later used to set the value of
    current->pdeath_signal .

    There is no reason to use this intermediate variable since valid_signal()
    takes an unsigned long argument, so it can handle being passed arg2
    directly, and if the call to valid_signal is OK, then we know the value of
    arg2 is in the range zero to _NSIG and thus it'll easily fit in a plain int
    and thus there's no problem assigning it later to current->pdeath_signal
    (which is an int).
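
    A minimal sketch of the simplified branch (surrounding declarations in
    sys_prctl() assumed):

    case PR_SET_PDEATHSIG:
            if (!valid_signal(arg2)) {
                    error = -EINVAL;
                    break;
            }
            current->pdeath_signal = arg2;
            break;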

    The patch gets rid of the pointless variable `sig'.
    This reduces the size of kernel/sys.o in 2.6.13-rc6-mm1 by 32 bytes on my
    system.

    Patch has been compile tested, boot tested, and just to make damn sure I
    didn't break anything I wrote a quick test app that calls
    prctl(PR_SET_PDEATHSIG ...) with the entire range of values for an
    unsigned long, and it behaves as expected with and without the patch.

    Signed-off-by: Jesper Juhl
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Juhl
     
    There is a problem in the accounting subsystem: the kernel cannot
    correctly handle files larger than 2GB. The output file containing the
    process accounting data can grow very large if the system is large enough
    and active enough. If the 2GB limit is reached, then the system simply
    stops storing process accounting data.

    Another annoying problem is that once the system reaches this 2GB limit,
    then every process which exits will receive a signal, SIGXFSZ. This signal
    is generated because an attempt was made to write beyond the limit for the
    file descriptor. This signal makes it look like every process has exited
    due to a signal, when in fact, they have not.

    The solution is to add the O_LARGEFILE flag to the list of flags used to
    open the accounting file. The rest of the accounting support is already
    largefile safe.
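
    A minimal sketch of the described change, assuming the usual filp_open()
    call made when accounting is switched on; O_LARGEFILE lets writes proceed
    past the 2GB mark instead of failing and raising SIGXFSZ:

    file = filp_open(name, O_WRONLY | O_APPEND | O_LARGEFILE, 0);
    if (IS_ERR(file))
            return PTR_ERR(file);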

    The changes were tested by constructing a large file (just short of 2GB),
    enabling accounting, and then running enough commands to cause the
    accounting data generated to increase the size of the file to 2GB. Without
    the changes, the file grows to 2GB and the last command run in the test
    script appears to exit due to a signal when it has not. With the changes,
    things work as expected and quietly.

    There are some user level changes required so that it can deal with
    largefiles, but those are being handled separately.

    Signed-off-by: Peter Staubach
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Staubach
     
  • This patch simplifies the usage of do_notify_parent_cldstop(), it lessens
    the source and .text size slightly, and makes the code (in my opinion) a
    bit more readable.

    I am sending this patch now because I'm afraid Paul will touch
    do_notify_parent_cldstop() really soon; it's better to clean up first.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • IRQ_PER_CPU is not used by all architectures. This patch introduces the
    macros ARCH_HAS_IRQ_PER_CPU and CHECK_IRQ_PER_CPU() to avoid the generation
    of dead code in __do_IRQ().

    ARCH_HAS_IRQ_PER_CPU is defined by architectures using IRQ_PER_CPU in their
    include/asm-ARCH/irq.h file.
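
    A sketch of what such macros would look like (exact form assumed): with
    ARCH_HAS_IRQ_PER_CPU undefined, the test becomes a constant 0 and the
    compiler can drop the corresponding branch in __do_IRQ():

    #ifdef ARCH_HAS_IRQ_PER_CPU
    # define CHECK_IRQ_PER_CPU(var) ((var) & IRQ_PER_CPU)
    #else
    # define CHECK_IRQ_PER_CPU(var) 0
    #endif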

    Through grepping the tree I found the following architectures currently use
    IRQ_PER_CPU:

    cris, ia64, ppc, ppc64 and parisc.

    Signed-off-by: Karsten Wiese
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Karsten Wiese
     
  • With "-W -Wno-unused -Wno-sign-compare" I get the following compile warning:

    CC kernel/workqueue.o
    kernel/workqueue.c: In function `workqueue_cpu_callback':
    kernel/workqueue.c:504: warning: ordered comparison of pointer with integer zero

    On error create_workqueue_thread() returns NULL, not a negative pointer,
    so the following trivial patch suggests itself.
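
    A sketch of the corrected check (surrounding code in
    workqueue_cpu_callback() assumed):

    p = create_workqueue_thread(wq, hotcpu);
    if (p == NULL) {        /* was: if (p < 0) */
            printk("workqueue thread for cpu %i failed\n", hotcpu);
            return NOTIFY_BAD;
    }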

    Signed-off-by: Mika Kukkonen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mika Kukkonen
     
  • Change the sequence of operations performed during module loading to flush
    the instruction cache before module parameters are processed. If a module
    has parameters of an unusual type that cannot be handled using the standard
    accessor functions param_set_xxx and param_get_xxx, it has to provide a
    set of accessor functions for this type. This requires module code to be
    executed during parameter processing, which is of course only possible
    after the icache has been flushed.

    Signed-off-by: Thomas Koeller
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Koeller
     
  • Christoph Lameter

    When using a time interpolator that is susceptible to jitter there's
    potentially contention over a cmpxchg used to prevent time from going
    backwards. This is unnecessary when the caller holds the xtime write
    seqlock as all readers will be blocked from returning until the write is
    complete. We can therefore allow writers to insert a new value and exit
    rather than fight with CPUs that only hold a reader lock.

    Signed-off-by: Alex Williamson
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Williamson
     
    The attached patch prevents oopses from interleaving with characters from
    other printks on other CPUs by only breaking the lock if the oops is
    happening on the CPU holding the lock.

    It might be better if the oops generator got the lock and then called an
    inner vprintk routine that assumed the caller holds the lock, thus
    making oops reports "atomic".
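
    A hedged sketch of the condition this describes (simplified from the
    printk() path; variable and helper names are assumptions):

    if (unlikely(oops_in_progress) && printk_cpu == smp_processor_id()) {
            /* Recursion on the CPU that already holds the locks: forcibly
             * reinitialise them so the oops can get out.  Oopses on other
             * CPUs wait for the lock as usual, so their output no longer
             * interleaves with someone else's printk. */
            zap_locks();
    }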

    Signed-Off-By: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • This patch adds a new kernel debug feature: CONFIG_DETECT_SOFTLOCKUP.

    When enabled, per-CPU watchdog threads are started, which try to run once
    per second. If they get delayed for more than 10 seconds, a callback from
    the timer interrupt detects this condition and prints out a warning
    message and a stack dump (once per lockup incident). The feature is
    otherwise non-intrusive; it doesn't try to unlock the box in any way, it
    only gets the debug info out, automatically, and on all CPUs affected by
    the lockup.
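
    A rough sketch of the mechanism (per-CPU details and the once-per-incident
    throttling omitted; names are close to, but not necessarily, the real
    code):

    static DEFINE_PER_CPU(unsigned long, touch_timestamp);

    /* body of the per-CPU watchdog thread */
    static int watchdog(void *unused)
    {
            for (;;) {
                    __get_cpu_var(touch_timestamp) = jiffies;
                    msleep_interruptible(1000);
            }
            return 0;
    }

    /* called from the timer interrupt */
    void softlockup_tick(void)
    {
            unsigned long ts = __get_cpu_var(touch_timestamp);

            if (time_after(jiffies, ts + 10 * HZ)) {
                    printk(KERN_ERR "BUG: soft lockup detected on CPU#%d!\n",
                           smp_processor_id());
                    dump_stack();
            }
    }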

    Signed-off-by: Ingo Molnar
    Signed-off-by: Nishanth Aravamudan
    Signed-Off-By: Matthias Urlichs
    Signed-off-by: Richard Purdie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • ATM pthread_cond_signal is unnecessarily slow, because it wakes one waiter
    (which at least on UP usually means an immediate context switch to one of
    the waiter threads). This waiter wakes up and after a few instructions it
    attempts to acquire the cv internal lock, but that lock is still held by
    the thread calling pthread_cond_signal. So it goes to sleep and eventually
    the signalling thread is scheduled in, unlocks the internal lock and wakes
    the waiter again.

    Now, before 2003-09-21 NPTL was using FUTEX_REQUEUE in pthread_cond_signal
    to avoid this performance issue, but it was removed when locks were
    redesigned to the 3 state scheme (unlocked, locked uncontended, locked
    contended).

    The following scenario shows why simply using FUTEX_REQUEUE in
    pthread_cond_signal together with using lll_mutex_unlock_force in place of
    lll_mutex_unlock is not enough and probably why it has been disabled at
    that time:

    The number is the value in cv->__data.__lock (three threads, thr1-thr3;
    the original three-column layout is reconstructed here with per-line
    thread labels):

    thr1:  0  pthread_cond_wait
    thr1:  1  lll_mutex_lock (cv->__data.__lock)
    thr1:  0  lll_mutex_unlock (cv->__data.__lock)
    thr1:  0  lll_futex_wait (&cv->__data.__futex, futexval)
    thr2:  0  pthread_cond_signal
    thr2:  1  lll_mutex_lock (cv->__data.__lock)
    thr3:  1  pthread_cond_signal
    thr3:  2  lll_mutex_lock (cv->__data.__lock)
    thr3:  2  lll_futex_wait (&cv->__data.__lock, 2)
    thr2:  2  lll_futex_requeue (&cv->__data.__futex, 0, 1, &cv->__data.__lock)
              # FUTEX_REQUEUE, not FUTEX_CMP_REQUEUE
    thr2:  2  lll_mutex_unlock_force (cv->__data.__lock)
    thr2:  0    cv->__data.__lock = 0
    thr2:  0    lll_futex_wake (&cv->__data.__lock, 1)
    thr3:  1  lll_mutex_lock (cv->__data.__lock)
    thr3:  0  lll_mutex_unlock (cv->__data.__lock)
              # Here, lll_mutex_unlock doesn't know there are threads waiting
              # on the internal cv's lock

    Now, I believe it is possible to use FUTEX_REQUEUE in pthread_cond_signal,
    but it will cost us not one, but 2 extra syscalls and, what's worse, one of
    these extra syscalls will be done for every single waiting loop in
    pthread_cond_*wait.

    We would need to use lll_mutex_unlock_force in pthread_cond_signal after
    requeue and lll_mutex_cond_lock in pthread_cond_*wait after lll_futex_wait.

    Another alternative is to do the unlocking pthread_cond_signal needs to do
    (the lock can't be unlocked before lll_futex_wake, as that is racy) in the
    kernel.

    I have implemented both variants, futex-requeue-glibc.patch is the first
    one and futex-wake_op{,-glibc}.patch is the unlocking inside of the kernel.
    The kernel interface allows userland to specify how exactly an unlocking
    operation should look like (some atomic arithmetic operation with optional
    constant argument and comparison of the previous futex value with another
    constant).
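
    An illustrative (hedged) userland call using the new operation: wake one
    waiter on the condvar futex, atomically set the internal lock to 0, and,
    if its old value showed contention (> 1), also wake one waiter on the
    lock. The constants come from the futex header added by this series; the
    actual glibc integration differs in detail:

    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #ifndef FUTEX_WAKE_OP
    # define FUTEX_WAKE_OP 5
    #endif

    static long cond_signal_wake_op(int *cond_futex, int *cond_lock)
    {
            return syscall(SYS_futex, cond_futex, FUTEX_WAKE_OP,
                           1,    /* wake up to 1 waiter on cond_futex */
                           1,    /* wake up to 1 waiter on cond_lock if cmp true */
                           cond_lock,
                           FUTEX_OP(FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1));
    }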

    It has been implemented just for ppc*, x86_64 and i?86; for other
    architectures I'm including just a stub header which can be used as a
    starting point by maintainers to write support for their arches and ATM
    will just return -ENOSYS for FUTEX_WAKE_OP. The requeue patch has been
    (lightly) tested just on x86_64, the wake_op patch on ppc64 kernel running
    32-bit and 64-bit NPTL and x86_64 kernel running 32-bit and 64-bit NPTL.

    With the following benchmark on UP x86-64 I get:

    for i in nptl-orig nptl-requeue nptl-wake_op; do echo time elf/ld.so --library-path .:$i /tmp/bench; \
      for j in 1 2; do ( time elf/ld.so --library-path .:$i /tmp/bench ) 2>&1; done; done
    time elf/ld.so --library-path .:nptl-orig /tmp/bench
    real 0m0.655s user 0m0.253s sys 0m0.403s
    real 0m0.657s user 0m0.269s sys 0m0.388s
    time elf/ld.so --library-path .:nptl-requeue /tmp/bench
    real 0m0.496s user 0m0.225s sys 0m0.271s
    real 0m0.531s user 0m0.242s sys 0m0.288s
    time elf/ld.so --library-path .:nptl-wake_op /tmp/bench
    real 0m0.380s user 0m0.176s sys 0m0.204s
    real 0m0.382s user 0m0.175s sys 0m0.207s

    The benchmark is at:
    http://sourceware.org/ml/libc-alpha/2005-03/txt00001.txt
    Older futex-requeue-glibc.patch version is at:
    http://sourceware.org/ml/libc-alpha/2005-03/txt00002.txt
    Older futex-wake_op-glibc.patch version is at:
    http://sourceware.org/ml/libc-alpha/2005-03/txt00003.txt
    Will post a new version (just x86-64 fixes so that the patch
    applies against pthread_cond_signal.S) to libc-hacker ml soon.

    Attached is the kernel FUTEX_WAKE_OP patch as well as a simple-minded
    testcase that will not test the atomicity of the operation, but at least
    check if the threads that should have been woken up are woken up and
    whether the arithmetic operation in the kernel gave the expected results.

    Acked-by: Ingo Molnar
    Cc: Ulrich Drepper
    Cc: Jamie Lokier
    Cc: Rusty Russell
    Signed-off-by: Yoichi Yuasa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jakub Jelinek
     
  • This updates documentation a bit (mostly removing obsolete stuff), and
    marks swsusp as no longer experimental in config.

    Signed-off-by: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Machek
     
    When handling writes to /proc/irq, the current code reprograms RTE
    entries directly. This is not recommended and could potentially cause
    chipsets to lock up, or cause missing interrupts.

    CONFIG_IRQ_BALANCE does this correctly, where it re-programs only when the
    interrupt is pending. The same needs to be done for /proc/irq handling as well.
    Otherwise user space irq balancers are really not doing the right thing.

    - Changed pending_irq_balance_cpumask to pending_irq_migrate_cpumask for
      lack of a generic name.
    - Added move_irq out of IRQ_BALANCE, and added the same to X86_64.
    - Added a new proc handler for write, so we can do deferred write at irq
      handling time.
    - Display of /proc/irq/XX/smp_affinity used to display CPU_MASKALL;
      instead it now shows only active cpu masks, or exactly what was set.
    - Provided a common move_irq implementation, instead of duplicating it
      when using the generic irq framework.
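
    A rough sketch of the deferred-write idea described above (a hypothetical
    simplification, not the actual implementation):

    static cpumask_t pending_irq_migrate_cpumask[NR_IRQS];
    static DECLARE_BITMAP(pending_irq_dirty, NR_IRQS);

    /* called from the /proc/irq/XX/smp_affinity write handler */
    static void set_pending_irq(unsigned int irq, cpumask_t mask)
    {
            pending_irq_migrate_cpumask[irq] = mask;
            set_bit(irq, pending_irq_dirty);
    }

    /* called from the irq handling path, where reprogramming is safe */
    void move_irq(unsigned int irq)
    {
            if (test_and_clear_bit(irq, pending_irq_dirty))
                    irq_desc[irq].handler->set_affinity(irq,
                                    pending_irq_migrate_cpumask[irq]);
    }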

    Tested on i386/x86_64 and ia64 with CONFIG_PCI_MSI turned on and off.
    Tested UP builds as well.

    MSI testing: tbd: I have cards, need to look for an x-over cable, although
    I did test an earlier version of this patch. Will test in a couple of days.

    Signed-off-by: Ashok Raj
    Acked-by: Zwane Mwaikambo
    Grudgingly-acked-by: Andi Kleen
    Signed-off-by: Coywolf Qi Hunt
    Signed-off-by: Ashok Raj
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ashok Raj
     

05 Sep, 2005

9 commits

    Jeff Dike, Paolo 'Blaisorblade' Giarrusso, Bodo Stroesser

    Adds a new ptrace(2) mode, called PTRACE_SYSEMU, resembling PTRACE_SYSCALL
    except that the kernel does not execute the requested syscall; this is useful
    to improve performance for virtual environments, like UML, which want to run
    the syscall on their own.

    In fact, using PTRACE_SYSCALL means stopping child execution twice, on entry
    and on exit, and each time you also have two context switches; with SYSEMU you
    avoid the 2nd stop and so save two context switches per syscall.

    Also, some architectures don't have support in the host for changing the
    syscall number via ptrace(), which is currently needed to skip syscall
    execution (UML turns any syscall into getpid() to avoid it being executed on
    the host). Fixing that is hard, while SYSEMU is easier to implement.
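
    A hedged illustration of how a tracer such as UML might use the new
    request (i386-flavoured; error handling trimmed, and the PTRACE_SYSEMU
    value is assumed where the header lacks it):

    #include <stdio.h>
    #include <sys/ptrace.h>
    #include <sys/types.h>
    #include <sys/user.h>
    #include <sys/wait.h>

    #ifndef PTRACE_SYSEMU
    # define PTRACE_SYSEMU 31
    #endif

    void trace_loop(pid_t child)
    {
            int status;
            struct user_regs_struct regs;

            for (;;) {
                    /* One stop per syscall: the child halts at syscall entry
                     * and the kernel does not execute the syscall itself. */
                    if (ptrace(PTRACE_SYSEMU, child, 0, 0) < 0)
                            break;
                    if (waitpid(child, &status, 0) < 0 || WIFEXITED(status))
                            break;

                    ptrace(PTRACE_GETREGS, child, 0, &regs);
                    printf("intercepted syscall %ld\n", (long)regs.orig_eax);
                    /* ... emulate the syscall, put the result in regs.eax ... */
                    ptrace(PTRACE_SETREGS, child, 0, &regs);
            }
    }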

    * This version of the patch includes some suggestions of Jeff Dike to avoid
    adding any instructions to the syscall fast path, plus some other little
    changes, by myself, to make it work even when the syscall is executed with
    SYSENTER (but I'm unsure about them). It has been widely tested for quite a
    lot of time.

    * Various fixes were included to handle the various switches between
    various states, i.e. when for instance a syscall entry is traced with one of
    PT_SYSCALL / _SYSEMU / _SINGLESTEP and another one is used on exit.
    Basically, this is done by remembering which one of them was used even after
    the call to ptrace_notify().

    * We're combining TIF_SYSCALL_EMU with TIF_SYSCALL_TRACE or TIF_SINGLESTEP
    to make do_syscall_trace() notice that the current syscall was started with
    SYSEMU on entry, so that no notification ought to be done in the exit path;
    this is a bit of a hack, so this problem is solved in another way in the
    next patches.

    * Also, the effects of the patch:
    "Ptrace - i386: fix Syscall Audit interaction with singlestep"
    are cancelled; they are restored back in the last patch of this series.

    Detailed descriptions of the patches doing this kind of processing follow (but
    I've already summed everything up).

    * Fix behaviour when changing interception kind #1.

    In do_syscall_trace(), we check the status of the TIF_SYSCALL_EMU flag
    only after doing the debugger notification; but the debugger might have
    changed the status of this flag because it continued execution with
    PTRACE_SYSCALL, so this is wrong. This patch fixes it by saving the flag
    status before calling ptrace_notify().

    * Fix behaviour when changing interception kind #2:
    avoid intercepting syscall on return when using SYSCALL again.

    A guest process switching from using PTRACE_SYSEMU to PTRACE_SYSCALL
    crashes.

    The problem is in arch/i386/kernel/entry.S. The current SYSEMU patch
    inhibits the syscall handler from being called, but does not prevent
    do_syscall_trace() from being called after this for syscall completion
    interception.

    The appended patch fixes this. It reuses the flag TIF_SYSCALL_EMU to
    remember "we come from PTRACE_SYSEMU and now are in PTRACE_SYSCALL", since
    the flag is unused in the depicted situation.

    * Fix behaviour when changing interception kind #3:
    avoid intercepting syscall on return when using SINGLESTEP.

    When testing 2.6.9 with the skas3.v6 patch and my latest patch, I had
    problems with singlestepping on UML in SKAS with SYSEMU. It looped,
    receiving SIGTRAPs without moving forward. The EIP of the traced process
    was the same for all SIGTRAPs.

    What's missing is to handle switching from PTRACE_SYSCALL_EMU to
    PTRACE_SINGLESTEP in a way very similar to what is done for the change from
    PTRACE_SYSCALL_EMU to PTRACE_SYSCALL_TRACE.

    I.e., after calling ptrace(PTRACE_SYSEMU), on the return path, the debugger is
    notified and then wakes up the process; the syscall is executed (or skipped,
    when do_syscall_trace() returns 0, i.e. when using PTRACE_SYSEMU), and
    do_syscall_trace() is called again. Since we are on the return path of a
    SYSEMU'd syscall, if the wake up is performed through ptrace(PTRACE_SYSCALL),
    we must still avoid notifying the parent of the syscall exit. Now, this
    behaviour is extended even to resuming with PTRACE_SINGLESTEP.

    Signed-off-by: Paolo 'Blaisorblade' Giarrusso
    Cc: Jeff Dike
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laurent Vivier
     
  • Clean code up a bit, and only show suspend to disk as available when
    it is configured in.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Machek
     
    If process freezing fails, some processes are frozen and the rest are left
    in the "were asked to be frozen" state. That's wrong; we should leave them
    in some consistent state.

    Signed-off-by: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Machek
     
  • Drop printing during normal boot (when no image exists in swap), print
    message when drivers fail, fix error paths and consolidate near-identical
    functions in disk.c (and functions with just one statement).

    Signed-off-by: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Machek
     
  • It is trying to protect swsusp_resume_device and software_resume() from two
    users banging it from userspace at the same time.

    Signed-off-by: Shaohua Li
    Signed-off-by: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • The function calc_nr uses an iterative algorithm to calculate the number of
    pages needed for the image and the pagedir. Exactly the same result can be
    obtained with a one-line expression.

    Note that this was even proved correct ;-).
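
    For reference, a hedged reconstruction of that expression: if each pagedir
    page holds PBES_PER_PAGE entries and the pagedir pages themselves must
    also be counted, the smallest consistent total is nr_copy plus
    ceil(nr_copy / (PBES_PER_PAGE - 1)) pagedir pages, i.e. something like:

    static int calc_nr(int nr_copy)
    {
            return nr_copy +
                   (nr_copy + PBES_PER_PAGE - 2) / (PBES_PER_PAGE - 1);
    }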

    Signed-off-by: Michal Schmidt
    Signed-off-by: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Schmidt
     
    The patch protects against leaking sensitive data after resume from
    suspend. During suspend a temporary key is created and this key is used to
    encrypt the data written to disk. When, during resume, the data has been
    read back into memory, the temporary key is destroyed, which simply means
    that all data written to disk during suspend are then inaccessible so they
    can't be stolen later on.

    Think of the following: you suspend while an application is running that keeps
    sensitive data in memory. The application itself prevents the data from being
    swapped out. Suspend, however, must write these data to swap to be able to
    resume later on. Without suspend encryption your sensitive data are then
    stored in plaintext on disk. This means that after resume your sensitive data
    are accessible to all applications having direct access to the swap device
    which was used for suspend. If you don't need swap after resume these data
    can remain on disk virtually forever. Thus it can happen that your system
    gets broken into weeks later and sensitive data which you thought were encrypted
    and protected are retrieved and stolen from the swap device.

    Signed-off-by: Andreas Steinmetz
    Acked-by: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andreas Steinmetz
     
  • This should make refrigerator sleep properly, not busywait after the first
    schedule() returns.
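
    The general pattern of the fix (a sketch, not the exact swsusp code): the
    sleeping state has to be re-set on every iteration, before the condition
    is rechecked, otherwise the first wakeup leaves the task runnable and
    every later schedule() returns immediately:

    for (;;) {
            set_current_state(TASK_UNINTERRUPTIBLE);
            if (!still_frozen(current))     /* hypothetical predicate */
                    break;
            schedule();
    }
    __set_current_state(TASK_RUNNING);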

    Signed-off-by: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Machek
     
  • Aha, swsusp dips into swap_info[], better update it to swap_lock. It's
    bitflipping flags with 0xFF, so get_swap_page will allocate from only the one
    chosen device: let's change that to flip SWP_WRITEOK.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

30 Aug, 2005

3 commits


27 Aug, 2005

2 commits

  • At the suggestion of Nick Piggin and Dinakar, totally disable
    the facility to allow cpu_exclusive cpusets to define dynamic
    sched domains in Linux 2.6.13, in order to avoid problems
    first reported by John Hawkes (corrupt sched data structures
    and kernel oops).

    This has been built for ppc64, i386, ia64, x86_64, sparc, alpha.
    It has been built, booted and tested for cpuset functionality
    on an SN2 (ia64).

    Dinakar or Nick - could you verify that it does indeed avoid the problems
    Hawkes reported? Hawkes is out of town, and I don't have the recipe to
    reproduce what he found.

    Signed-off-by: Paul Jackson
    Acked-by: Nick Piggin
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • The partial disabling of Dinakar's new facility to allow
    cpu_exclusive cpusets to define dynamic sched domains
    doesn't go far enough. At the suggestion of Nick Piggin
    and Dinakar, let us instead totally disable this facility
    for 2.6.13, in order to avoid problems first reported
    by John Hawkes (corrupt sched data structures and kernel oops).

    This patch removes the partial disabling code in 2.6.13-rc7,
    in anticipation of the next patch, which will totally disable
    it instead.

    Signed-off-by: Paul Jackson
    Signed-off-by: Linus Torvalds

    Paul Jackson
     

25 Aug, 2005

1 commit

    As reported by Paul Mackerras, the previous patch
    "cpu_exclusive sched domains fix" broke the ppc64 build with
    CONFIG_CPUSETS, yielding error messages:

    kernel/cpuset.c: In function 'update_cpu_domains':
    kernel/cpuset.c:648: error: invalid lvalue in unary '&'
    kernel/cpuset.c:648: error: invalid lvalue in unary '&'

    On some arches, node_to_cpumask() is a function returning a cpumask_t,
    but for_each_cpu_mask() requires an lvalue mask.

    The following patch fixes this build failure by making a copy
    of the cpumask_t on the stack.
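
    A sketch of the described fix (surrounding update_cpu_domains() code
    assumed):

    cpumask_t nodemask = node_to_cpumask(node);   /* stack copy, an lvalue */
    int cpu;

    for_each_cpu_mask(cpu, nodemask) {
            /* loop body unchanged */
    }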

    Signed-off-by: Paul Jackson
    Signed-off-by: Linus Torvalds

    Paul Jackson
     

24 Aug, 2005

2 commits

  • This keeps the kernel/cpuset.c routine update_cpu_domains() from
    invoking the sched.c routine partition_sched_domains() if the cpuset in
    question doesn't fall on node boundaries.

    I have boot tested this on an SN2, and with the help of a couple of ad
    hoc printk's, determined that it does indeed avoid calling the
    partition_sched_domains() routine on partial nodes.

    I did not directly verify that this avoids setting up bogus sched
    domains or avoids the oops that Hawkes saw.

    This patch imposes a silent artificial constraint on which cpusets can
    be used to define dynamic sched domains.

    This patch should allow proceeding with this new feature in 2.6.13 for
    the configurations in which it is useful (node aligned sched domains)
    while avoiding trying to setup sched domains in the less useful cases
    that can cause the kernel corruption and oops.

    Signed-off-by: Paul Jackson
    Acked-by: Ingo Molnar
    Acked-by: Dinakar Guniguntala
    Acked-by: John Hawkes
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • With CONFIG_PREEMPT && !CONFIG_SMP, it's possible for sys_getppid to
    return a bogus value if the parent's task_struct gets reallocated after
    current->group_leader->real_parent is read:

    asmlinkage long sys_getppid(void)
    {
            int pid;
            struct task_struct *me = current;
            struct task_struct *parent;

            parent = me->group_leader->real_parent;
    RACE HERE => for (;;) {
                    pid = parent->tgid;
    #ifdef CONFIG_SMP
                    {
                            struct task_struct *old = parent;

                            /*
                             * Make sure we read the pid before re-reading the
                             * parent pointer:
                             */
                            smp_rmb();
                            parent = me->group_leader->real_parent;
                            if (old != parent)
                                    continue;
                    }
    #endif
                    break;
            }
            return pid;
    }

    If the process gets preempted at the indicated point, the parent process
    can go ahead and call exit() and then get wait()'d on to reap its
    task_struct. When the preempted process gets resumed, it will not do any
    further checks of the parent pointer on !CONFIG_SMP: it will read the
    bad pid and return.

    So, the same algorithm used when SMP is enabled should be used when
    preempt is enabled, which will recheck ->real_parent in this case.
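
    A hedged sketch of the change, applied to the loop quoted above: widen the
    #ifdef so preemptible UP kernels take the same re-check path as SMP:

    #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)
            {
                    struct task_struct *old = parent;

                    smp_rmb();
                    parent = me->group_leader->real_parent;
                    if (old != parent)
                            continue;
            }
    #endif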

    Signed-off-by: David Meybohm
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Meybohm
     

19 Aug, 2005

1 commit


18 Aug, 2005

1 commit

  • This bug is quite subtle and only happens in a very interesting
    situation where a real-time threaded process is in the middle of a
    coredump when someone whacks it with a SIGKILL. However, this deadlock
    leaves the system pretty hosed and you have to reboot to recover.

    Not good for real-time priority-preemption applications like our
    telephony application, with 90+ real-time (SCHED_FIFO and SCHED_RR)
    processes, many of them multi-threaded, interacting with each other for
    high volume call processing.

    Acked-by: Roland McGrath
    Signed-off-by: Linus Torvalds

    Bhavesh P. Davda
     

11 Aug, 2005

1 commit

    We have a check in there to make sure that the name won't overflow
    task_struct.comm[], but it's triggering for scsi with lots of HBAs, only
    scsi is using single-threaded workqueues which don't append the "/%d"
    anyway.

    All too hard. Just kill the BUG_ON.

    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton

    [ kthread_create() uses vsnprintf() and limits the thing, so no
    actual overflow can happen regardless ]

    Signed-off-by: Linus Torvalds

    James Bottomley
     

10 Aug, 2005

1 commit

  • Fix possible cpuset_sem ABBA deadlock if 'notify_on_release' set.

    For a particular usage pattern, creating and destroying cpusets fairly
    frequently using notify_on_release, on a very large system, this deadlock
    can be seen every few days. If you are not using the cpuset
    notify_on_release feature, you will never see this deadlock.

    The existing code, on task exit (or cpuset deletion) did:

    get cpuset_sem
    if cpuset marked notify_on_release and is ready to release:
        compute cpuset path relative to /dev/cpuset mount point
        call_usermodehelper() forks /sbin/cpuset_release_agent with path
    drop cpuset_sem

    Unfortunately, the fork in call_usermodehelper can allocate memory, and
    allocating memory can require cpuset_sem, if the mems_generation values
    changed in the interim. This results in an ABBA deadlock, trying to obtain
    cpuset_sem when it is already held by the current task.

    To fix this, I put the cpuset path (which must be computed while holding
    cpuset_sem) in a temporary buffer, to be used in the call_usermodehelper
    call of /sbin/cpuset_release_agent only _after_ dropping cpuset_sem.

    So the new logic is:

    get cpuset_sem
    if cpuset marked notify_on_release and is ready to release:
        compute cpuset path relative to /dev/cpuset mount point
        stash path in kmalloc'd buffer
    drop cpuset_sem
    call_usermodehelper() forks /sbin/cpuset_release_agent with path
    free path

    The sharp-eyed reader might notice that this patch does not contain any
    calls to kmalloc. The existing code in the check_for_release() routine was
    already kmalloc'ing a buffer to hold the cpuset path. In the old code, it
    just held the buffer for a few lines, over the cpuset_release_agent() call
    that in turn invoked call_usermodehelper(). In the new code, with the
    application of this patch, it returns that buffer via the new char
    **ppathbuf parameter, for later use and freeing in cpuset_release_agent(),
    which is called after cpuset_sem is dropped. Whereas the old code has just
    one call to cpuset_release_agent(), right in the check_for_release()
    routine, the new code has three calls to cpuset_release_agent(), from the
    various places that a cpuset can be released.
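
    A hedged sketch of the resulting call, as described above (simplified;
    the real kernel/cpuset.c differs in detail). The path is computed and
    stashed while cpuset_sem is held, but the helper is only forked here,
    after the semaphore has been dropped:

    static void cpuset_release_agent(const char *pathbuf)
    {
            char *argv[3], *envp[3];

            if (!pathbuf)
                    return;

            argv[0] = "/sbin/cpuset_release_agent";
            argv[1] = (char *)pathbuf;
            argv[2] = NULL;

            envp[0] = "HOME=/";
            envp[1] = "PATH=/sbin:/bin:/usr/sbin:/usr/bin";
            envp[2] = NULL;

            /* Safe to sleep and allocate here: cpuset_sem is not held. */
            call_usermodehelper(argv[0], argv, envp, 0);
            kfree(pathbuf);
    }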

    This patch has been build and booted on SN2, and passed a stress test that
    previously hit the deadlock within a few seconds.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson