Eric Lee / smarc-fsl-linux-kernel

24 Sep, 2009

40 commits

b873c2f34 memstick: move dev_dbg

id_reg.if_mode might be uninitialized when (*mrq)->error is nonzero. Move
dev_dbg() inside the if so that we are sure we can use id_reg values.

Signed-off-by: Jiri Slaby
Cc: Alex Dubov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jiri Slaby
2009-09-24 22:21:05 +0800
3886de938 adfs: remove redundant test on unsigned

unsigned block cannot be less than 0.

Signed-off-by: Roel Kluin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Roel Kluin
2009-09-24 22:21:05 +0800
458e5ff13 edac: core: remove completion-wait for complete with rcu_barrier

Module edac_core.ko uses call_rcu() callbacks in edac_device.c, edac_mc.c
and edac_pci.c.

They all use a wait_for_completion() scheme, but this scheme is not 100%
safe on multiple CPUs. See the _rcu_barrier() implementation which
explains why extra precaution is needed.

The patch adds a comment about rcu_barrier() and as a precaution calls
rcu_barrier(). A maintainer needs to look at removing the
wait_for_completion code.

[dougthompson@xmission.com: remove the wait_for_completion code]
Signed-off-by: Jesper Dangaard Brouer
Signed-off-by: Doug Thompson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jesper Dangaard Brouer
2009-09-24 22:21:05 +0800
dd8ef1db8 edac: i3200 memory controller driver

A driver for the Intel 3200 and 3210 memory controllers. It has only had
light testing so far, and currently makes no attempt to decode error
addresses at anything finer than csrow granularity.

Signed-off-by: Jason Uhlenkott
Signed-off-by: Doug Thompson
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jason Uhlenkott
2009-09-24 22:21:04 +0800
30a61fff3 edac: fix resource size calculation

Use the function resource_size, which reduces the chance of introducing
off-by-one errors in calculating the resource size.

The semantic patch that makes this change is as follows:
(http://www.emn.fr/x-info/coccinelle/)

//
@@
struct resource *res;
@@

- (res->end - res->start) + 1
+ resource_size(res)
//
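
As a rough userspace illustration of what the helper encapsulates: the range
is inclusive at both ends, so the size is end - start + 1. The names
struct range_res and range_res_size() below are hypothetical stand-ins, not
the kernel's struct resource or resource_size().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the kernel's struct resource. */
struct range_res {
    uint64_t start; /* first address covered */
    uint64_t end;   /* last address covered (inclusive) */
};

/* Size of an inclusive range; this is the idea behind resource_size(). */
static uint64_t range_res_size(const struct range_res *res)
{
    return res->end - res->start + 1;
}
```

Keeping the "+ 1" in one helper means callers cannot forget it.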

Signed-off-by: Julia Lawall
Signed-off-by: Doug Thompson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Julia Lawall
2009-09-24 22:21:04 +0800
b48462517 edac: mpc85xx add mpc83xx support

Add support for the Freescale MPC83xx memory controller to the existing
driver for the Freescale MPC85xx memory controller. The only difference
between the two processors is in the CS_BNDS register parsing code, which
has been changed so it will work on both processors.

The L2 cache controller does not exist on the MPC83xx, but the OF
subsystem will not use the driver if the device is not present in the OF
device tree.

I had to change the nr_pages calculation to make the math work out. I
checked it on my board and did the math by hand for a 64GB 85xx using 64K
pages. In both cases, nr_pages * PAGE_SIZE comes out to the correct
value.

Signed-off-by: Ira W. Snyder
Signed-off-by: Doug Thompson
Cc: Kumar Gala
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ira W. Snyder
2009-09-24 22:21:04 +0800
a014554e6 edac: mpc85xx add P2020DS support

Based on Kumar's new compatible types patch, add P2020 into MPC85xx EDAC
compatible lists so that EDAC can recognize P2020 memory controller and L2
cache controller and export the relevant fields to sysfs.

EDAC MPC85xx DDR3 support is needed if DDR3 memory stick is installed on a
P2020DS board so that EDAC core can recognize DDR3 memory type.

Signed-off-by: Yang Shi
Acked-by: Dave Jiang
Signed-off-by: Doug Thompson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Yang Shi
2009-09-24 22:21:04 +0800
9064a6787 linux/futex.h: place kernel types behind __KERNEL__

The forward decls for some kernel types are only needed by the code behind
__KERNEL__, so don't bleed these types to userspace.

Signed-off-by: Mike Frysinger
Cc: Thomas Gleixner
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mike Frysinger
2009-09-24 22:21:04 +0800
e5a473869 pidns: deny CLONE_PARENT|CLONE_NEWPID combination

CLONE_PARENT was used to implement an older threading model. For
consistency with the CLONE_THREAD check in copy_pid_ns(), disable
CLONE_PARENT with CLONE_NEWPID, at least until the required semantics of
pid namespaces are clear.
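
A minimal userspace sketch of the flag check described above. The function
name check_pidns_clone_flags() is hypothetical, and the fallback constant
definitions are only there in case <sched.h> does not expose them; this is
not the actual copy_pid_ns() code.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>

/* Fallbacks in case the libc headers do not expose the clone flags. */
#ifndef CLONE_PARENT
#define CLONE_PARENT 0x00008000
#endif
#ifndef CLONE_THREAD
#define CLONE_THREAD 0x00010000
#endif
#ifndef CLONE_NEWPID
#define CLONE_NEWPID 0x20000000
#endif

/* Reject CLONE_NEWPID combined with CLONE_THREAD or CLONE_PARENT. */
static int check_pidns_clone_flags(unsigned long flags)
{
    if ((flags & CLONE_NEWPID) && (flags & (CLONE_THREAD | CLONE_PARENT)))
        return -EINVAL;
    return 0;
}
```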

Signed-off-by: Sukadev Bhattiprolu
Acked-by: Roland McGrath
Acked-by: Serge Hallyn
Cc: Oren Laadan
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Sukadev Bhattiprolu
2009-09-24 22:21:04 +0800
123be07b0 fork(): disable CLONE_PARENT for init

When global or container-init processes use CLONE_PARENT, they create a
multi-rooted process tree. Besides, siblings of global init remain as
zombies on exit since they are not reaped by their parent (swapper). So
prevent global and container-inits from creating siblings.

Signed-off-by: Sukadev Bhattiprolu
Acked-by: Eric W. Biederman
Acked-by: Roland McGrath
Cc: Oren Laadan
Cc: Oleg Nesterov
Cc: Serge Hallyn
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Sukadev Bhattiprolu
2009-09-24 22:21:04 +0800
8d65af789 sysctl: remove "struct file *" argument of ->proc_handler

It's unused.

It isn't needed -- read or write flag is already passed and sysctl
shouldn't care about the rest.

It _was_ used in two places at arch/frv for some reason.

Signed-off-by: Alexey Dobriyan
Cc: David Howells
Cc: "Eric W. Biederman"
Cc: Al Viro
Cc: Ralf Baechle
Cc: Martin Schwidefsky
Cc: Ingo Molnar
Cc: "David S. Miller"
Cc: James Morris
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alexey Dobriyan
2009-09-24 22:21:04 +0800
c0d0787b6 MAINTAINERS: add Matt Mackall and Herbert Xu to HARDWARE RANDOM NUMBER GENERATOR

Signed-off-by: Joe Perches
Cc: David Daney
Cc: Herbert Xu
Cc: Matt Mackall
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Joe Perches
2009-09-24 22:21:03 +0800
156dd635e bfin-otp: add writing support

The on-chip OTP may be written at runtime, so enable support for it in the
driver. However, since writing should really be done only on development
systems, don't bend over backwards to make sure the simple software lock
is per-fd -- per-device is OK.

Signed-off-by: Mike Frysinger
Signed-off-by: Bryan Wu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mike Frysinger
2009-09-24 22:21:03 +0800
fbd8ae106 drivers/char/uv_mmtimer.c: add memory mapped RTC driver for UV

This driver memory maps the UV Hub RTC.

Signed-off-by: Dimitri Sivanich
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Dimitri Sivanich
2009-09-24 22:21:03 +0800
459ca8b4e drivers/char/rio/rioctrl.c: off by one error in rioctrl.c

If DownLoad.ProductCode == MAX_PRODUCT, that would be a problem when we do
RIOBootTable[DownLoad.ProductCode] a couple lines down.
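
A small sketch of the bounds check involved, assuming a table with a fixed
number of entries. MAX_PRODUCT_SKETCH and product_code_valid() are
hypothetical names, and the table size is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PRODUCT_SKETCH 16 /* hypothetical number of table entries */

/* Returns 1 when idx is a valid index into a MAX_PRODUCT_SKETCH-entry table. */
static int product_code_valid(size_t idx)
{
    /* idx == MAX_PRODUCT_SKETCH is one past the end and must be rejected. */
    return idx < MAX_PRODUCT_SKETCH;
}
```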

Found by smatch (http://repo.or.cz/w/smatch.git).

Signed-off-by: Dan Carpenter
Cc: Jiri Slaby
Cc: Alan Cox
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Dan Carpenter
2009-09-24 22:21:03 +0800
ae21cf924 hpet: hpet driver periodic timer setup bug fixes

The periodic interrupt from drivers/char/hpet.c does not work correctly,
both when using the periodic capability of the hardware and while
emulating the periodic interrupt (when hardware does not support periodic
mode).

With timers capable of periodic interrupts, the comparator field is first
set with the period value, followed by a write to the hidden accumulator,
which has the side effect of overwriting the comparator value. This results
in the wrong periodicity for the interrupts. For periodic interrupts to work,
the following steps are necessary, in this order.

* Set config with Tn_VAL_SET_CNF bit

* Write to hidden accumulator, the value written is the time when the
first interrupt should be generated

* Write comparator with period interval for subsequent interrupts
(http://www.intel.com/hardwaredesign/hpetspec_1.pdf )

When emulating periodic timer with timers not capable of periodic
interrupt, driver is adding the period to counter value instead of
comparator value, which causes slow drift when using this emulation.
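
The drift in the emulated case can be sketched in plain C. Both function
names are hypothetical, and the code only models the arithmetic, not the
driver's register accesses.

```c
#include <assert.h>
#include <stdint.h>

/* Drifting variant: re-arm relative to the counter value read in the
 * handler, so interrupt latency is added to every following period. */
static uint64_t next_from_counter(uint64_t counter_now, uint64_t period)
{
    return counter_now + period;
}

/* Drift-free variant: re-arm relative to the previous comparator value. */
static uint64_t next_from_comparator(uint64_t prev_cmp, uint64_t period)
{
    return prev_cmp + period;
}
```

If the handler runs 7 ticks after a comparator value of 1000 with a period
of 100, the first variant schedules 1107 while the second schedules 1100.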

Also, driver seems to add hpetp->hp_delta both while setting up periodic
interrupt and while emulating periodic interrupts with timers not capable
of doing periodic interrupts. This hp_delta will result in slower than
expected interrupt rate and should not be used while setting the interval.

Signed-off-by: Venkatesh Pallipadi
Signed-off-by: Nils Carlson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nils Carlson
2009-09-24 22:21:03 +0800
dc80df567 mwave: fix read buffer overflow

Check whether index is within bounds before grabbing the element.

Signed-off-by: Roel Kluin
Cc: Kay Sievers
Cc: Greg Kroah-Hartman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Roel Kluin
2009-09-24 22:21:03 +0800
dd5d81f32 fs/char_dev.c: remove useless loop

There are two useless lines in fs/char_dev.c.

In register_chrdev there is a loop to change all '/' into '!' in the
kernel object name.
This code is useless as the same substitution is in kobject_set_name_vargs in
lib/kobject.c:
    /* ewww... some of these buggers have '/' in the name ... */
    while ((s = strchr(kobj->name, '/')))
        s[0] = '!';

kobject_set_name_vargs is called by kobject_set_name.
kobject_set_name is called just above the useless loop.
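
A standalone sketch of the same substitution, to show why running it a
second time changes nothing. sanitize_kobj_name() is a hypothetical name
used only for this illustration.

```c
#include <assert.h>
#include <string.h>

/* Replace every '/' in name with '!', in place. */
static void sanitize_kobj_name(char *name)
{
    char *s;

    while ((s = strchr(name, '/')))
        s[0] = '!';
}
```

Once the name has been rewritten, a second pass finds no '/' and is a no-op.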

[hidave.darkstar@gmail.com: fix warning, remove the unused char *s]
Signed-off-by: Renzo Davoli
Cc: Al Viro
Signed-off-by: Dave Young
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Renzo Davoli
2009-09-24 22:21:03 +0800
bb521c5de /dev/zero: avoid repeated access_ok() checks

In read_zero, we check for access_ok() once for the count bytes. It is
unnecessarily checked again in clear_user. Use __clear_user, which does
not check for access_ok().

Signed-off-by: Nikanth Karthikesan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nikanth Karthikesan
2009-09-24 22:21:03 +0800
0b8c78f2b flat: use IS_ERR_VALUE() helper macro

There is a common macro now for testing mixed pointer/errno values, so use
that rather than handling the casts ourself.
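
A userspace sketch of the test such a macro performs, assuming the usual
convention that the top 4095 values of an unsigned long encode negative
errno codes. is_err_value() and SKETCH_MAX_ERRNO are hypothetical names for
this illustration.

```c
#include <assert.h>

#define SKETCH_MAX_ERRNO 4095UL /* assumed errno range */

/* True when x, viewed as an unsigned long, lies in the top 4095 values,
 * i.e. it encodes a negative errno rather than a valid address. */
static int is_err_value(unsigned long x)
{
    return x >= (unsigned long)-SKETCH_MAX_ERRNO;
}
```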

Signed-off-by: Mike Frysinger
Acked-by: David McCullough
Acked-by: Greg Ungerer
Cc: David Howells
Cc: Paul Mundt
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mike Frysinger
2009-09-24 22:21:03 +0800
8e8b63a68 fdpic: ignore the loader's PT_GNU_STACK when calculating the stack size

Ignore the loader's PT_GNU_STACK when calculating the stack size, and only
consider the executable's PT_GNU_STACK, assuming the executable has one.

Currently the behaviour is to take the largest stack size and use that,
but that means you can't reduce the stack size in the executable. The
loader's stack size should probably only be used when executing the loader
directly.

WARNING: This patch is slightly dangerous - it may render a system
inoperable if the loader's stack size is larger than that of important
executables, and the system relies unknowingly on this increasing the size
of the stack.

Signed-off-by: David Howells
Signed-off-by: Mike Frysinger
Acked-by: Paul Mundt
Cc: Pavel Machek
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Howells
2009-09-24 22:21:02 +0800
0cf062d0f elf: clean up fill_note_info()

Introduce a helper function elf_note_info_init() to help fill_note_info()
to do initializations, also fix the potential memory leaks.

[akpm@linux-foundation.org: remove NUM_NOTES]
Signed-off-by: WANG Cong
Cc: Alexander Viro
Cc: David Howells
Cc: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Amerigo Wang
2009-09-24 22:21:01 +0800
d9588725e signals: inline __fatal_signal_pending

__fatal_signal_pending inlines to one instruction on x86, probably two
instructions on other machines. It takes two longer x86 instructions just
to call it and test its return value, not to mention the function itself.

On my random x86_64 config, this saved 70 bytes of text (59 of those being
__fatal_signal_pending itself).

Signed-off-by: Roland McGrath
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Roland McGrath
2009-09-24 22:21:01 +0800
ba0a6c9f6 fcntl: add F_[SG]ETOWN_EX

In order to direct the SIGIO signal to a particular thread of a
multi-threaded application we cannot, like suggested by the manpage, put a
TID into the regular fcntl(F_SETOWN) call. It will still be sent to the
whole process of which that thread is part.

Since people do want to properly direct SIGIO we introduce F_SETOWN_EX.

The need to direct SIGIO comes from self-monitoring profiling such as with
perf-counters. Perf-counters uses SIGIO to notify that new sample data is
available. If the signal is delivered to the same task that generated the
new sample it can augment that data by inspecting the task's user-space
state right after it returns from the kernel. This is esp. convenient
for interpreted or virtual machine driven environments.

Both F_SETOWN_EX and F_GETOWN_EX take a pointer to a struct f_owner_ex
as argument:

    struct f_owner_ex {
        int type;
        pid_t pid;
    };

Where type is one of F_OWNER_TID, F_OWNER_PID or F_OWNER_GID.
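
A minimal userspace sketch of how the interface might be used to direct
SIGIO at the calling thread. It assumes a Linux libc that exposes
F_SETOWN_EX and struct f_owner_ex under _GNU_SOURCE; the helper name
sigio_to_this_thread() is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Direct SIGIO for fd to the calling thread only.
 * Returns 0 on success, -1 on error with errno set by fcntl(). */
static int sigio_to_this_thread(int fd)
{
    struct f_owner_ex owner = {
        .type = F_OWNER_TID,
        .pid = (pid_t)syscall(SYS_gettid), /* kernel thread ID of the caller */
    };

    return fcntl(fd, F_SETOWN_EX, &owner);
}
```

F_GETOWN_EX with a pointer to an empty struct f_owner_ex reads the setting
back.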

Signed-off-by: Peter Zijlstra
Reviewed-by: Oleg Nesterov
Tested-by: stephane eranian
Cc: Michael Kerrisk
Cc: Roland McGrath
Cc: Al Viro
Cc: Christoph Hellwig
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Peter Zijlstra
2009-09-24 22:21:01 +0800
06f1631a1 signals: send_sigio: use do_send_sig_info() to avoid check_kill_permission()

group_send_sig_info()->check_kill_permission() assumes that current is the
sender and uses current_cred().

This is not true in send_sigio_to_task() case. From the security pov the
sender is not current, but the task which did fcntl(F_SETOWN), that is why
we have sigio_perm() which uses the right creds to check.

Fortunately, send_sigio() always sends either SEND_SIG_PRIV or
SI_FROMKERNEL() signal, so check_kill_permission() does nothing. But
still it would be tidier to avoid this bogus security check and save a
couple of cycles.

Signed-off-by: Oleg Nesterov
Cc: Peter Zijlstra
Cc: stephane eranian
Cc: Ingo Molnar
Cc: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2009-09-24 22:21:01 +0800
4a30debfb signals: introduce do_send_sig_info() helper

Introduce do_send_sig_info() and convert group_send_sig_info(),
send_sig_info(), do_send_specific() to use this helper.

Hopefully it will have more users soon; it allows specifying
specific/group behaviour via the "bool group" argument.

Shaves 80 bytes from .text.

Signed-off-by: Oleg Nesterov
Cc: Peter Zijlstra
Cc: stephane eranian
Cc: Ingo Molnar
Cc: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2009-09-24 22:21:01 +0800
964ee7df9 exec: fix set_binfmt() vs sys_delete_module() race

sys_delete_module() can set MODULE_STATE_GOING after
search_binary_handler() does try_module_get(). In this case
set_binfmt()->try_module_get() fails but since none of the callers
check the returned error, the task will run with the wrong old
->binfmt.

The proper fix should change all ->load_binary() methods, but we can
rely on fact that the caller must hold a reference to binfmt->module
and use __module_get() which never fails.

Signed-off-by: Oleg Nesterov
Acked-by: Rusty Russell
Cc: Hiroshi Shimamoto
Cc: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2009-09-24 22:21:01 +0800
61be228a0 exec: allow do_coredump() to wait for user space pipe readers to complete

Allow core_pattern pipes to wait for user space to complete

One of the things that user space processes like to do is look at metadata
for a crashing process in their /proc/<pid> directory. This is racy,
however, since do_coredump in the kernel doesn't wait for the user space
process to complete before it reaps the crashing process. This patch
corrects that by allowing the kernel to wait for the user space process to
complete before cleaning up the crashing process. This is a bit tricky to
do for a few reasons:

1) The user space process isn't our child, so we can't sys_wait4 on it
2) We need to close the pipe before waiting for the user process to complete,
since the user process may rely on an EOF condition

I've discussed several solutions with Oleg Nesterov off-list about this,
and this is the one we've come up with. We add ourselves as a pipe reader
(to prevent premature cleanup of the pipe_inode_info), and remove
ourselves as a writer (to provide an EOF condition to the writer in user
space), then we iterate until the user space process exits (which we
detect by pipe->readers == 1, hence the > 1 check in the loop). When we
exit the loop, we restore the proper reader/writer values, then we return
and let filp_close in do_coredump clean up the pipe data properly.

Signed-off-by: Neil Horman
Reported-by: Earl Chew
Cc: Oleg Nesterov
Cc: Andi Kleen
Cc: Alan Cox
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Neil Horman
2009-09-24 22:21:00 +0800
a293980c2 exec: let do_coredump() limit the number of concurrent dumps to pipes

Introduce core pipe limiting sysctl.

Since we can dump cores to pipe, rather than directly to the filesystem,
we create a condition in which a user can create a very high load on the
system simply by running bad applications.

If the pipe reader specified in core_pattern is poorly written, we can
have lots of outstanding resources and processes in the system.

This sysctl introduces an ability to limit that resource consumption.
core_pipe_limit defines how many in-flight dumps may be run in parallel,
dumps beyond this value are skipped and a note is made in the kernel log.
A special value of 0 in core_pipe_limit denotes unlimited core dumps may
be handled (this is the default value).
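
The limiting behaviour can be sketched with an atomic in-flight counter.
All names below (dumps_in_flight, core_pipe_limit_sketch,
core_dump_slot_get(), core_dump_slot_put()) are hypothetical; this models
the described semantics and is not the kernel implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_uint dumps_in_flight;         /* dumps currently running */
static unsigned int core_pipe_limit_sketch; /* 0 means unlimited */

/* Try to claim a dump slot; returns false when the dump should be skipped. */
static bool core_dump_slot_get(void)
{
    unsigned int now = atomic_fetch_add(&dumps_in_flight, 1) + 1;

    if (core_pipe_limit_sketch && now > core_pipe_limit_sketch) {
        atomic_fetch_sub(&dumps_in_flight, 1); /* undo our claim */
        return false;
    }
    return true;
}

/* Release a slot once the dump has finished. */
static void core_dump_slot_put(void)
{
    atomic_fetch_sub(&dumps_in_flight, 1);
}
```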

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Neil Horman
Reported-by: Earl Chew
Cc: Oleg Nesterov
Cc: Andi Kleen
Cc: Alan Cox
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Neil Horman
2009-09-24 22:21:00 +0800
725eae32d exec: make do_coredump() more resilient to recursive crashes

Change how we detect recursive dumps.

Currently we have a mechanism by which we try to compare pathnames of the
crashing process to the core_pattern path. This is broken for a dozen
reasons, and just doesn't work in any sort of robust way.

I'm replacing it with the use of a 0 RLIMIT_CORE value. Since helper apps
set RLIMIT_CORE to zero, we don't write out core files for any process
with that particular limit set. If the core_pattern is a pipe, any
non-zero limit is translated to RLIM_INFINITY.

This allows complete dumps to be captured, but prevents infinite recursion
in the event that the core_pattern process itself crashes.
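
The rule described above can be sketched as a small decision function.
effective_core_limit() is a hypothetical name, and the code only models the
stated policy, not do_coredump() itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/resource.h>

/* Decide the effective core size limit.
 * Returns false when no core should be written at all. */
static bool effective_core_limit(rlim_t limit, bool pattern_is_pipe,
                                 rlim_t *out)
{
    if (limit == 0)
        return false; /* covers helpers running with RLIMIT_CORE == 0 */

    /* For a pipe, any non-zero limit is treated as unlimited. */
    *out = pattern_is_pipe ? RLIM_INFINITY : limit;
    return true;
}
```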

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Neil Horman
Reported-by: Earl Chew
Cc: Oleg Nesterov
Cc: Andi Kleen
Cc: Alan Cox
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Neil Horman
2009-09-24 22:21:00 +0800
ae6d2ed7b signals: tracehook_notify_jctl change

This changes tracehook_notify_jctl() so it's called with the siglock held,
and changes its argument and return value definition. These clean-ups
make it a better fit for what new tracing hooks need to check.

Tracing needs the siglock here, held from the time TASK_STOPPED was set,
to avoid potential SIGCONT races if it wants to allow any blocking in its
tracing hooks.

This also folds the finish_stop() function into its caller
do_signal_stop(). The function is short, called only once and only
unconditionally. It aids readability to fold it in.

[oleg@redhat.com: do not call tracehook_notify_jctl() in TASK_STOPPED state]
[oleg@redhat.com: introduce tracehook_finish_jctl() helper]
Signed-off-by: Roland McGrath
Signed-off-by: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Roland McGrath
2009-09-24 22:21:00 +0800
b6fe2d117 wait_noreap_copyout(): check for ->wo_info != NULL

Current behaviour of sys_waitid() looks odd. If user passes infop ==
NULL, sys_waitid() returns success. When user additionally specifies flag
WNOWAIT, sys_waitid() returns -EFAULT on the same conditions. When user
combines WNOWAIT with WCONTINUED, sys_waitid() again returns success.

This patch adds check for ->wo_info in wait_noreap_copyout().

User-visible change: starting from this commit, sys_waitid() always checks
infop != NULL and does not fail if it is NULL.

Signed-off-by: Vitaly Mayatskikh
Reviewed-by: Oleg Nesterov
Cc: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vitaly Mayatskikh
2009-09-24 22:21:00 +0800
dfe16dfa4 do_wait: fix sys_waitid()-specific behaviour

do_wait() checks ->wo_info to figure out who is the caller. If it's not
NULL the caller should be sys_waitid(), in that case do_wait() fixes up
the retval or zeros ->wo_info, depending on retval from underlying
function.

This is a bug: user can pass ->wo_info == NULL and sys_waitid() will return
incorrect value.

man 2 waitid says:

waitid(): returns 0 on success

Test-case:

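    #include <assert.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        /* Parent: wait for the child with a NULL infop. */
        if (fork())
            assert(waitid(P_ALL, 0, NULL, WEXITED) == 0);

        return 0;
    }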

Result:

Assertion `waitid(P_ALL, 0, ((void *)0), 4) == 0' failed.

Move that code to sys_waitid().

User-visible change: sys_waitid() will return 0 on success, either
infop is set or not.

Note, there's another bug in wait_noreap_copyout() which affects
return value of sys_waitid(). It will be fixed in next patch.

Signed-off-by: Vitaly Mayatskikh
Reviewed-by: Oleg Nesterov
Cc: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vitaly Mayatskikh
2009-09-24 22:21:00 +0800
b6e763f07 wait_consider_task: kill "parent" argument

Kill the unused "parent" argument in wait_consider_task(), it was never used.

Signed-off-by: Oleg Nesterov
Cc: Roland McGrath
Cc: Ingo Molnar
Cc: Ratan Nalumasu
Cc: Vitaly Mayatskikh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2009-09-24 22:21:00 +0800
989264f46 do_wait-wakeup-optimization: simplify task_pid_type()

task_pid_type() is only used by eligible_pid() which has to check wo_type
!= PIDTYPE_MAX anyway. Remove this check from task_pid_type() and factor
out ->pids[type] access, this shrinks .text a bit and simplifies the code.

This matches the behaviour of other similar helpers, say get_task_pid().
The caller must ensure that pid_type is valid, not the callee.

Signed-off-by: Oleg Nesterov
Cc: Roland McGrath
Cc: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2009-09-24 22:21:00 +0800
5c01ba49e do_wait-wakeup-optimization: fix child_wait_callback()->eligible_child() usage

child_wait_callback()->eligible_child() is not right, we can miss the
wakeup if the task was detached before __wake_up_parent() and the caller
of do_wait() didn't use __WALL.

Move ->wo_pid checks from eligible_child() to the new helper,
eligible_pid(), and change child_wait_callback() to use it instead of
eligible_child().

Note: actually I think it would be better to fix the __WCLONE check in
eligible_child(), it doesn't look exactly right. But it is not clear what
is the supposed behaviour, and any change is user-visible.

Reported-by: KAMEZAWA Hiroyuki
Tested-by: KAMEZAWA Hiroyuki
Signed-off-by: Oleg Nesterov
Cc: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2009-09-24 22:21:00 +0800
b4fe51823 do_wait() wakeup optimization: child_wait_callback: check __WNOTHREAD case

Suggested by Roland.

do_wait(__WNOTHREAD) can only succeed if the caller is either ptracer, or
it is ->real_parent and the child is not traced. IOW, caller == p->parent;
otherwise we should not wake up.

Change child_wait_callback() to check this. Ratan reports the workload with
CPU load >99% caused by unnecessary wakeups, should be fixed by this patch.

Signed-off-by: Oleg Nesterov
Acked-by: Roland McGrath
Cc: Ingo Molnar
Cc: Ratan Nalumasu
Cc: Vitaly Mayatskikh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2009-09-24 22:21:00 +0800
0b7570e77 do_wait() wakeup optimization: change __wake_up_parent() to use filtered wakeup

Ratan Nalumasu reported that a process with many threads was doing
unnecessary wakeups. Every waiting thread in the process wakes up to loop
through the children and see that the only ones it cares about are still
not ready.

Now that we have struct wait_opts we can change do_wait/__wake_up_parent
to use filtered wakeups.

We can make child_wait_callback() more clever later, right now it only
checks eligible_child().

Signed-off-by: Oleg Nesterov
Acked-by: Roland McGrath
Cc: Ingo Molnar
Cc: Ratan Nalumasu
Cc: Vitaly Mayatskikh
Acked-by: James Morris
Tested-by: Valdis Kletnieks
Acked-by: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2009-09-24 22:20:59 +0800
a2322e1d2 do_wait() wakeup optimization: shift security_task_wait() from eligible_child() to wait_consider_task()

Preparation, no functional changes.

eligible_child() has a single caller, wait_consider_task(). We can move
security_task_wait() out from eligible_child(), this allows us to use it
for filtered wake_up().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Ratan Nalumasu <rnalumasu@gmail.com>
Cc: Vitaly Mayatskikh <vmayatsk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Oleg Nesterov
2009-09-24 22:20:59 +0800
a7f0765ed ptrace: __ptrace_detach: do __wake_up_parent() if we reap the tracee

The bug is old, it wasn't caused by recent changes.

Test case:

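    #include <assert.h>
    #include <pthread.h>
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/ptrace.h>
    #include <sys/wait.h>

    /* Sub-thread: attach to the child, then kill it. */
    static void *tfunc(void *arg)
    {
        int pid = (long)arg;

        assert(ptrace(PTRACE_ATTACH, pid, NULL, NULL) == 0);
        kill(pid, SIGKILL);

        sleep(1);
        return NULL;
    }

    int main(void)
    {
        pthread_t th;
        long pid = fork();

        if (!pid)
            pause();

        signal(SIGCHLD, SIG_IGN);
        assert(pthread_create(&th, NULL, tfunc, (void*)pid) == 0);

        /* Main thread waits only for its own children. */
        int r = waitpid(-1, NULL, __WNOTHREAD);
        printf("waitpid: %d %m\n", r);

        return 0;
    }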

Before the patch this program hangs, after this patch waitpid() correctly
fails with errno == ECHILD.

The problem is, __ptrace_detach() reaps the EXIT_ZOMBIE tracee if its
->real_parent is our sub-thread and we ignore SIGCHLD. But in this case
we should wake up other threads which can sleep in do_wait().

Signed-off-by: Oleg Nesterov
Cc: Roland McGrath
Cc: Vitaly Mayatskikh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2009-09-24 22:20:59 +0800