01 Nov, 2011
1 commit
-
This removes mm->oom_disable_count entirely since it's unnecessary and
currently buggy. The counter was intended to be per-process but it's
currently decremented in the exit path for each thread that exits, causing
it to underflow.The count was originally intended to prevent oom killing threads that
share memory with threads that cannot be killed since it doesn't lead to
future memory freeing. The counter could be fixed to represent all
threads sharing the same mm, but it's better to remove the count since:- it is possible that the OOM_DISABLE thread sharing memory with the
victim is waiting on that thread to exit and will actually cause
future memory freeing, and- there is no guarantee that a thread is disabled from oom killing just
because another thread sharing its mm is oom disabled.Signed-off-by: David Rientjes
Reported-by: Oleg Nesterov
Reviewed-by: Oleg Nesterov
Cc: Ying Han
Cc: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
27 Jul, 2011
1 commit
-
Add support for the shm_rmid_forced sysctl. If set to 1, all shared
memory objects in current ipc namespace will be automatically forced to
use IPC_RMID.The POSIX way of handling shmem allows one to create shm objects and
call shmdt(), leaving shm object associated with no process, thus
consuming memory not counted via rlimits.With shm_rmid_forced=1 the shared memory object is counted at least for
one process, so OOM killer may effectively kill the fat process holding
the shared memory.It obviously breaks POSIX - some programs relying on the feature would
stop working. So set shm_rmid_forced=1 only if you're sure nobody uses
"orphaned" memory. Use shm_rmid_forced=0 by default for compatability
reasons.The feature was previously impemented in -ow as a configure option.
[akpm@linux-foundation.org: fix documentation, per Randy]
[akpm@linux-foundation.org: fix warning]
[akpm@linux-foundation.org: readability/conventionality tweaks]
[akpm@linux-foundation.org: fix shm_rmid_forced/shm_forced_rmid confusion, use standard comment layout]
Signed-off-by: Vasiliy Kulikov
Cc: Randy Dunlap
Cc: "Eric W. Biederman"
Cc: "Serge E. Hallyn"
Cc: Daniel Lezcano
Cc: Oleg Nesterov
Cc: Tejun Heo
Cc: Ingo Molnar
Cc: Alan Cox
Cc: Solar Designer
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
26 Jul, 2011
2 commits
-
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits)
fs: Merge split strings
treewide: fix potentially dangerous trailing ';' in #defined values/expressions
uwb: Fix misspelling of neighbourhood in comment
net, netfilter: Remove redundant goto in ebt_ulog_packet
trivial: don't touch files that are removed in the staging tree
lib/vsprintf: replace link to Draft by final RFC number
doc: Kconfig: `to be' -> `be'
doc: Kconfig: Typo: square -> squared
doc: Konfig: Documentation/power/{pm => apm-acpi}.txt
drivers/net: static should be at beginning of declaration
drivers/media: static should be at beginning of declaration
drivers/i2c: static should be at beginning of declaration
XTENSA: static should be at beginning of declaration
SH: static should be at beginning of declaration
MIPS: static should be at beginning of declaration
ARM: static should be at beginning of declaration
rcu: treewide: Do not use rcu_read_lock_held when calling rcu_dereference_check
Update my e-mail address
PCIe ASPM: forcedly -> forcibly
gma500: push through device driver tree
...Fix up trivial conflicts:
- arch/arm/mach-ep93xx/dma-m2p.c (deleted)
- drivers/gpio/gpio-ep93xx.c (renamed and context nearby)
- drivers/net/r8169.c (just context changes) -
* 'for-3.1/core' of git://git.kernel.dk/linux-block: (24 commits)
block: strict rq_affinity
backing-dev: use synchronize_rcu_expedited instead of synchronize_rcu
block: fix patch import error in max_discard_sectors check
block: reorder request_queue to remove 64 bit alignment padding
CFQ: add think time check for group
CFQ: add think time check for service tree
CFQ: move think time check variables to a separate struct
fixlet: Remove fs_excl from struct task.
cfq: Remove special treatment for metadata rqs.
block: document blk_plug list access
block: avoid building too big plug list
compat_ioctl: fix make headers_check regression
block: eliminate potential for infinite loop in blkdev_issue_discard
compat_ioctl: fix warning caused by qemu
block: flush MEDIA_CHANGE from drivers on close(2)
blk-throttle: Make total_nr_queued unsigned
block: Add __attribute__((format(printf...) and fix fallout
fs/partitions/check.c: make local symbols static
block:remove some spare spaces in genhd.c
block:fix the comment error in blkdev.h
...
23 Jul, 2011
1 commit
-
* 'ptrace' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc: (39 commits)
ptrace: do_wait(traced_leader_killed_by_mt_exec) can block forever
ptrace: fix ptrace_signal() && STOP_DEQUEUED interaction
connector: add an event for monitoring process tracers
ptrace: dont send SIGSTOP on auto-attach if PT_SEIZED
ptrace: mv send-SIGSTOP from do_fork() to ptrace_init_task()
ptrace_init_task: initialize child->jobctl explicitly
has_stopped_jobs: s/task_is_stopped/SIGNAL_STOP_STOPPED/
ptrace: make former thread ID available via PTRACE_GETEVENTMSG after PTRACE_EVENT_EXEC stop
ptrace: wait_consider_task: s/same_thread_group/ptrace_reparented/
ptrace: kill real_parent_is_ptracer() in in favor of ptrace_reparented()
ptrace: ptrace_reparented() should check same_thread_group()
redefine thread_group_leader() as exit_signal >= 0
do not change dead_task->exit_signal
kill task_detached()
reparent_leader: check EXIT_DEAD instead of task_detached()
make do_notify_parent() __must_check, update the callers
__ptrace_detach: avoid task_detached(), check do_notify_parent()
kill tracehook_notify_death()
make do_notify_parent() return bool
ptrace: s/tracehook_tracer_task()/ptrace_parent()/
...
18 Jul, 2011
1 commit
-
has_stopped_jobs() naively checks task_is_stopped(group_leader). This
was always wrong even without ptrace, group_leader can be dead. And
given that ptrace can change the state to TRACED this is wrong even
in the single-threaded case.Change the code to check SIGNAL_STOP_STOPPED and simplify the code,
retval + break/continue doesn't make this trivial code more readable.We could probably add the usual "|| signal->group_stop_count" check
but I don't think this makes sense, the task can start the group-stop
right after the check anyway.Signed-off-by: Oleg Nesterov
Acked-by: Tejun Heo
12 Jul, 2011
1 commit
-
fs_excl is a poor man's priority inheritance for filesystems to hint to
the block layer that an operation is important. It was never clearly
specified, not widely adopted, and will not prevent starvation in many
cases (like across cgroups).fs_excl was introduced with the time sliced CFQ IO scheduler, to
indicate when a process held FS exclusive resources and thus needed
a boost.It doesn't cover all file systems, and it was never fully complete.
Lets kill it.Signed-off-by: Justin TerAvest
Signed-off-by: Jens Axboe
11 Jul, 2011
1 commit
-
Sync with Linus' tree to be able to apply pending patches that
are based on newer code already present upstream.
09 Jul, 2011
1 commit
-
Since ca5ecddf (rcu: define __rcu address space modifier for sparse)
rcu_dereference_check use rcu_read_lock_held as a part of condition
automatically so callers do not have to do that as well.Signed-off-by: Michal Hocko
Acked-by: Paul E. McKenney
Signed-off-by: Jiri Kosina
28 Jun, 2011
6 commits
-
wait_consider_task() checks same_thread_group(parent, real_parent),
this is the open-coded ptrace_reparented().__ptrace_detach() remains the only function which has to check this by
hand, although we could reorganize the code to delay __ptrace_unlink.Signed-off-by: Oleg Nesterov
Acked-by: Tejun Heo -
Upadate the last user of task_detached(), wait_task_zombie(), to
use thread_group_leader() and kill task_detached().Signed-off-by: Oleg Nesterov
Reviewed-by: Tejun Heo -
Change reparent_leader() to check ->exit_state instead of ->exit_signal,
this matches the similar EXIT_DEAD check in wait_consider_task() and
allows us to cleanup the do_notify_parent/task_detached logic.task_detached() was really needed during reparenting before 9cd80bbb
"do_wait() optimization: do not place sub-threads on ->children list"
to filter out the sub-threads. After this change task_detached(p) can
only be true if p is the dead group_leader and its parent ignores
SIGCHLD, in this case the caller of do_notify_parent() is going to
reap this task and it should set EXIT_DEAD.Signed-off-by: Oleg Nesterov
Reviewed-by: Tejun Heo -
Change other callers of do_notify_parent() to check the value it
returns, this makes the subsequent task_detached() unnecessary.
Mark do_notify_parent() as __must_check.Use thread_group_leader() instead of !task_detached() to check
if we need to notify the real parent in wait_task_zombie().Remove the stale comment in release_task(). "just for sanity" is
no longer true, we have to set EXIT_DEAD to avoid the races with
do_wait().Signed-off-by: Oleg Nesterov
Acked-by: Tejun Heo -
Kill tracehook_notify_death(), reimplement the logic in its caller,
exit_notify().Also, change the exec_id's check to use thread_group_leader() instead
of task_detached(), this is more clear. This logic only applies to
the exiting leader, a sub-thread must never change its exit_signal.Note: when the traced group leader exits the exit_signal-or-SIGCHLD
logic looks really strange:- we notify the tracer even if !thread_group_empty() but
do_wait(WEXITED) can't work until all threads exit- if the tracer is real_parent, it is not clear why can't
we use ->exit_signal event if !thread_group_empty()-v2: do not try to fix the 2nd oddity to avoid the subtle behavior
change mixed with reorganization, suggested by Tejun.Signed-off-by: Oleg Nesterov
Reviewed-by: Tejun Heo -
- change do_notify_parent() to return a boolean, true if the task should
be reaped because its parent ignores SIGCHLD.- update the only caller which checks the returned value, exit_notify().
This temporary uglifies exit_notify() even more, will be cleanuped by
the next change.Signed-off-by: Oleg Nesterov
Acked-by: Tejun Heo
23 Jun, 2011
2 commits
-
At this point, tracehooks aren't useful to mainline kernel and mostly
just add an extra layer of obfuscation. Although they have comments,
without actual in-kernel users, it is difficult to tell what are their
assumptions and they're actually trying to achieve. To mainline
kernel, they just aren't worth keeping around.This patch kills the following trivial tracehooks.
* Ones testing whether task is ptraced. Replace with ->ptrace test.
tracehook_expect_breakpoints()
tracehook_consider_ignored_signal()
tracehook_consider_fatal_signal()* ptrace_event() wrappers. Call directly.
tracehook_report_exec()
tracehook_report_exit()
tracehook_report_vfork_done()* ptrace_release_task() wrapper. Call directly.
tracehook_finish_release_task()
* noop
tracehook_prepare_release_task()
tracehook_report_death()This doesn't introduce any behavior change.
Signed-off-by: Tejun Heo
Cc: Christoph Hellwig
Cc: Martin Schwidefsky
Signed-off-by: Oleg Nesterov -
task_ptrace(task) simply dereferences task->ptrace and isn't even used
consistently only adding confusion. Kill it and directly access
->ptrace instead.This doesn't introduce any behavior change.
Signed-off-by: Tejun Heo
Signed-off-by: Oleg Nesterov
17 Jun, 2011
1 commit
-
The previous patch implemented async notification for ptrace but it
only worked while trace is running. This patch introduces
PTRACE_LISTEN which is suggested by Oleg Nestrov.It's allowed iff tracee is in STOP trap and puts tracee into
quasi-running state - tracee never really runs but wait(2) and
ptrace(2) consider it to be running. While ptracer is listening,
tracee is allowed to re-enter STOP to notify an async event.
Listening state is cleared on the first notification. Ptracer can
also clear it by issuing INTERRUPT - tracee will re-trap into STOP
with listening state cleared.This allows ptracer to monitor group stop state without running tracee
- use INTERRUPT to put tracee into STOP trap, issue LISTEN and then
wait(2) to wait for the next group stop event. When it happens,
PTRACE_GETSIGINFO provides information to determine the current state.Test program follows.
#define PTRACE_SEIZE 0x4206
#define PTRACE_INTERRUPT 0x4207
#define PTRACE_LISTEN 0x4208#define PTRACE_SEIZE_DEVEL 0x80000000
static const struct timespec ts1s = { .tv_sec = 1 };
int main(int argc, char **argv)
{
pid_t tracee, tracer;
int i;tracee = fork();
if (!tracee)
while (1)
pause();tracer = fork();
if (!tracer) {
siginfo_t si;ptrace(PTRACE_SEIZE, tracee, NULL,
(void *)(unsigned long)PTRACE_SEIZE_DEVEL);
ptrace(PTRACE_INTERRUPT, tracee, NULL, NULL);
repeat:
waitid(P_PID, tracee, NULL, WSTOPPED);ptrace(PTRACE_GETSIGINFO, tracee, NULL, &si);
if (!si.si_code) {
printf("tracer: SIG %d\n", si.si_signo);
ptrace(PTRACE_CONT, tracee, NULL,
(void *)(unsigned long)si.si_signo);
goto repeat;
}
printf("tracer: stopped=%d signo=%d\n",
si.si_signo != SIGTRAP, si.si_signo);
if (si.si_signo != SIGTRAP)
ptrace(PTRACE_LISTEN, tracee, NULL, NULL);
else
ptrace(PTRACE_CONT, tracee, NULL, NULL);
goto repeat;
}for (i = 0; i < 3; i++) {
nanosleep(&ts1s, NULL);
printf("mother: SIGSTOP\n");
kill(tracee, SIGSTOP);
nanosleep(&ts1s, NULL);
printf("mother: SIGCONT\n");
kill(tracee, SIGCONT);
}
nanosleep(&ts1s, NULL);kill(tracer, SIGKILL);
kill(tracee, SIGKILL);
return 0;
}This is identical to the program to test TRAP_NOTIFY except that
tracee is PTRACE_LISTEN'd instead of PTRACE_CONT'd when group stopped.
This allows ptracer to monitor when group stop ends without running
tracee.# ./test-listen
tracer: stopped=0 signo=5
mother: SIGSTOP
tracer: SIG 19
tracer: stopped=1 signo=19
mother: SIGCONT
tracer: stopped=0 signo=5
tracer: SIG 18
mother: SIGSTOP
tracer: SIG 19
tracer: stopped=1 signo=19
mother: SIGCONT
tracer: stopped=0 signo=5
tracer: SIG 18
mother: SIGSTOP
tracer: SIG 19
tracer: stopped=1 signo=19
mother: SIGCONT
tracer: stopped=0 signo=5
tracer: SIG 18-v2: Moved JOBCTL_LISTENING check in wait_task_stopped() into
task_stopped_code() as suggested by Oleg.Signed-off-by: Tejun Heo
Cc: Oleg Nesterov
16 Jun, 2011
1 commit
-
The following crash was reported:
> Call Trace:
> [] mem_cgroup_from_task+0x15/0x17
> [] __mem_cgroup_try_charge+0x148/0x4b4
> [] ? need_resched+0x23/0x2d
> [] ? preempt_schedule+0x46/0x4f
> [] mem_cgroup_charge_common+0x9a/0xce
> [] mem_cgroup_newpage_charge+0x5d/0x5f
> [] khugepaged+0x5da/0xfaf
> [] ? __init_waitqueue_head+0x4b/0x4b
> [] ? add_mm_counter.constprop.5+0x13/0x13
> [] kthread+0xa8/0xb0
> [] ? sub_preempt_count+0xa1/0xb4
> [] kernel_thread_helper+0x4/0x10
> [] ? retint_restore_args+0x13/0x13
> [] ? __init_kthread_worker+0x5a/0x5aWhat happens is that khugepaged tries to charge a huge page against an mm
whose last possible owner has already exited, and the memory controller
crashes when the stale mm->owner is used to look up the cgroup to charge.mm->owner has never been set to NULL with the last owner going away, but
nobody cared until khugepaged came along.Even then it wasn't a problem because the final mmput() on an mm was
forced to acquire and release mmap_sem in write-mode, preventing an
exiting owner to go away while the mmap_sem was held, and until "692e0b3
mm: thp: optimize memcg charge in khugepaged", the memory cgroup charge
was protected by mmap_sem in read-mode.Instead of going back to relying on the mmap_sem to enforce lifetime of a
task, this patch ensures that mm->owner is properly set to NULL when the
last possible owner is exiting, which the memory controller can handle
just fine.[akpm@linux-foundation.org: tweak comments]
Signed-off-by: Hugh Dickins
Signed-off-by: KAMEZAWA Hiroyuki
Signed-off-by: Johannes Weiner
Reported-by: Hugh Dickins
Reported-by: Dave Jones
Reviewed-by: Andrea Arcangeli
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
21 May, 2011
1 commit
-
* 'ptrace' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc: (41 commits)
signal: trivial, fix the "timespec declared inside parameter list" warning
job control: reorganize wait_task_stopped()
ptrace: fix signal->wait_chldexit usage in task_clear_group_stop_trapping()
signal: sys_sigprocmask() needs retarget_shared_pending()
signal: cleanup sys_sigprocmask()
signal: rename signandsets() to sigandnsets()
signal: do_sigtimedwait() needs retarget_shared_pending()
signal: introduce do_sigtimedwait() to factor out compat/native code
signal: sys_rt_sigtimedwait: simplify the timeout logic
signal: cleanup sys_rt_sigprocmask()
x86: signal: sys_rt_sigreturn() should use set_current_blocked()
x86: signal: handle_signal() should use set_current_blocked()
signal: sigprocmask() should do retarget_shared_pending()
signal: sigprocmask: narrow the scope of ->siglock
signal: retarget_shared_pending: optimize while_each_thread() loop
signal: retarget_shared_pending: consider shared/unblocked signals only
signal: introduce retarget_shared_pending()
ptrace: ptrace_check_attach() should not do s/STOPPED/TRACED/
signal: Turn SIGNAL_STOP_DEQUEUED into GROUP_STOP_DEQUEUED
signal: do_signal_stop: Remove the unneeded task_clear_group_stop_pending()
...
14 May, 2011
1 commit
-
wait_task_stopped() tested task_stopped_code() without acquiring
siglock and, if stop condition existed, called wait_task_stopped() and
directly returned the result. This patch moves the initial
task_stopped_code() testing into wait_task_stopped() and make
wait_consider_task() fall through to wait_task_continue() on 0 return.This is for the following two reasons.
* Because the initial task_stopped_code() test is done without
acquiring siglock, it may race against SIGCONT generation. The
stopped condition might have been replaced by continued state by the
time wait_task_stopped() acquired siglock. This may lead to
unexpected failure of WNOHANG waits.This reorganization addresses this single race case but there are
other cases - TASK_RUNNING -> TASK_STOPPED transition and EXIT_*
transitions.* Scheduled ptrace updates require changes to the initial test which
would fit better inside wait_task_stopped().Signed-off-by: Tejun Heo
Reviewed-by: Oleg Nesterov
Signed-off-by: Oleg Nesterov
25 Apr, 2011
1 commit
-
When a task is traced and is in a stopped state, the tracer
may execute a ptrace request to examine the tracee state and
get its task struct. Right after, the tracee can be killed
and thus its breakpoints released.
This can happen concurrently when the tracer is in the middle
of reading or modifying these breakpoints, leading to dereferencing
a freed pointer.Hence, to prepare the fix, create a generic breakpoint reference
holding API. When a reference on the breakpoints of a task is
held, the breakpoints won't be released until the last reference
is dropped. After that, no more ptrace request on the task's
breakpoints can be serviced for the tracer.Reported-by: Oleg Nesterov
Signed-off-by: Frederic Weisbecker
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Will Deacon
Cc: Prasad
Cc: Paul Mundt
Cc: v2.6.33..
Link: http://lkml.kernel.org/r/1302284067-7860-2-git-send-email-fweisbec@gmail.com
08 Apr, 2011
1 commit
31 Mar, 2011
1 commit
-
Fixes generated by 'codespell' and manually reviewed.
Signed-off-by: Lucas De Marchi
23 Mar, 2011
3 commits
-
Currently a real parent can't access job control stopped/continued
events through a ptraced child. This utterly breaks job control when
the children are ptraced.For example, if a program is run from an interactive shell and then
strace(1) attaches to it, pressing ^Z would send SIGTSTP and strace(1)
would notice it but the shell has no way to tell whether the child
entered job control stop and thus can't tell when to take over the
terminal - leading to awkward lone ^Z on the terminal.Because the job control and ptrace stopped states are independent,
there is no reason to prevent real parents from accessing the stopped
state regardless of ptrace. The continued state isn't separate but
ptracers don't have any use for them as ptracees can never resume
without explicit command from their ptracers, so as long as ptracers
don't consume it, it should be fine.Although this is a behavior change, because the previous behavior is
utterly broken when viewed from real parents and the change is only
visible to real parents, I don't think it's necessary to make this
behavior optional.One situation to be careful about is when a task from the real
parent's group is ptracing. The parent group is the recipient of both
ptrace and job control stop events and one stop can be reported as
both job control and ptrace stops. As this can break the current
ptrace users, suppress job control stopped events for these cases.If a real parent ptracer wants to know about both job control and
ptrace stops, it can create a separate process to serve the role of
real parent.Note that this only updates wait(2) side of things. The real parent
can access the states via wait(2) but still is not properly notified
(woken up and delivered signal). Test case polls wait(2) with WNOHANG
to work around. Notification will be updated by future patches.Test case follows.
#include
#include
#include
#include
#include
#include
#includeint main(void)
{
const struct timespec ts100ms = { .tv_nsec = 100000000 };
pid_t tracee, tracer;
siginfo_t si;
int i;tracee = fork();
if (tracee == 0) {
while (1) {
printf("tracee: SIGSTOP\n");
raise(SIGSTOP);
nanosleep(&ts100ms, NULL);
printf("tracee: SIGCONT\n");
raise(SIGCONT);
nanosleep(&ts100ms, NULL);
}
}waitid(P_PID, tracee, &si, WSTOPPED | WNOHANG | WNOWAIT);
tracer = fork();
if (tracer == 0) {
nanosleep(&ts100ms, NULL);
ptrace(PTRACE_ATTACH, tracee, NULL, NULL);for (i = 0; i < 11; i++) {
si.si_pid = 0;
waitid(P_PID, tracee, &si, WSTOPPED);
if (si.si_pid && si.si_code == CLD_TRAPPED)
ptrace(PTRACE_CONT, tracee, NULL,
(void *)(long)si.si_status);
}
printf("tracer: EXITING\n");
return 0;
}while (1) {
si.si_pid = 0;
waitid(P_PID, tracee, &si,
WSTOPPED | WCONTINUED | WEXITED | WNOHANG);
if (si.si_pid)
printf("mommy : WAIT status=%02d code=%02d\n",
si.si_status, si.si_code);
nanosleep(&ts100ms, NULL);
}
return 0;
}Before the patch, while ptraced, the parent can't see any job control
events.tracee: SIGSTOP
mommy : WAIT status=19 code=05
tracee: SIGCONT
tracee: SIGSTOP
tracee: SIGCONT
tracee: SIGSTOP
tracee: SIGCONT
tracee: SIGSTOP
tracer: EXITING
mommy : WAIT status=19 code=05
^CAfter the patch,
tracee: SIGSTOP
mommy : WAIT status=19 code=05
tracee: SIGCONT
mommy : WAIT status=18 code=06
tracee: SIGSTOP
mommy : WAIT status=19 code=05
tracee: SIGCONT
mommy : WAIT status=18 code=06
tracee: SIGSTOP
mommy : WAIT status=19 code=05
tracee: SIGCONT
mommy : WAIT status=18 code=06
tracee: SIGSTOP
tracer: EXITING
mommy : WAIT status=19 code=05
^C-v2: Oleg pointed out that wait(2) should be suppressed for the real
parent's group instead of only the real parent task itself.
Updated accordingly.Signed-off-by: Tejun Heo
Acked-by: Oleg Nesterov -
wait(2) and friends allow access to stopped/continued states through
zombies, which is required as the states are process-wide and should
be accessible whether the leader task is alive or undead.
wait_consider_task() implements this by always clearing notask_error
and going through wait_task_stopped/continued() for unreaped zombies.However, while ptraced, the stopped state is per-task and as such if
the ptracee became a zombie, there's no further stopped event to
listen to and wait(2) and friends should return -ECHILD on the tracee.Fix it by clearing notask_error only if WCONTINUED | WEXITED is set
for ptraced zombies. While at it, document why clearing notask_error
is safe for each case.Test case follows.
#include
#include
#include
#include
#include
#include
#includestatic void *nooper(void *arg)
{
pause();
return NULL;
}int main(void)
{
const struct timespec ts1s = { .tv_sec = 1 };
pid_t tracee, tracer;
siginfo_t si;tracee = fork();
if (tracee == 0) {
pthread_t thr;pthread_create(&thr, NULL, nooper, NULL);
nanosleep(&ts1s, NULL);
printf("tracee exiting\n");
pthread_exit(NULL); /* let subthread run */
}tracer = fork();
if (tracer == 0) {
ptrace(PTRACE_ATTACH, tracee, NULL, NULL);
while (1) {
if (waitid(P_PID, tracee, &si, WSTOPPED) < 0) {
perror("waitid");
break;
}
ptrace(PTRACE_CONT, tracee, NULL,
(void *)(long)si.si_status);
}
return 0;
}waitid(P_PID, tracer, &si, WEXITED);
kill(tracee, SIGKILL);
return 0;
}Before the patch, after the tracee becomes a zombie, the tracer's
waitid(WSTOPPED) never returns and the program doesn't terminate.tracee exiting
^CAfter the patch, tracee exiting triggers waitid() to fail.
tracee exiting
waitid: No child processes-v2: Oleg pointed out that exited in addition to continued can happen
for ptraced dead group leader. Clear notask_error for ptraced
child on WEXITED too.Signed-off-by: Tejun Heo
Acked-by: Oleg Nesterov -
Move EXIT_DEAD test in wait_consider_task() above ptrace check. As
ptraced tasks can't be EXIT_DEAD, this change doesn't cause any
behavior change. This is to prepare for further changes.Signed-off-by: Tejun Heo
Acked-by: Oleg Nesterov
10 Mar, 2011
1 commit
-
This patch adds support for creating a queuing context outside
of the queue itself. This enables us to batch up pieces of IO
before grabbing the block device queue lock and submitting them to
the IO scheduler.The context is created on the stack of the process and assigned in
the task structure, so that we can auto-unplug it if we hit a schedule
event.The current queue plugging happens implicitly if IO is submitted to
an empty device, yet callers have to remember to unplug that IO when
they are going to wait for it. This is an ugly API and has caused bugs
in the past. Additionally, it requires hacks in the vm (->sync_page()
callback) to handle that logic. By switching to an explicit plugging
scheme we make the API a lot nicer and can get rid of the ->sync_page()
hack in the vm.Signed-off-by: Jens Axboe
12 Jan, 2011
1 commit
-
…/git/tip/linux-2.6-tip
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (28 commits)
perf session: Fix infinite loop in __perf_session__process_events
perf evsel: Support perf_evsel__open(cpus > 1 && threads > 1)
perf sched: Use PTHREAD_STACK_MIN to avoid pthread_attr_setstacksize() fail
perf tools: Emit clearer message for sys_perf_event_open ENOENT return
perf stat: better error message for unsupported events
perf sched: Fix allocation result check
perf, x86: P4 PMU - Fix unflagged overflows handling
dynamic debug: Fix build issue with older gcc
tracing: Fix TRACE_EVENT power tracepoint creation
tracing: Fix preempt count leak
tracepoint: Add __rcu annotation
tracing: remove duplicate null-pointer check in skb tracepoint
tracing/trivial: Add missing comma in TRACE_EVENT comment
tracing: Include module.h in define_trace.h
x86: Save rbp in pt_regs on irq entry
x86, dumpstack: Fix unused variable warning
x86, NMI: Clean-up default_do_nmi()
x86, NMI: Allow NMI reason io port (0x61) to be processed on any CPU
x86, NMI: Remove DIE_NMI_IPI
x86, NMI: Add priorities to handlers
...
07 Jan, 2011
1 commit
-
In particular this patch move perf_event_exit_task() before
cgroup_exit() to allow for cgroup support. The cgroup_exit()
function detaches the cgroups attached to a task.Other movements include hoisting some definitions and inlines
at the top of perf_event.cSigned-off-by: Stephane Eranian
Signed-off-by: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar
17 Dec, 2010
1 commit
-
__get_cpu_var() can be replaced with this_cpu_read and will then use a
single read instruction with implied address calculation to access the
correct per cpu instance.However, the address of a per cpu variable passed to __this_cpu_read()
cannot be determined (since it's an implied address conversion through
segment prefixes). Therefore apply this only to uses of __get_cpu_var
where the address of the variable is not used.Cc: Pekka Enberg
Cc: Hugh Dickins
Cc: Thomas Gleixner
Acked-by: H. Peter Anvin
Signed-off-by: Christoph Lameter
Signed-off-by: Tejun Heo
03 Dec, 2010
1 commit
-
If a user manages to trigger an oops with fs set to KERNEL_DS, fs is not
otherwise reset before do_exit(). do_exit may later (via mm_release in
fork.c) do a put_user to a user-controlled address, potentially allowing
a user to leverage an oops into a controlled write into kernel memory.This is only triggerable in the presence of another bug, but this
potentially turns a lot of DoS bugs into privilege escalations, so it's
worth fixing. I have proof-of-concept code which uses this bug along
with CVE-2010-3849 to write a zero to an arbitrary kernel address, so
I've tested that this is not theoretical.A more logical place to put this fix might be when we know an oops has
occurred, before we call do_exit(), but that would involve changing
every architecture, in multiple places.Let's just stick it in do_exit instead.
[akpm@linux-foundation.org: update code comment]
Signed-off-by: Nelson Elhage
Cc: KOSAKI Motohiro
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
06 Nov, 2010
1 commit
-
posix-cpu-timers.c correctly assumes that the dying process does
posix_cpu_timers_exit_group() and removes all !CPUCLOCK_PERTHREAD
timers from signal->cpu_timers list.But, it also assumes that timer->it.cpu.task is always the group
leader, and thus the dead ->task means the dead thread group.This is obviously not true after de_thread() changes the leader.
After that almost every posix_cpu_timer_ method has problems.It is not simple to fix this bug correctly. First of all, I think
that timer->it.cpu should use struct pid instead of task_struct.
Also, the locking should be reworked completely. In particular,
tasklist_lock should not be used at all. This all needs a lot of
nontrivial and hard-to-test changes.Change __exit_signal() to do posix_cpu_timers_exit_group() when
the old leader dies during exec. This is not the fix, just the
temporary hack to hide the problem for 2.6.37 and stable. IOW,
this is obviously wrong but this is what we currently have anyway:
cpu timers do not work after mt exec.In theory this change adds another race. The exiting leader can
detach the timers which were attached to the new leader. However,
the window between de_thread() and release_task() is small, we
can pretend that sys_timer_create() was called before de_thread().Signed-off-by: Oleg Nesterov
Signed-off-by: Linus Torvalds
28 Oct, 2010
1 commit
-
find_new_reaper() releases and regrabs tasklist_lock but was missing
proper annotations. Add it. This remove following sparse warning:warning: context imbalance in 'find_new_reaper' - unexpected unlock
Signed-off-by: Namhyung Kim
Acked-by: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
27 Oct, 2010
1 commit
-
It's pointless to kill a task if another thread sharing its mm cannot be
killed to allow future memory freeing. A subsequent patch will prevent
kills in such cases, but first it's necessary to have a way to flag a task
that shares memory with an OOM_DISABLE task that doesn't incur an
additional tasklist scan, which would make select_bad_process() an O(n^2)
function.This patch adds an atomic counter to struct mm_struct that follows how
many threads attached to it have an oom_score_adj of OOM_SCORE_ADJ_MIN.
They cannot be killed by the kernel, so their memory cannot be freed in
oom conditions.This only requires task_lock() on the task that we're operating on, it
does not require mm->mmap_sem since task_lock() pins the mm and the
operation is atomic.[rientjes@google.com: changelog and sys_unshare() code]
[rientjes@google.com: protect oom_disable_count with task_lock in fork]
[rientjes@google.com: use old_mm for oom_disable_count in exec]
Signed-off-by: Ying Han
Signed-off-by: David Rientjes
Cc: KAMEZAWA Hiroyuki
Cc: KOSAKI Motohiro
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
10 Sep, 2010
1 commit
-
I missed a perf_event_ctxp user when converting it to an array. Pull this
last user into perf_event.c as well and fix it up.Signed-off-by: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar
18 Aug, 2010
1 commit
-
Using a program like the following:
#include
#include
#include
#includeint main() {
id_t id;
siginfo_t infop;
pid_t res;id = fork();
if (id == 0) { sleep(1); exit(0); }
kill(id, SIGSTOP);
alarm(1);
waitid(P_PID, id, &infop, WCONTINUED);
return 0;
}to call waitid() on a stopped process results in access to the child task's
credentials without the RCU read lock being held - which may be replaced in the
meantime - eliciting the following warning:===================================================
[ INFO: suspicious rcu_dereference_check() usage. ]
---------------------------------------------------
kernel/exit.c:1460 invoked rcu_dereference_check() without protection!other info that might help us debug this:
rcu_scheduler_active = 1, debug_locks = 1
2 locks held by waitid02/22252:
#0: (tasklist_lock){.?.?..}, at: [] do_wait+0xc5/0x310
#1: (&(&sighand->siglock)->rlock){-.-...}, at: []
wait_consider_task+0x19a/0xbe0stack backtrace:
Pid: 22252, comm: waitid02 Not tainted 2.6.35-323cd+ #3
Call Trace:
[] lockdep_rcu_dereference+0xa4/0xc0
[] wait_consider_task+0xaf1/0xbe0
[] do_wait+0xf5/0x310
[] sys_waitid+0x86/0x1f0
[] ? child_wait_callback+0x0/0x70
[] system_call_fastpath+0x16/0x1bThis is fixed by holding the RCU read lock in wait_task_continued() to ensure
that the task's current credentials aren't destroyed between us reading the
cred pointer and us reading the UID from those credentials.Furthermore, protect wait_task_stopped() in the same way.
We don't need to keep holding the RCU read lock once we've read the UID from
the credentials as holding the RCU read lock doesn't stop the target task from
changing its creds under us - so the credentials may be outdated immediately
after we've read the pointer, lock or no lock.Signed-off-by: Daniel J Blueman
Signed-off-by: David Howells
Acked-by: Paul E. McKenney
Acked-by: Oleg Nesterov
Signed-off-by: Linus Torvalds
11 Aug, 2010
1 commit
-
exit_ptrace() takes tasklist_lock unconditionally. We need this lock to
avoid the race with ptrace_traceme(), it acts as a barrier.Change its caller, forget_original_parent(), to call exit_ptrace() under
tasklist_lock. Change exit_ptrace() to drop and reacquire this lock if
needed.This allows us to add the fastpath list_empty(ptraced) check. In the
likely no-tracees case exit_ptrace() just returns and we avoid the lock()
+ unlock() sequence."Zhang, Yanmin" suggested to add this
check, and he reports that this change adds about 11% improvement in some
tests.Suggested-and-tested-by: "Zhang, Yanmin"
Signed-off-by: Oleg Nesterov
Acked-by: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
28 May, 2010
2 commits
-
No functional changes, just s/atomic_t count/int nr_threads/.
With the recent changes this counter has a single user, get_nr_threads()
And, none of its callers need the really accurate number of threads, not
to mention each caller obviously races with fork/exit. It is only used to
report this value to the user-space, except first_tid() uses it to avoid
the unnecessary while_each_thread() loop in the unlikely case.It is a bit sad we need a word in struct signal_struct for this, perhaps
we can change get_nr_threads() to approximate the number of threads using
signal->live and kill ->nr_threads later.[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Oleg Nesterov
Cc: Alexey Dobriyan
Cc: "Eric W. Biederman"
Acked-by: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Move taskstats_tgid_free() from __exit_signal() to free_signal_struct().
This way signal->stats never points to nowhere and we can read ->stats
lockless.Signed-off-by: Oleg Nesterov
Cc: Balbir Singh
Cc: Roland McGrath
Cc: Veaceslav Falico
Cc: Stanislaw Gruszka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds