Eric Lee / smarc-ti-linux-kernel | Embedian Git Server

09 Aug, 2014

1 commit

a0be55dee kernel/exit.c: fix coding style warnings and errors

Fixed coding style warnings and errors.

Signed-off-by: Ionut Alexa
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ionut Alexa
2014-08-09 06:57:22 +0800

07 Aug, 2014

1 commit

fb794bcbb mm, oom: remove unnecessary exit_state check

The oom killer scans each process and determines whether it is eligible
for oom kill or whether the oom killer should abort because of
concurrent memory freeing. It will abort when an eligible process is
found to have TIF_MEMDIE set, meaning it has already been oom killed and
we're waiting for it to exit.

Processes with task->mm == NULL should not be considered because they
are either kthreads or have already detached their memory and killing
them would not lead to memory freeing. That memory is only freed after
exit_mm() has returned, however, and not when task->mm is first set to
NULL.

Clear TIF_MEMDIE after exit_mm()'s mmput() so that an oom killed process
is no longer considered for oom kill, but only until exit_mm() has
returned. This was fragile in the past because it relied on
exit_notify() to be reached before no longer considering TIF_MEMDIE
processes.

Signed-off-by: David Rientjes
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Rientjes
2014-08-07 09:01:21 +0800

07 Jun, 2014

1 commit

0341729b4 signals: mv {dis,}allow_signal() from sched.h/exit.c to signal.[ch]

Move the declaration/definition of allow_signal/disallow_signal to
signal.h/signal.c. The new place is more logical and allows us to use the
static helpers in signal.c (see the next changes).

While at it, make them return void and remove the valid_signal() check.
Nobody checks the returned value, and in-kernel users must not pass the
wrong signal number.

Signed-off-by: Oleg Nesterov
Cc: Peter Zijlstra
Cc: Al Viro
Cc: David Woodhouse
Cc: Frederic Weisbecker
Cc: Geert Uytterhoeven
Cc: Ingo Molnar
Cc: Mathieu Desnoyers
Cc: Richard Weinberger
Cc: Steven Rostedt
Cc: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2014-06-07 07:08:11 +0800

05 Jun, 2014

3 commits

39af1765f memcg: optimize the "Search everything else" loop in mm_update_next_owner()

for_each_process_thread() is sub-optimal. All threads share the same
->mm, we can switch to the next process once we found a thread with
->mm != NULL and ->mm != mm.
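
The early exit described above can be sketched in user space. This is only an illustration under stated assumptions: struct mm, struct thread, struct process and find_mm_user() are hypothetical stand-ins, not the kernel's types or the patch itself, and the visited counter exists only to make the saving observable.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; not the kernel's task_struct. */
struct mm { int id; };

struct thread {
	struct mm *mm;		/* NULL once the thread has dropped its mm */
};

struct process {
	struct thread *threads;
	size_t nr_threads;
};

/*
 * Return the first thread that still uses `mm`, or NULL.
 * Threads of one process share ->mm, so the first thread with a
 * non-NULL ->mm decides the outcome for the whole process.
 * `visited` counts how many threads were inspected.
 */
static struct thread *find_mm_user(struct process *procs, size_t nr_procs,
				   const struct mm *mm, size_t *visited)
{
	*visited = 0;
	for (size_t p = 0; p < nr_procs; p++) {
		for (size_t t = 0; t < procs[p].nr_threads; t++) {
			struct thread *c = &procs[p].threads[t];

			(*visited)++;
			if (c->mm == mm)
				return c;
			if (c->mm)
				break;	/* different mm: skip the rest of this process */
		}
	}
	return NULL;
}
```

With a three-thread process that uses another mm followed by a process whose second thread uses the target mm, only three threads are inspected instead of five.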

Signed-off-by: Oleg Nesterov
Reviewed-by: Michal Hocko
Cc: Balbir Singh
Cc: Johannes Weiner
Cc: KAMEZAWA Hiroyuki
Cc: Michal Hocko
Cc: Peter Chiang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2014-06-05 07:54:03 +0800

f87fb599a memcg: mm_update_next_owner() should skip kthreads

"Search through everything else" in mm_update_next_owner() can hit a
kthread which adopted this "mm" via use_mm(), it should not be used as
mm->owner. Add the PF_KTHREAD check.

While at it, change this code to use for_each_process_thread() instead
of deprecated do_each_thread/while_each_thread.

Signed-off-by: Oleg Nesterov
Reviewed-by: Michal Hocko
Cc: Balbir Singh
Cc: Johannes Weiner
Cc: KAMEZAWA Hiroyuki
Cc: Michal Hocko
Cc: Peter Chiang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2014-06-05 07:54:03 +0800

f98bafa06 memcg: kill CONFIG_MM_OWNER

CONFIG_MM_OWNER makes no sense. It is not user-selectable, it is only
selected by CONFIG_MEMCG automatically. So we can kill this option in
init/Kconfig and do s/CONFIG_MM_OWNER/CONFIG_MEMCG/ globally.

Signed-off-by: Oleg Nesterov
Acked-by: Michal Hocko
Acked-by: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2014-06-05 07:54:01 +0800

08 Apr, 2014

9 commits

7c733eb3e wait: WSTOPPED|WCONTINUED doesn't work if a zombie leader is traced by another process

Even if the main thread is dead the process still can stop/continue.
However, if the leader is ptraced wait_consider_task(ptrace => false)
always skips wait_task_stopped/wait_task_continued, so WSTOPPED or
WCONTINUED can never work for the natural parent in this case.

Move the "A zombie ptracee is only visible to its ptracer" check into the
"if (!delay_group_leader(p))" block. ->notask_error is cleared by the
"fall through" code below.

This depends on the previous change, wait_task_stopped/continued must be
avoided if !delay_group_leader() and the tracer is ->real_parent.
Otherwise WSTOPPED|WEXITED could wrongly report "stopped" when the child
is already dead (single-threaded or not). If it is traced by another task
then the "stopped" state is fine until the debugger detaches and reveals a
zombie state.

Stupid test-case:

	#include <assert.h>
	#include <pthread.h>
	#include <signal.h>
	#include <stdlib.h>
	#include <unistd.h>
	#include <sys/ptrace.h>
	#include <sys/wait.h>

	void *tfunc(void *arg)
	{
		sleep(1);	// wait for zombie leader
		raise(SIGSTOP);
		exit(0x13);
		return NULL;
	}

	int run_child(void)
	{
		pthread_t thread;

		if (!fork()) {
			int tracee = getppid();

			assert(ptrace(PTRACE_ATTACH, tracee, 0,0) == 0);
			do
				ptrace(PTRACE_CONT, tracee, 0,0);
			while (wait(NULL) > 0);

			return 0;
		}

		sleep(1);	// wait for PTRACE_ATTACH
		assert(pthread_create(&thread, NULL, tfunc, NULL) == 0);
		pthread_exit(NULL);
	}

	int main(void)
	{
		int child, stat;

		child = fork();
		if (!child)
			return run_child();

		assert(child == waitpid(-1, &stat, WSTOPPED));
		assert(stat == 0x137f);

		kill(child, SIGCONT);

		assert(child == waitpid(-1, &stat, WCONTINUED));
		assert(stat == 0xffff);

		assert(child == waitpid(-1, &stat, 0));
		assert(stat == 0x1300);

		return 0;
	}

Without this patch it hangs in waitpid(WSTOPPED), wait_task_stopped() is
never called.

Note: this doesn't fix all problems with a zombie delay_group_leader(),
WCONTINUED | WEXITED check is not exactly right. debugger can't assume it
will be notified if another thread reaps the whole thread group.

Signed-off-by: Oleg Nesterov
Cc: Al Viro
Cc: Jan Kratochvil
Cc: Lennart Poettering
Cc: Michal Schmidt
Cc: Roland McGrath
Cc: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2014-04-08 07:36:06 +0800

377d75daf wait: WSTOPPED|WCONTINUED hangs if a zombie child is traced by real_parent

"A zombie is only visible to its ptracer" logic in wait_consider_task()
is very wrong. Trivial test-case:

	#include <unistd.h>
	#include <sys/ptrace.h>
	#include <sys/wait.h>
	#include <assert.h>

	int main(void)
	{
		int child = fork();

		if (!child) {
			assert(ptrace(PTRACE_TRACEME, 0,0,0) == 0);
			return 0x23;
		}

		assert(waitid(P_ALL, child, NULL, WEXITED | WNOWAIT) == 0);
		assert(waitid(P_ALL, 0, NULL, WSTOPPED) == -1);
		return 0;
	}

it hangs in waitpid(WSTOPPED) despite the fact it has a single zombie
child. This is because wait_consider_task(ptrace => 0) sees p->ptrace and
clears ->notask_error assuming that the debugger should detach and notify
us.

Change wait_consider_task(ptrace => 0) to pretend that ptrace == T if the
child is traced by us. This really simplifies the logic and allows us to
do more fixes, see the next changes. This also hides the unwanted group
stop state automatically, we can remove another ptrace_reparented() check.

Unfortunately, this adds the following behavioural changes:

1. Before this patch wait(WEXITED | __WNOTHREAD) does not reap
a natural child if it is traced by the caller's sub-thread.

Hopefully nobody will ever notice this change, and I think
that nobody should rely on this behaviour anyway.

2. SIGNAL_STOP_CONTINUED is no longer hidden from debugger if
it is real parent.

While this change comes as a side effect, I think it is good
by itself. The group continued state can not be consumed by
another process in this case, it doesn't depend on ptrace,
it doesn't make sense to hide it from real parent.

Perhaps we should add the thread_group_leader() check before
wait_task_continued()? Maybe, but this shouldn't depend on
ptrace_reparented().

Signed-off-by: Oleg Nesterov
Cc: Al Viro
Cc: Jan Kratochvil
Cc: Lennart Poettering
Cc: Michal Schmidt
Cc: Roland McGrath
Cc: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2014-04-08 07:36:06 +0800

b3ab03160 wait: completely ignore the EXIT_DEAD tasks

Now that EXIT_DEAD is the terminal state it doesn't make sense to call
eligible_child() or security_task_wait() if the task is really dead.

Signed-off-by: Oleg Nesterov
Tested-by: Michal Schmidt
Cc: Jan Kratochvil
Cc: Al Viro
Cc: Lennart Poettering
Cc: Roland McGrath
Cc: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2014-04-08 07:36:06 +0800

b43606905 wait: use EXIT_TRACE only if thread_group_leader(zombie)

wait_task_zombie() always uses EXIT_TRACE/ptrace_unlink() if
ptrace_reparented(). This is suboptimal and a bit confusing: we do not
need do_notify_parent(p) if !thread_group_leader(p) and in this case we
also do not need ptrace_unlink(), we can rely on ptrace_release_task().

Change wait_task_zombie() to check thread_group_leader() along with
ptrace_reparented() and simplify the final p->exit_state transition.

Signed-off-by: Oleg Nesterov
Tested-by: Michal Schmidt
Cc: Jan Kratochvil
Cc: Al Viro
Cc: Lennart Poettering
Cc: Roland McGrath
Cc: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2014-04-08 07:36:05 +0800

abd50b39e wait: introduce EXIT_TRACE to avoid the racy EXIT_DEAD->EXIT_ZOMBIE transition

wait_task_zombie() first does EXIT_ZOMBIE->EXIT_DEAD transition and
drops tasklist_lock. If this task is not the natural child and it is
traced, we change its state back to EXIT_ZOMBIE for ->real_parent.

The last transition is racy, this is even documented in 50b8d257486a
"ptrace: partially fix the do_wait(WEXITED) vs EXIT_DEAD->EXIT_ZOMBIE
race". wait_consider_task() tries to detect this transition and clear
->notask_error but we can't rely on ptrace_reparented(), debugger can
exit and do ptrace_unlink() before its sub-thread sets EXIT_ZOMBIE.

And there is another problem which was missed before: this transition
can also race with reparent_leader() which doesn't reset ->exit_signal if
EXIT_DEAD, assuming that this task must be reaped by someone else. So
the tracee can be re-parented with ->exit_signal != SIGCHLD, and if
/sbin/init doesn't use __WALL it becomes unreapable. This was fixed by
the previous commit, but it was the temporary hack.

1. Add the new exit_state, EXIT_TRACE. It means that the task is the
traced zombie, debugger is going to detach and notify its natural
parent.

This new state is actually EXIT_ZOMBIE | EXIT_DEAD. This way we
can avoid the changes in proc/kgdb code, get_task_state() still
reports "X (dead)" in this case.

Note: with or without this change userspace can see Z -> X -> Z
transition. Not really bad, but probably makes sense to fix.

2. Change wait_task_zombie() to use EXIT_TRACE instead of EXIT_DEAD
if we need to notify the ->real_parent.

3. Revert the previous hack in reparent_leader(), now that EXIT_DEAD
is always the final state we can safely ignore such a task.

4. Change wait_consider_task() to check EXIT_TRACE separately and kill
the racy and no longer needed ptrace_reparented() case.

If ptrace == T an EXIT_TRACE thread should be simply ignored, the
owner of this state is going to ptrace_unlink() this task. We can
pretend that it was already removed from ->ptraced list.

Otherwise we should skip this thread too but clear ->notask_error,
we must be the natural parent and debugger is going to untrace and
notify us. IOW, this doesn't differ from "EXIT_ZOMBIE && p->ptrace"
even if the task was already untraced.
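
The "EXIT_ZOMBIE | EXIT_DEAD" encoding from point 1 can be sketched as follows. This is a minimal illustration, assuming the bit values 16 and 32 and a reporter that indexes a letter table by the highest set bit; state_letter() and is_exit_trace() are hypothetical helpers, not kernel functions.

```c
#include <assert.h>

/* Assumed bit values, chosen for illustration only. */
#define EXIT_ZOMBIE	16
#define EXIT_DEAD	32
#define EXIT_TRACE	(EXIT_ZOMBIE | EXIT_DEAD)

/*
 * Hypothetical reporter: walks to the highest set bit and uses its
 * position as an index, so a combined state reports like its top bit.
 * Only valid for exit_state values below 64.
 */
static char state_letter(unsigned int exit_state)
{
	static const char letters[] = { 'R', 'S', 'D', 'T', 't', 'Z', 'X' };
	unsigned int idx = 0;

	while (exit_state) {
		idx++;
		exit_state >>= 1;
	}
	return letters[idx];
}

/* An exact comparison still tells the combined state apart. */
static int is_exit_trace(unsigned int exit_state)
{
	return exit_state == EXIT_TRACE;
}
```

Under these assumptions state_letter(EXIT_TRACE) yields 'X', the same letter as EXIT_DEAD, while is_exit_trace() lets the wait code treat the traced zombie separately.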

Signed-off-by: Oleg Nesterov
Reported-by: Jan Kratochvil
Reported-by: Michal Schmidt
Tested-by: Michal Schmidt
Cc: Al Viro
Cc: Lennart Poettering
Cc: Roland McGrath
Cc: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2014-04-08 07:36:05 +0800

dfccbb5e4 wait: fix reparent_leader() vs EXIT_DEAD->EXIT_ZOMBIE race

wait_task_zombie() first does EXIT_ZOMBIE->EXIT_DEAD transition and
drops tasklist_lock. If this task is not the natural child and it is
traced, we change its state back to EXIT_ZOMBIE for ->real_parent.

The last transition is racy, this is even documented in 50b8d257486a
"ptrace: partially fix the do_wait(WEXITED) vs EXIT_DEAD->EXIT_ZOMBIE
race". wait_consider_task() tries to detect this transition and clear
->notask_error but we can't rely on ptrace_reparented(), debugger can
exit and do ptrace_unlink() before its sub-thread sets EXIT_ZOMBIE.

And there is another problem which was missed before: this transition
can also race with reparent_leader() which doesn't reset ->exit_signal if
EXIT_DEAD, assuming that this task must be reaped by someone else. So
the tracee can be re-parented with ->exit_signal != SIGCHLD, and if
/sbin/init doesn't use __WALL it becomes unreapable.

Change reparent_leader() to update ->exit_signal even if EXIT_DEAD.
Note: this is the simple temporary hack for -stable, it doesn't try to
solve all problems, it will be reverted by the next changes.

Signed-off-by: Oleg Nesterov
Reported-by: Jan Kratochvil
Reported-by: Michal Schmidt
Tested-by: Michal Schmidt
Cc: Al Viro
Cc: Lennart Poettering
Cc: Roland McGrath
Cc: Tejun Heo
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2014-04-08 07:36:05 +0800

ef9823939 kernel/exit.c: call proc_exit_connector() after exit_state is set

The process events connector delivers a notification when a process
exits. This is really convenient for a process that spawns and wants to
monitor its children through an epoll-able() interface.

Unfortunately, there is a small window between when the event is
delivered and the child becomes wait()-able.

This creates a race if the parent wants to make sure that it knows
about the exit, e.g.

	pid_t pid = fork();
	if (pid > 0) {
		register_interest_for_pid(pid);
		if (waitpid(pid, NULL, WNOHANG) > 0)
		{
			/* We might have raced with exit() */
		}
		return;
	}

	/* Child */
	execve(...)

register_interest_for_pid() would be telling the connector socket
reader to pay attention to events related to pid.

Though this is not a bug, I think it would make the connector a bit more
usable if this race was closed by simply moving the call to
proc_exit_connector() from just before exit_notify() to right after.

Oleg said:

: Even with this patch the code above is still "racy" if the child is
: multi-threaded. Plus it should obviously filter-out subthreads. And
: afaics there is no way to make it reliable, even if you change the code
: above so that waitpid() is called only after the last thread exits WNOHANG
: still can fail.
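
The WNOHANG poll in the snippet above has three outcomes, and a caller has to tell them apart. The sketch below is a user-space illustration only; try_reap() is a hypothetical helper and does not touch the connector interface.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Poll for a child's exit without blocking.
 * Returns 1 and stores the status if the child was reaped,
 * 0 if it is still running or not yet wait()-able,
 * -1 on error (for example, no such child).
 */
static int try_reap(pid_t pid, int *status)
{
	pid_t r = waitpid(pid, status, WNOHANG);

	if (r == pid)
		return 1;
	return r == 0 ? 0 : -1;
}
```

A return value of 0 is the case the commit message is about: the exit event may already have been delivered while the child cannot be reaped yet, so the caller has to retry later rather than assume the child is gone.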

Signed-off-by: Guillaume Morin
Cc: Matt Helsley
Cc: Oleg Nesterov
Cc: David S. Miller
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Guillaume Morin
2014-04-08 07:36:04 +0800

4bcb8232c exit: move check_stack_usage() to the end of do_exit()

It is not clear why check_stack_usage() is called so early and thus it
never checks the stack usage in, say, exit_notify() or
flush_ptrace_hw_breakpoint() or other functions which are only called by
do_exit().

Move the callsite down to the last preempt_disable/schedule.

Signed-off-by: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2014-04-08 07:36:04 +0800

c39df5fa3 exit: call disassociate_ctty() before exit_task_namespaces()

Commit 8aac62706ada ("move exit_task_namespaces() outside of
exit_notify()") breaks pppd and the exiting service crashes the kernel:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
IP: ppp_register_channel+0x13/0x20 [ppp_generic]
Call Trace:
ppp_asynctty_open+0x12b/0x170 [ppp_async]
tty_ldisc_open.isra.2+0x27/0x60
tty_ldisc_hangup+0x1e3/0x220
__tty_hangup+0x2c4/0x440
disassociate_ctty+0x61/0x270
do_exit+0x7f2/0xa50

ppp_register_channel() needs ->net_ns and current->nsproxy == NULL.

Move disassociate_ctty() before exit_task_namespaces(), it doesn't make
sense to delay it after perf_event_exit_task() or cgroup_exit().

This also allows task_work_add() to be used inside the (nontrivial) code
paths in disassociate_ctty().

Investigated by Peter Hurley.

Signed-off-by: Oleg Nesterov
Reported-by: Sree Harsha Totakura
Cc: Peter Hurley
Cc: Sree Harsha Totakura
Cc: "Eric W. Biederman"
Cc: Jeff Dike
Cc: Ingo Molnar
Cc: Andrey Vagin
Cc: Al Viro
Cc: [v3.10+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2014-04-08 07:36:03 +0800

29 Mar, 2014

1 commit

1ec41830e cgroup: remove useless argument from cgroup_exit()

Signed-off-by: Li Zefan
Signed-off-by: Tejun Heo

Li Zefan
2014-03-29 21:15:54 +0800

22 Jan, 2014

1 commit

0c740d0af introduce for_each_thread() to replace the buggy while_each_thread()

while_each_thread() and next_thread() should die, almost every lockless
usage is wrong.

1. Unless g == current, the lockless while_each_thread() is not safe.

while_each_thread(g, t) can loop forever if g exits, next_thread()
can't reach the unhashed thread in this case. Note that this can
happen even if g is the group leader, it can exec.

2. Even if while_each_thread() itself was correct, people often use
it wrongly.

It was never safe to just take rcu_read_lock() and loop unless
you verify that pid_alive(g) == T, even the first next_thread()
can point to the already freed/reused memory.

This patch adds signal_struct->thread_head and task->thread_node to
create the normal rcu-safe list with the stable head. The new
for_each_thread(g, t) helper is always safe under rcu_read_lock() as
long as this task_struct can't go away.

Note: of course it is ugly to have both task_struct->thread_node and the
old task_struct->thread_group, we will kill it later, after we change
the users of while_each_thread() to use for_each_thread().

Perhaps we can kill it even before we convert all users, we can
reimplement next_thread(t) using the new thread_head/thread_node. But
we can't do this right now because this will lead to subtle behavioural
changes. For example, do/while_each_thread() always sees at least one
task, while for_each_thread() can do nothing if the whole thread group
has died. Or thread_group_empty(), currently its semantics is not clear
unless thread_group_leader(p) and we need to audit the callers before we
can change it.

So this patch adds the new interface which has to coexist with the old
one for some time, hopefully the next changes will be more or less
straightforward and the old one will go away soon.
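
The idea of a list with a stable head can be sketched in plain C. This is a simplified, single-threaded illustration: struct group and struct task are hypothetical stand-ins for signal_struct and task_struct, for_each_task() is not the kernel's for_each_thread(), and no RCU is involved.

```c
#include <assert.h>
#include <stddef.h>

struct node { struct node *next, *prev; };

/* Hypothetical stand-ins for signal_struct and task_struct. */
struct group { struct node thread_head; };
struct task  { int tid; struct node thread_node; };

static void group_init(struct group *g)
{
	g->thread_head.next = g->thread_head.prev = &g->thread_head;
}

/* Link a task at the tail of the group's list. */
static void group_add(struct group *g, struct task *t)
{
	struct node *head = &g->thread_head;

	t->thread_node.prev = head->prev;
	t->thread_node.next = head;
	head->prev->next = &t->thread_node;
	head->prev = &t->thread_node;
}

/* Unlink a task; the head stays in place even if the list becomes empty. */
static void group_del(struct task *t)
{
	t->thread_node.prev->next = t->thread_node.next;
	t->thread_node.next->prev = t->thread_node.prev;
}

#define node_to_task(n) \
	((struct task *)((char *)(n) - offsetof(struct task, thread_node)))

/* The walk ends at the head, which is never a task and never unlinked. */
#define for_each_task(g, t) \
	for (struct node *n_ = (g)->thread_head.next; \
	     n_ != &(g)->thread_head && ((t) = node_to_task(n_), 1); \
	     n_ = n_->next)

static int count_threads(struct group *g)
{
	struct task *t;
	int n = 0;

	for_each_task(g, t) {
		(void)t;
		n++;
	}
	return n;
}
```

Because the loop terminates on the head rather than on the starting task, removing the task the walk started from cannot make it spin, and an empty group simply yields zero iterations. That matches the behavioural difference noted above, where do/while_each_thread() always sees at least one task.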

Signed-off-by: Oleg Nesterov
Reviewed-by: Sergey Dyasly
Tested-by: Sergey Dyasly
Reviewed-by: Sameer Nanda
Acked-by: David Rientjes
Cc: "Eric W. Biederman"
Cc: Frederic Weisbecker
Cc: Mandeep Singh Baines
Cc: "Ma, Xindong"
Cc: Michal Hocko
Cc: "Tu, Xiaobing"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2014-01-22 08:19:46 +0800

10 Jul, 2013

1 commit

7c8df2863 ptrace: revert "Prepare to fix racy accesses on task breakpoints"

This reverts commit bf26c018490c ("Prepare to fix racy accesses on task
breakpoints").

The patch was fine but we can no longer race with SIGKILL after commit
9899d11f6544 ("ptrace: ensure arch_ptrace/ptrace_request can never race
with SIGKILL"), the __TASK_TRACED tracee can't be woken up and
->ptrace_bps[] can't go away.

Now that ptrace_get_breakpoints/ptrace_put_breakpoints have no callers,
we can kill them and remove task->ptrace_bp_refcnt.

Signed-off-by: Oleg Nesterov
Acked-by: Frederic Weisbecker
Acked-by: Michael Neuling
Cc: Benjamin Herrenschmidt
Cc: Ingo Molnar
Cc: Jan Kratochvil
Cc: Paul Mackerras
Cc: Paul Mundt
Cc: Will Deacon
Cc: Prasad
Cc: Russell King
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2013-07-10 01:33:26 +0800

04 Jul, 2013

2 commits

7f0ef0267 Merge branch 'akpm' (updates from Andrew Morton)

Merge first patch-bomb from Andrew Morton:
- various misc bits
- I've been patchmonkeying ocfs2 for a while, as Joel and Mark have been
distracted. There has been quite a bit of activity.
- About half the MM queue
- Some backlight bits
- Various lib/ updates
- checkpatch updates
- zillions more little rtc patches
- ptrace
- signals
- exec
- procfs
- rapidio
- nbd
- aoe
- pps
- memstick
- tools/testing/selftests updates

* emailed patches from Andrew Morton: (445 commits)
tools/testing/selftests: don't assume the x bit is set on scripts
selftests: add .gitignore for kcmp
selftests: fix clean target in kcmp Makefile
selftests: add .gitignore for vm
selftests: add hugetlbfstest
self-test: fix make clean
selftests: exit 1 on failure
kernel/resource.c: remove the unneeded assignment in function __find_resource
aio: fix wrong comment in aio_complete()
drivers/w1/slaves/w1_ds2408.c: add magic sequence to disable P0 test mode
drivers/memstick/host/r592.c: convert to module_pci_driver
drivers/memstick/host/jmb38x_ms: convert to module_pci_driver
pps-gpio: add device-tree binding and support
drivers/pps/clients/pps-gpio.c: convert to module_platform_driver
drivers/pps/clients/pps-gpio.c: convert to devm_* helpers
drivers/parport/share.c: use kzalloc
Documentation/accounting/getdelays.c: avoid strncpy in accounting tool
aoe: update internal version number to v83
aoe: update copyright date
aoe: perform I/O completions in parallel
...

Linus Torvalds
2013-07-04 08:12:13 +0800

81dabb464 exit.c: unexport __set_special_pids()

Move __set_special_pids() from exit.c to sys.c close to its single caller
and make it static.

And rename it to set_special_pids(), another helper with this name has
gone away.

Signed-off-by: Oleg Nesterov
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2013-07-04 07:08:02 +0800

28 Jun, 2013

1 commit

207bc1181 Merge branch 'freezer'

* freezer:
af_unix: use freezable blocking calls in read
sigtimedwait: use freezable blocking call
nanosleep: use freezable blocking call
futex: use freezable blocking call
select: use freezable blocking call
epoll: use freezable blocking call
binder: use freezable blocking calls
freezer: add new freezable helpers using freezer_do_not_count()
freezer: convert freezable helpers to static inline where possible
freezer: convert freezable helpers to freezer_do_not_count()
freezer: skip waking up tasks with PF_FREEZER_SKIP set
freezer: shorten freezer sleep time using exponential backoff
lockdep: check that no locks held at freeze time
lockdep: remove task argument from debug_check_no_locks_held
freezer: add unsafe versions of freezable helpers for CIFS
freezer: add unsafe versions of freezable helpers for NFS

Rafael J. Wysocki
2013-06-28 19:00:53 +0800

15 Jun, 2013

1 commit

8aac62706 move exit_task_namespaces() outside of exit_notify()

exit_notify() does exit_task_namespaces() after
forget_original_parent(). This was needed to ensure that ->nsproxy
can't be cleared prematurely, an exiting child we are going to
reparent can do do_notify_parent() and use the parent's (ours) pid_ns.

However, after 32084504 "pidns: use task_active_pid_ns in
do_notify_parent" ->nsproxy != NULL is no longer needed, we rely
on task_active_pid_ns().

Move exit_task_namespaces() from exit_notify() to do_exit(), after
exit_fs() and before exit_task_work().

This solves the problem reported by Andrey, free_ipc_ns()->shm_destroy()
does fput() which needs task_work_add().

Note: this particular problem can be fixed if we change fput(), and
that change makes sense anyway. But there is another reason to move
the callsite. The original reason for exit_task_namespaces() from
the middle of exit_notify() was subtle and it has already gone away,
now this looks confusing. And this allows us to simplify exit_notify(),
we can avoid unlock/lock(tasklist) and we can use ->exit_state instead
of PF_EXITING in forget_original_parent().

Reported-by: Andrey Vagin
Signed-off-by: Oleg Nesterov
Acked-by: "Eric W. Biederman"
Acked-by: Andrey Vagin
Signed-off-by: Al Viro

Oleg Nesterov
2013-06-15 09:39:08 +0800

12 May, 2013

1 commit

1b1d2fb44 lockdep: remove task argument from debug_check_no_locks_held

The only existing caller to debug_check_no_locks_held calls it
with 'current' as the task, and the freezer needs to call
debug_check_no_locks_held but doesn't already have a current
task pointer, so remove the argument. It is already assuming
that the current task is relevant by dumping the current stack
trace as part of the warning.

This was originally part of 6aa9707099c (lockdep: check that
no locks held at freeze time) which was reverted in
dbf520a9d7d4.

Original-author: Mandeep Singh Baines
Acked-by: Pavel Machek
Acked-by: Tejun Heo
Signed-off-by: Colin Cross
Signed-off-by: Rafael J. Wysocki

Colin Cross
2013-05-12 20:16:21 +0800

02 May, 2013

1 commit

20b4fb485 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull VFS updates from Al Viro,

Misc cleanups all over the place, mainly wrt /proc interfaces (switch
create_proc_entry to proc_create(), get rid of the deprecated
create_proc_read_entry() in favor of using proc_create_data() and
seq_file etc).

7kloc removed.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
don't bother with deferred freeing of fdtables
proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
proc: Make the PROC_I() and PDE() macros internal to procfs
proc: Supply a function to remove a proc entry by PDE
take cgroup_open() and cpuset_open() to fs/proc/base.c
ppc: Clean up scanlog
ppc: Clean up rtas_flash driver somewhat
hostap: proc: Use remove_proc_subtree()
drm: proc: Use remove_proc_subtree()
drm: proc: Use minor->index to label things, not PDE->name
drm: Constify drm_proc_list[]
zoran: Don't print proc_dir_entry data in debug
reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
proc: Supply an accessor for getting the data from a PDE's parent
airo: Use remove_proc_subtree()
rtl8192u: Don't need to save device proc dir PDE
rtl8187se: Use a dir under /proc/net/r8180/
proc: Add proc_mkdir_data()
proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
proc: Move PDE_NET() to fs/proc/proc_net.c
...

Linus Torvalds
2013-05-02 08:51:54 +0800

01 May, 2013

1 commit

08d767608 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal

Pull compat cleanup from Al Viro:
"Mostly about syscall wrappers this time; there will be another pile
with patches in the same general area from various people, but I'd
rather push those after both that and vfs.git pile are in."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
syscalls.h: slightly reduce the jungles of macros
get rid of union semop in sys_semctl(2) arguments
make do_mremap() static
sparc: no need to sign-extend in sync_file_range() wrapper
ppc compat wrappers for add_key(2) and request_key(2) are pointless
x86: trim sys_ia32.h
x86: sys32_kill and sys32_mprotect are pointless
get rid of compat_sys_semctl() and friends in case of ARCH_WANT_OLD_COMPAT_IPC
merge compat sys_ipc instances
consolidate compat lookup_dcookie()
convert vmsplice to COMPAT_SYSCALL_DEFINE
switch getrusage() to COMPAT_SYSCALL_DEFINE
switch epoll_pwait to COMPAT_SYSCALL_DEFINE
convert sendfile{,64} to COMPAT_SYSCALL_DEFINE
switch signalfd{,4}() to COMPAT_SYSCALL_DEFINE
make SYSCALL_DEFINE-generated wrappers do asmlinkage_protect
make HAVE_SYSCALL_WRAPPERS unconditional
consolidate cond_syscall and SYSCALL_ALIAS declarations
teach SYSCALL_DEFINE how to deal with long long/unsigned long long
get rid of duplicate logics in __SC_....[1-6] definitions

Linus Torvalds
2013-05-01 22:21:43 +0800

10 Apr, 2013

1 commit

4b8a8f1e4 get rid of the last free_pipe_info() callers

and rename __free_pipe_info() to free_pipe_info()

Signed-off-by: Al Viro

Al Viro
2013-04-10 02:13:02 +0800

01 Apr, 2013

1 commit

dbf520a9d Revert "lockdep: check that no locks held at freeze time"

This reverts commit 6aa9707099c4b25700940eb3d016f16c4434360d.

Commit 6aa9707099c4 ("lockdep: check that no locks held at freeze time")
causes problems with NFS root filesystems. The failures were noticed on
OMAP2 and 3 boards during kernel init:

[ BUG: swapper/0/1 still has locks held! ]
3.9.0-rc3-00344-ga937536 #1 Not tainted
-------------------------------------
1 lock held by swapper/0/1:
#0: (&type->s_umount_key#13/1){+.+.+.}, at: [] sget+0x248/0x574

stack backtrace:
rpc_wait_bit_killable
__wait_on_bit
out_of_line_wait_on_bit
__rpc_execute
rpc_run_task
rpc_call_sync
nfs_proc_get_root
nfs_get_root
nfs_fs_mount_common
nfs_try_mount
nfs_fs_mount
mount_fs
vfs_kern_mount
do_mount
sys_mount
do_mount_root
mount_root
prepare_namespace
kernel_init_freeable
kernel_init

Although the rootfs mounts, the system is unstable. Here's a transcript
from a PM test:

http://www.pwsan.com/omap/testlogs/test_v3.9-rc3/20130317194234/pm/37xxevm/37xxevm_log.txt

Here's what the test log should look like:

http://www.pwsan.com/omap/testlogs/test_v3.8/20130218214403/pm/37xxevm/37xxevm_log.txt

Mailing list discussion is here:

http://lkml.org/lkml/2013/3/4/221

Deal with this for v3.9 by reverting the problem commit, until folks can
figure out the right long-term course of action.

Signed-off-by: Paul Walmsley
Cc: Mandeep Singh Baines
Cc: Jeff Layton
Cc: Shawn Guo
Cc:
Cc: Fengguang Wu
Cc: Trond Myklebust
Cc: Ingo Molnar
Cc: Ben Chan
Cc: Oleg Nesterov
Cc: Tejun Heo
Cc: Rafael J. Wysocki
Cc: Andrew Morton
Signed-off-by: Linus Torvalds

Paul Walmsley
2013-04-01 02:38:33 +0800

04 Mar, 2013

1 commit

2cf096668 make SYSCALL_DEFINE<n>-generated wrappers do asmlinkage_protect

... and switch i386 to HAVE_SYSCALL_WRAPPERS, killing open-coded
uses of asmlinkage_protect() in a bunch of syscalls.

Signed-off-by: Al Viro

Al Viro
2013-03-04 11:58:33 +0800

28 Feb, 2013

2 commits

80d26af89 coredump: use a freezable_schedule for the coredump_finish wait

Prevents hung_task detector from panicking the machine. This is also
needed to prevent this wait from blocking suspend.

(It doesn't currently block suspend but it would once the next
patch in this series is applied.)

[yongjun_wei@trendmicro.com.cn: kernel/exit.c: remove duplicated include]
Signed-off-by: Mandeep Singh Baines
Cc: Ben Chan
Cc: Oleg Nesterov
Cc: Tejun Heo
Cc: Rafael J. Wysocki
Cc: Ingo Molnar
Signed-off-by: Wei Yongjun
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mandeep Singh Baines
2013-02-28 11:10:11 +0800

6aa970709 lockdep: check that no locks held at freeze time

We shouldn't try_to_freeze if locks are held. Holding a lock can cause a
deadlock if the lock is later acquired in the suspend or hibernate path
(e.g. by dpm). Holding a lock can also cause a deadlock in the case of
cgroup_freezer if a lock is held inside a frozen cgroup that is later
acquired by a process outside that group.

[akpm@linux-foundation.org: export debug_check_no_locks_held]
Signed-off-by: Mandeep Singh Baines
Cc: Ben Chan
Cc: Oleg Nesterov
Cc: Tejun Heo
Cc: Rafael J. Wysocki
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mandeep Singh Baines
2013-02-28 11:10:11 +0800

28 Jan, 2013

1 commit

6fac4829c cputime: Use accessors to read task cputime stats

This is in preparation for the full dynticks feature. While
remotely reading the cputime of a task running in a full
dynticks CPU, we'll need to do some extra-computation. This
way we can account the time it spent tickless in userspace
since its last cputime snapshot.
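
The accessor idea can be sketched like this. It is a hypothetical user-space model only: struct task_times, its fields and read_task_cputime() are invented for illustration and do not reflect the kernel's data layout or function signatures.

```c
#include <assert.h>

/* Hypothetical, simplified per-task accounting; not the kernel's layout. */
struct task_times {
	unsigned long long utime;	/* user time at the last snapshot */
	unsigned long long stime;	/* system time at the last snapshot */
	unsigned long long snapshot;	/* clock value when the snapshot was taken */
	int tickless_user;		/* nonzero while running tickless in userspace */
};

/*
 * Read a task's cputime through one accessor instead of touching the
 * fields directly. If the task has been running tickless in userspace,
 * the time elapsed since the snapshot is added to the reported user time.
 */
static void read_task_cputime(const struct task_times *t,
			      unsigned long long now,
			      unsigned long long *utime,
			      unsigned long long *stime)
{
	unsigned long long pending = 0;

	if (t->tickless_user && now > t->snapshot)
		pending = now - t->snapshot;

	*utime = t->utime + pending;
	*stime = t->stime;
}
```

Routing every reader through one function means the extra computation lives in a single place, and callers that used to read the raw fields need no knowledge of the tickless case.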

Signed-off-by: Frederic Weisbecker
Cc: Andrew Morton
Cc: Ingo Molnar
Cc: Li Zhong
Cc: Namhyung Kim
Cc: Paul E. McKenney
Cc: Paul Gortmaker
Cc: Peter Zijlstra
Cc: Steven Rostedt
Cc: Thomas Gleixner

Frederic Weisbecker
2013-01-28 02:23:31 +0800

18 Dec, 2012

1 commit

6a2b60b17 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace ... Browse Code »

Pull user namespace changes from Eric Biederman:
"While small this set of changes is very significant with respect to
containers in general and user namespaces in particular. The user
space interface is now complete.

This set of changes adds support for unprivileged users to create user
namespaces and as a user namespace root to create other namespaces.
The tyranny of supporting suid root preventing unprivileged users from
using cool new kernel features is broken.

This set of changes completes the work on setns, adding support for
the pid, user, mount namespaces.

This set of changes includes a bunch of basic pid namespace
cleanups/simplifications. Of particular significance is the rework of
the pid namespace cleanup so it no longer requires sending out
tendrils into all kinds of unexpected cleanup paths for operation. At
least one case of broken error handling is fixed by this cleanup.

The files under /proc/<pid>/ns/ have been converted from regular files
to magic symlinks which prevents incorrect caching by the VFS,
ensuring the files always refer to the namespace the process is
currently using and ensuring that the ptrace_may_access permission
checks are always applied.

The files under /proc/<pid>/ns/ have been given stable inode numbers
so it is now possible to see if different processes share the same
namespaces.

Through David Miller's net tree are changes to relax many of the
permission checks in the networking stack to allow the user
namespace root to usefully use the networking stack. Similar changes
for the mount namespace and the pid namespace are coming through my
tree.

Two small changes to add user namespace support were committed here and
in David Miller's -net tree so that I could complete the work on the
/proc/<pid>/ns/ files in this tree.

Work remains to make it safe to build user namespaces and 9p, afs,
ceph, cifs, coda, gfs2, ncpfs, nfs, nfsd, ocfs2, and xfs so the
Kconfig guard remains in place preventing user namespaces from
being built when any of those filesystems are enabled.

Future design work remains to allow root users outside of the initial
user namespace to mount more than just /proc and /sys."
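
With stable inode numbers on the per-process ns files, userspace can check whether two processes share a namespace by comparing the device and inode numbers reported by stat(2). A minimal sketch, assuming a Linux system with /proc mounted; the demo_* helper names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Compare the namespace files of two processes, e.g. ns = "pid".
 * Returns 0 and sets *same on success, -1 if either file cannot be
 * examined (no such process, no permission, /proc not mounted, ...).
 */
static int demo_same_namespace(pid_t a, pid_t b, const char *ns, int *same)
{
	char path_a[64], path_b[64];
	struct stat st_a, st_b;
	int n;

	assert(ns != NULL && same != NULL);
	n = snprintf(path_a, sizeof(path_a), "/proc/%ld/ns/%s", (long)a, ns);
	if (n < 0 || (size_t)n >= sizeof(path_a))
		return -1;
	n = snprintf(path_b, sizeof(path_b), "/proc/%ld/ns/%s", (long)b, ns);
	if (n < 0 || (size_t)n >= sizeof(path_b))
		return -1;
	if (stat(path_a, &st_a) != 0 || stat(path_b, &st_b) != 0)
		return -1;
	*same = st_a.st_dev == st_b.st_dev && st_a.st_ino == st_b.st_ino;
	return 0;
}

/* Convenience wrapper: does this process share `ns` with its parent? */
static int demo_shares_namespace_with_parent(const char *ns, int *same)
{
	return demo_same_namespace(getpid(), getppid(), ns, same);
}
```

Examining another user's process may fail with a permission error, in which case the helper returns -1 rather than guessing.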

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (38 commits)
proc: Usable inode numbers for the namespace file descriptors.
proc: Fix the namespace inode permission checks.
proc: Generalize proc inode allocation
userns: Allow unprivilged mounts of proc and sysfs
userns: For /proc/self/{uid,gid}_map derive the lower userns from the struct file
procfs: Print task uids and gids in the userns that opened the proc file
userns: Implement unshare of the user namespace
userns: Implent proc namespace operations
userns: Kill task_user_ns
userns: Make create_new_namespaces take a user_ns parameter
userns: Allow unprivileged use of setns.
userns: Allow unprivileged users to create new namespaces
userns: Allow setting a userns mapping to your current uid.
userns: Allow chown and setgid preservation
userns: Allow unprivileged users to create user namespaces.
userns: Ignore suid and sgid on binaries if the uid or gid can not be mapped
userns: fix return value on mntns_install() failure
vfs: Allow unprivileged manipulation of the mount namespace.
vfs: Only support slave subtrees across different user namespaces
vfs: Add a user namespace reference from struct mnt_namespace
...

Linus Torvalds
2012-12-18 07:44:47 +0800

13 Dec, 2012

1 commit

9977d9b37 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal ... Browse Code »

Pull big execve/kernel_thread/fork unification series from Al Viro:
"All architectures are converted to new model. Quite a bit of that
stuff is actually shared with architecture trees; in such cases it's
literally shared branch pulled by both, not a cherry-pick.

A lot of ugliness and black magic is gone (-3KLoC total in this one):

- kernel_thread()/kernel_execve()/sys_execve() redesign.

We don't do syscalls from kernel anymore for either kernel_thread()
or kernel_execve():

kernel_thread() is essentially clone(2) with callback run before we
return to userland, the callbacks either never return or do
successful do_execve() before returning.

kernel_execve() is a wrapper for do_execve() - it doesn't need to
do transition to user mode anymore.

As a result kernel_thread() and kernel_execve() are
arch-independent now - they live in kernel/fork.c and fs/exec.c
resp. sys_execve() is also in fs/exec.c and it's completely
architecture-independent.

- daemonize() is gone, along with its parts in fs/*.c

- struct pt_regs * is no longer passed to do_fork/copy_process/
copy_thread/do_execve/search_binary_handler/->load_binary/do_coredump.

- sys_fork()/sys_vfork()/sys_clone() unified; some architectures
still need wrappers (ones with callee-saved registers not saved in
pt_regs on syscall entry), but the main part of those suckers is in
kernel/fork.c now."
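
As a loose userspace analogy for "clone(2) with a callback", the sketch below creates a child with fork(2), runs a callback in it, and never lets the child return into the caller's code. It is not the kernel implementation, and the demo_* names are hypothetical.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Create a child, run fn(arg) in it, and return the child's exit
 * status to the parent (or -1 on failure). The child terminates with
 * _exit() instead of returning into the caller's code.
 */
static int demo_spawn_with_callback(int (*fn)(void *), void *arg)
{
	pid_t pid;
	int status;

	assert(fn != NULL);
	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0)
		_exit(fn(arg) & 0xff);	/* child: callback, then exit */
	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* Example callback: report the integer it was given as its result. */
static int demo_callback(void *arg)
{
	return *(int *)arg;
}
```

A callback that instead ends in a successful exec*() call corresponds to the other case in the description, where the new task continues as a different program.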

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: (113 commits)
do_coredump(): get rid of pt_regs argument
print_fatal_signal(): get rid of pt_regs argument
ptrace_signal(): get rid of unused arguments
get rid of ptrace_signal_deliver() arguments
new helper: signal_pt_regs()
unify default ptrace_signal_deliver
flagday: kill pt_regs argument of do_fork()
death to idle_regs()
don't pass regs to copy_process()
flagday: don't pass regs to copy_thread()
bfin: switch to generic vfork, get rid of pointless wrappers
xtensa: switch to generic clone()
openrisc: switch to use of generic fork and clone
unicore32: switch to generic clone(2)
score: switch to generic fork/vfork/clone
c6x: sanitize copy_thread(), get rid of clone(2) wrapper, switch to generic clone()
take sys_fork/sys_vfork/sys_clone prototypes to linux/syscalls.h
mn10300: switch to generic fork/vfork/clone
h8300: switch to generic fork/vfork/clone
tile: switch to generic clone()
...

Conflicts:
arch/microblaze/include/asm/Kbuild

Linus Torvalds
2012-12-13 04:22:13 +0800

29 Nov, 2012

2 commits

c4144670f kill daemonize() ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2012-11-29 10:49:02 +0800
e80d0a1ae cputime: Rename thread_group_times to thread_group_cputime_adjusted ... Browse Code »

We have thread_group_cputime() and thread_group_times(). The naming
doesn't provide enough information about the difference between
these two APIs.

To lower the confusion, rename thread_group_times() to
thread_group_cputime_adjusted(). This name better suggests that
it's a version of thread_group_cputime() that does some stabilization
on the raw cputime values, i.e. here: scale on top of CFS runtime
stats and bound the lower value for monotonicity.
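
The two stabilization steps, scaling and bounding, can be sketched as follows. This is a simplified model under assumed semantics: the tick-sampled utime/stime split is rescaled so that the total matches the precise runtime, and the reported values never decrease. The demo_* names are hypothetical, and overflow of the multiplication is ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Previously reported values, kept so results never go backwards. */
struct demo_prev_cputime {
	uint64_t utime;
	uint64_t stime;
};

/*
 * utime/stime: tick-sampled user/system time (imprecise split).
 * rtime: precise total runtime from the scheduler.
 * Assumes rtime * utime does not overflow 64 bits.
 */
static void demo_cputime_adjust(uint64_t utime, uint64_t stime, uint64_t rtime,
				struct demo_prev_cputime *prev,
				uint64_t *ut, uint64_t *st)
{
	uint64_t total = utime + stime;
	uint64_t scaled_utime = rtime;
	uint64_t scaled_stime;

	assert(prev && ut && st);
	/* Scale: keep the sampled ratio, but make the sum match rtime. */
	if (total)
		scaled_utime = rtime * utime / total;

	/* Bound: never report less than what was reported before. */
	if (scaled_utime > prev->utime)
		prev->utime = scaled_utime;
	scaled_stime = rtime > prev->utime ? rtime - prev->utime : 0;
	if (scaled_stime > prev->stime)
		prev->stime = scaled_stime;

	*ut = prev->utime;
	*st = prev->stime;
}
```

Without the bounding step, a shift in the sampled ratio could make the reported user time drop between two reads even though the task kept running.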

Signed-off-by: Frederic Weisbecker
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Steven Rostedt
Cc: Paul Gortmaker

Frederic Weisbecker
2012-11-29 00:07:57 +0800

19 Nov, 2012

1 commit

af4b8a83a pidns: Wait in zap_pid_ns_processes until pid_ns->nr_hashed == 1 ... Browse Code »

Looking at pid_ns->nr_hashed is a bit simpler and it works for
disjoint process trees that an unshare or a join of a pid_namespace
may create.

Acked-by: "Serge E. Hallyn"
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2012-11-19 21:59:12 +0800

03 Oct, 2012

1 commit

aab174f0d Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull vfs update from Al Viro:

- big one - consolidation of descriptor-related logics; almost all of
that is moved to fs/file.c

(BTW, I'm seriously tempted to rename the result to fd.c. As it is,
we have a situation when file_table.c is about handling of struct
file and file.c is about handling of descriptor tables; the reasons
are historical - file_table.c used to be about a static array of
struct file we used to have way back).

A lot of stray ends got cleaned up and converted to saner primitives,
disgusting mess in android/binder.c is still disgusting, but at least
doesn't poke so much in descriptor table guts anymore. A bunch of
relatively minor races got fixed in process, plus an ext4 struct file
leak.

- related thing - fget_light() partially unuglified; see fdget() in
there (and yes, it generates the code as good as we used to have).

- also related - bits of Cyrill's procfs stuff that got entangled into
that work; _not_ all of it, just the initial move to fs/proc/fd.c and
switch of fdinfo to seq_file.

- Alex's fs/coredump.c splitoff - the same story, had been easier to
take that commit than mess with conflicts. The rest is a separate
pile, this was just a mechanical code movement.

- a few misc patches all over the place. Not all for this cycle,
there'll be more (and quite a few currently sit in akpm's tree).

Fix up trivial conflicts in the android binder driver, and some fairly
simple conflicts due to two different changes to the sock_alloc_file()
interface ("take descriptor handling from sock_alloc_file() to callers"
vs "net: Providing protocol type via system.sockprotoname xattr of
/proc/PID/fd entries" adding a dentry name to the socket)

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits)
MAX_LFS_FILESIZE should be a loff_t
compat: fs: Generic compat_sys_sendfile implementation
fs: push rcu_barrier() from deactivate_locked_super() to filesystems
btrfs: reada_extent doesn't need kref for refcount
coredump: move core dump functionality into its own file
coredump: prevent double-free on an error path in core dumper
usb/gadget: fix misannotations
fcntl: fix misannotations
ceph: don't abuse d_delete() on failure exits
hypfs: ->d_parent is never NULL or negative
vfs: delete surplus inode NULL check
switch simple cases of fget_light to fdget
new helpers: fdget()/fdput()
switch o2hb_region_dev_write() to fget_light()
proc_map_files_readdir(): don't bother with grabbing files
make get_file() return its argument
vhost_set_vring(): turn pollstart/pollstop into bool
switch prctl_set_mm_exe_file() to fget_light()
switch xfs_find_handle() to fget_light()
switch xfs_swapext() to fget_light()
...

Linus Torvalds
2012-10-03 11:25:04 +0800

27 Sep, 2012

2 commits

864bdb3b6 new helper: daemonize_descriptors() ... Browse Code »

descriptor-related parts of daemonize, done right. As the
result we simplify the locking rules for ->files - we
hold task_lock in *all* cases when we modify ->files.

Signed-off-by: Al Viro

Al Viro
2012-09-27 09:10:00 +0800
7cf4dc3c8 move files_struct-related bits from kernel/exit.c to fs/file.c ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2012-09-27 09:08:54 +0800

25 Sep, 2012

1 commit

5640f7685 net: use a per task frag allocator ... Browse Code »

We currently use a per socket order-0 page cache for tcp_sendmsg()
operations.

This page is used to build fragments for skbs.

It's done to increase the probability of coalescing small write() calls into
single segments in skbs still in the write queue (not yet sent).

But it wastes a lot of memory for applications handling many mostly
idle sockets, since each socket holds one page in sk->sk_sndmsg_page.

It's also quite inefficient to build TSO 64KB packets, because we need
about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit
page allocator more than wanted.

This patch adds a per task frag allocator and uses bigger pages,
if available. An automatic fallback is done in case of memory pressure.

(up to 32768 bytes per frag, that's order-3 pages on x86)

This increases TCP stream performance by 20% on loopback device,
but also benefits other network devices, since 8x fewer frags are
mapped on transmit and unmapped on tx completion. Alexander Duyck
mentioned a probable performance win on systems with IOMMU enabled.

It's possible some SG enabled hardware can't cope with bigger fragments,
but their ndo_start_xmit() should already handle this, splitting a
fragment in sub fragments, since some arches have PAGE_SIZE=65536.

Successfully tested on various ethernet devices.
(ixgbe, igb, bnx2x, tg3, mellanox mlx4)
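
The fragment arithmetic above, and the idea of carving consecutive fragments out of one larger per-task buffer, can be sketched as follows. This is a hypothetical userspace model that tracks offsets only; the demo_* and DEMO_* names are invented, and the kernel's page reference counting and memory-pressure fallback are omitted.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE	 4096u	/* order-0 page, as in the text */
#define DEMO_FRAG_SIZE	32768u	/* order-3 buffer, as in the text */
#define DEMO_NEED_REFILL ((size_t)-1)

/* Number of fragments needed to hold `payload` bytes. */
static size_t demo_frags_needed(size_t payload, size_t frag_size)
{
	assert(frag_size > 0);
	return (payload + frag_size - 1) / frag_size;
}

/* Hypothetical per-task fragment state: one buffer, consumed front to back. */
struct demo_page_frag {
	size_t size;	/* bytes in the current buffer */
	size_t offset;	/* first unused byte */
};

/*
 * Carve `len` bytes from the current buffer. Returns the start offset
 * of the new fragment, or DEMO_NEED_REFILL if the buffer is too full.
 * Consecutive carves are contiguous, so small writes can be coalesced.
 */
static size_t demo_frag_carve(struct demo_page_frag *pf, size_t len)
{
	size_t start;

	assert(pf->offset <= pf->size);
	if (len > pf->size - pf->offset)
		return DEMO_NEED_REFILL;
	start = pf->offset;
	pf->offset += len;
	return start;
}
```

For a 64KB payload, demo_frags_needed() gives 16 fragments with DEMO_PAGE_SIZE and 2 with DEMO_FRAG_SIZE, which is the 8x reduction the message refers to.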

Signed-off-by: Eric Dumazet
Cc: Ben Hutchings
Cc: Vijay Subramanian
Cc: Alexander Duyck
Tested-by: Vijay Subramanian
Signed-off-by: David S. Miller

Eric Dumazet
2012-09-25 04:31:37 +0800