11 May, 2007
3 commits
-
This patch series implements the new signalfd() system call.
I took part of the original Linus code (and you know how badly it can be
broken :), and I added even more breakage ;) Signals are fetched from the same
signal queue used by the process, so signalfd will compete with standard
kernel delivery in dequeue_signal(). If you want to reliably fetch signals on
the signalfd file, you need to block them with sigprocmask(SIG_BLOCK). This
seems to be working fine on my Dual Opteron machine. I made a quick test
program for it:http://www.xmailserver.org/signafd-test.c
The signalfd() system call implements signal delivery into a file descriptor
receiver. The signalfd file descriptor if created with the following API:int signalfd(int ufd, const sigset_t *mask, size_t masksize);
The "ufd" parameter allows to change an existing signalfd sigmask, w/out going
to close/create cycle (Linus idea). Use "ufd" == -1 if you want a brand new
signalfd file.The "mask" allows to specify the signal mask of signals that we are interested
in. The "masksize" parameter is the size of "mask".The signalfd fd supports the poll(2) and read(2) system calls. The poll(2)
will return POLLIN when signals are available to be dequeued. As a direct
consequence of supporting the Linux poll subsystem, the signalfd fd can use
used together with epoll(2) too.The read(2) system call will return a "struct signalfd_siginfo" structure in
the userspace supplied buffer. The return value is the number of bytes copied
in the supplied buffer, or -1 in case of error. The read(2) call can also
return 0, in case the sighand structure to which the signalfd was attached,
has been orphaned. The O_NONBLOCK flag is also supported, and read(2) will
return -EAGAIN in case no signal is available.If the size of the buffer passed to read(2) is lower than sizeof(struct
signalfd_siginfo), -EINVAL is returned. A read from the signalfd can also
return -ERESTARTSYS in case a signal hits the process. The format of the
struct signalfd_siginfo is, and the valid fields depends of the (->code &
__SI_MASK) value, in the same way a struct siginfo would:struct signalfd_siginfo {
__u32 signo; /* si_signo */
__s32 err; /* si_errno */
__s32 code; /* si_code */
__u32 pid; /* si_pid */
__u32 uid; /* si_uid */
__s32 fd; /* si_fd */
__u32 tid; /* si_fd */
__u32 band; /* si_band */
__u32 overrun; /* si_overrun */
__u32 trapno; /* si_trapno */
__s32 status; /* si_status */
__s32 svint; /* si_int */
__u64 svptr; /* si_ptr */
__u64 utime; /* si_utime */
__u64 stime; /* si_stime */
__u64 addr; /* si_addr */
};[akpm@linux-foundation.org: fix signalfd_copyinfo() on i386]
Signed-off-by: Davide Libenzi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
attach_pid() currently takes a pid_t and then uses find_pid() to find the
corresponding struct pid. Sometimes we already have the struct pid. We can
then skip find_pid() if attach_pid() were to take a struct pid parameter.Signed-off-by: Sukadev Bhattiprolu
Cc: Cedric Le Goater
Cc: Dave Hansen
Cc: Serge Hallyn
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
If CONFIG_TASK_IO_ACCOUNTING is defined, we update io accounting counters for
each task.This patch permits reporting of values using the well known getrusage()
syscall, filling ru_inblock and ru_oublock instead of null values.As TASK_IO_ACCOUNTING currently counts bytes counts, we approximate blocks
count doing : nr_blocks = nr_bytes / 512Example of use :
----------------------
After patch is applied, /usr/bin/time command can now give a good
approximation of IO that the process had to do.$ /usr/bin/time grep tototo /usr/include/*
Command exited with non-zero status 1
0.00user 0.02system 0:02.11elapsed 1%CPU (0avgtext+0avgdata 0maxresident)k
24288inputs+0outputs (0major+259minor)pagefaults 0swaps$ /usr/bin/time dd if=/dev/zero of=/tmp/testfile count=1000
1000+0 enregistrements lus
1000+0 enregistrements écrits
512000 octets (512 kB) copiés, 0,00326601 seconde, 157 MB/s
0.00user 0.00system 0:00.00elapsed 80%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+3000outputs (0major+299minor)pagefaults 0swapsSigned-off-by: Eric Dumazet
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
10 May, 2007
2 commits
-
Currently kernel threads use sigprocmask(SIG_BLOCK) to protect against
signals. This doesn't prevent the signal delivery, this only blocks
signal_wake_up(). Every "killall -33 kthreadd" means a "struct siginfo"
leak.Change kthreadd_setup() to set all handlers to SIG_IGN instead of blocking
them (make a new helper ignore_signals() for that). If the kernel thread
needs some signal, it should use allow_signal() anyway, and in that case it
should not use CLONE_SIGHAND.Note that we can't change daemonize() (should die!) in the same way,
because it can be used along with CLONE_SIGHAND. This means that
allow_signal() still should unblock the signal to work correctly with
daemonize()ed threads.However, disallow_signal() doesn't block the signal any longer but ignores
it.NOTE: with or without this patch the kernel threads are not protected from
handle_stop_signal(), this seems harmless, but not good.Signed-off-by: Oleg Nesterov
Acked-by: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
When a kernel thread calls daemonize, instead of reparenting the thread to
init reparent the thread to kthreadd next to the threads created by
kthread_create.This is really just a stop gap until daemonize goes away, but it does
ensure no kernel threads are under init and they are all in one place that
is easy to find.Signed-off-by: Eric W. Biederman
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
09 May, 2007
1 commit
-
Remove includes of where it is not used/needed.
Suggested by Al Viro.Builds cleanly on x86_64, i386, alpha, ia64, powerpc, sparc,
sparc64, and arm (all 59 defconfigs).Signed-off-by: Randy Dunlap
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
08 May, 2007
1 commit
-
wait* syscalls return -ECHILD even when an individual PID of a live child
was requested explicitly, when security_task_wait denies the operation.
This means that something like a broken SELinux policy can produce an
unexpected failure that looks just like a bug with wait or ptrace or
something.This patch makes do_wait return -EACCES (or other appropriate error returned
from security_task_wait() instead of -ECHILD if some children were ruled out
solely because security_task_wait failed.[jmorris@namei.org: switch error code to EACCES]
Signed-off-by: Roland McGrath
Acked-by: Stephen Smalley
Cc: Chris Wright
Cc: James Morris
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
29 Mar, 2007
1 commit
-
In commit 0475ac0845f9295bc5f69af45f58dff2c104c8d1 when converting the
orphaned process group handling to use struct pid I made a small
mistake. I accidentally replaced an == with a !=.Besides just being a dumb thing to do apparently this has a bad side
effect. The improper orphaned process group detection causes kwin to
die after a suspend/resume cycle.I'm amazed this patch has been around as long as it has without anyone
else noticing something funny going on.And the following people deserve credit for spotting and helping
to reproduce this.Thanks to: Sid Boyce
Thanks to: "Michael Wu"Signed-off-by: "Eric W. Biederman"
Signed-off-by: Linus Torvalds
13 Feb, 2007
4 commits
-
Every call to is_orphaned_pgrp passed in process_group(current) which is racy
with respect to another thread changing our process group. It didn't bite us
because we were dealing with integers and the worse we would get would be a
stale answer.In switching the checks to use struct pid to be a little more efficient and
prepare the way for pid namespaces this race became apparent.So I simplified the calls to the more specialized is_current_pgrp_orphaned so
I didn't have to worry about making logic changes to avoid the race.Signed-off-by: Eric W. Biederman
Cc: Alan Cox
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Modify has_stopped_jobs and will_become_orphan_pgrp to use struct pid based
process groups. This reduces the number of hash tables looks ups and paves
the way for multiple pid spaces.Signed-off-by: Eric W. Biederman
Cc: Alan Cox
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
To properly implement a pid namespace I need to deal exclusively in terms of
struct pid, because pid_t values become ambiguous.To this end session_of_pgrp is transformed to take and return a struct pid
pointer. To avoid the need to worry about reference counting I now require my
caller to hold the appropriate locks. Leaving callers repsonsible for
increasing the reference count if they need access to the result outside of
the locks.Since session_of_pgrp currently only has one caller and that caller simply
uses only test the result for equality with another process group, the locking
change means I don't actually have to acquire the tasklist_lock at all.tiocspgrp is also modified to take and release the lock. The logic there is a
little more complicated but nothing I won't need when I convert pgrp of a tty
to a struct pid pointer.Signed-off-by: Eric W. Biederman
Cc: Alan Cox
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
close_files() can sometimes take long enough to trigger the soft lockup
detector.Cc: Eric Dumazet
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
12 Feb, 2007
1 commit
-
A variety of (mostly) innocuous fixes to the embedded kernel-doc content in
source files, including:* make multi-line initial descriptions single line
* denote some function names, constants and structs as such
* change erroneous opening '/*' to '/**' in a few places
* reword some text for claritySigned-off-by: Robert P. J. Day
Cc: "Randy.Dunlap"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
31 Jan, 2007
3 commits
-
This is based on a patch by Eric W. Biederman, who pointed out that pid
namespaces are still fake, and we only have one ever active.So for the time being, we can modify any code which could access
tsk->nsproxy->pid_ns during task exit to just use &init_pid_ns instead,
and move the exit_task_namespaces call in do_exit() back above
exit_notify(), so that an exiting nfs server has a valid tsk->sighand to
work with.Long term, pulling pid_ns out of nsproxy might be the cleanest solution.
Signed-off-by: Eric W. Biederman
[ Eric's patch fixed to take care of free_pid() too ]
Signed-off-by: Serge E. Hallyn
Signed-off-by: Linus Torvalds -
This reverts commit 7a238fcba0629b6f2edbcd37458bae56fcf36be5 in
preparation for a better and simpler fix proposed by Eric Biederman
(and fixed up by Serge Hallyn)Acked-by: Serge E. Hallyn
Signed-off-by: Linus Torvalds -
Fix exit race by splitting the nsproxy putting into two pieces. First
piece reduces the nsproxy refcount. If we dropped the last reference, then
it puts the mnt_ns, and returns the nsproxy as a hint to the caller. Else
it returns NULL. The second piece of exiting task namespaces sets
tsk->nsproxy to NULL, and drops the references to other namespaces and
frees the nsproxy only if an nsproxy was passed in.A little awkward and should probably be reworked, but hopefully it fixes
the NFS oops.Signed-off-by: Serge E. Hallyn
Cc: Herbert Poetzl
Cc: Oleg Nesterov
Cc: "Eric W. Biederman"
Cc: Cedric Le Goater
Cc: Daniel Hokka Zakrisson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
01 Jan, 2007
1 commit
-
Commit b2b2cbc4b2a2f389442549399a993a8306420baf introduced a user-
visible change: ->pdeath_signal is sent only when the entire thread
group exits.While this change is imho good, it may break things. So restore the
old behaviour for now.Signed-off-by: Oleg Nesterov
To: Albert Cahalan
Cc: Eric W. Biederman
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: Ingo Molnar
Cc: Qi Yong
Cc: Roland McGrath
Signed-off-by: Linus Torvalds
23 Dec, 2006
2 commits
-
This patch fixes the case when we reparent to a different thread in the
same thread group. This modifies the code so that we do not send
signals and do not change the signal to send to SIGCHLD unless we have
change the thread group of our parents. It also suppresses sending
pdeath_sig in this cas as well since the result of geppid doesn't
change.Thanks to Oleg for spotting my bug of only fixing this for non-ptraced
tasks.Signed-off-by: Eric W. Biederman
Cc: Mike Galbraith
Cc: Albert Cahalan
Cc: Andrew Morton
Cc: Roland McGrath
Cc: Ingo Molnar
Cc: Coywolf Qi Hunt
Acked-by: Oleg Nesterov
Signed-off-by: Linus Torvalds -
Christoph Hellwig has expressed concerns that the recent fdtable changes
expose the details of the RCU methodology used to release no-longer-used
fdtable structures to the rest of the kernel. The trivial patch below
addresses these concerns by introducing the appropriate free_fdtable()
calls, which simply wrap the release RCU usage. Since free_fdtable() is a
one-liner, it makes sense to promote it to an inline helper.Signed-off-by: Vadim Lobanov
Cc: Christoph Hellwig
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
11 Dec, 2006
2 commits
-
An fdtable can either be embedded inside a files_struct or standalone (after
being expanded). When an fdtable is being discarded after all RCU references
to it have expired, we must either free it directly, in the standalone case,
or free the files_struct it is contained within, in the embedded case.Currently the free_files field controls this behavior, but we can get rid of
it entirely, as all the necessary information is already recorded. We can
distinguish embedded and standalone fdtables using max_fds, and if it is
embedded we can divine the relevant files_struct using container_of().Signed-off-by: Vadim Lobanov
Cc: Christoph Hellwig
Cc: Al Viro
Cc: Dipankar Sarma
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Currently, each fdtable supports three dynamically-sized arrays of data: the
fdarray and two fdsets. The code allows the number of fds supported by the
fdarray (fdtable->max_fds) to differ from the number of fds supported by each
of the fdsets (fdtable->max_fdset).In practice, it is wasteful for these two sizes to differ: whenever we hit a
limit on the smaller-capacity structure, we will reallocate the entire fdtable
and all the dynamic arrays within it, so any delta in the memory used by the
larger-capacity structure will never be touched at all.Rather than hogging this excess, we shouldn't even allocate it in the first
place, and keep the capacities of the fdarray and the fdsets equal. This
patch removes fdtable->max_fdset. As an added bonus, most of the supporting
code becomes simpler.Signed-off-by: Vadim Lobanov
Cc: Christoph Hellwig
Cc: Al Viro
Cc: Dipankar Sarma
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
09 Dec, 2006
7 commits
-
All members of the process group have the same sid and it can't be == 0.
NOTE: this code (and a similar one in sys_setpgid) was needed because it
was possibe to have ->session == 0. It's not possible any longer since[PATCH] pidhash: don't use zero pids
Commit: c7c6464117a02b0d54feb4ebeca4db70fa493678Signed-off-by: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Add a per pid_namespace child-reaper. This is needed so processes are reaped
within the same pid space and do not spill over to the parent pid space. Its
also needed so containers preserve existing semantic that pid == 1 would reap
orphaned children.This is based on Eric Biederman's patch: http://lkml.org/lkml/2006/2/6/285
Signed-off-by: Sukadev Bhattiprolu
Signed-off-by: Cedric Le Goater
Cc: Kirill Korotaev
Cc: Eric W. Biederman
Cc: Herbert Poetzl
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Rename 'struct namespace' to 'struct mnt_namespace' to avoid confusion with
other namespaces being developped for the containers : pid, uts, ipc, etc.
'namespace' variables and attributes are also renamed to 'mnt_ns'Signed-off-by: Kirill Korotaev
Signed-off-by: Cedric Le Goater
Cc: Eric W. Biederman
Cc: Herbert Poetzl
Cc: Sukadev Bhattiprolu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Add an anonymous union and ((deprecated)) to catch direct usage of the
session field.[akpm@osdl.org: fix various missed conversions]
[jdike@addtoit.com: fix UML bug]
Signed-off-by: Jeff Dike
Cc: Cedric Le Goater
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Replace occurences of task->signal->session by a new process_session() helper
routine.It will be useful for pid namespaces to abstract the session pid number.
Signed-off-by: Cedric Le Goater
Cc: Kirill Korotaev
Cc: Eric W. Biederman
Cc: Herbert Poetzl
Cc: Sukadev Bhattiprolu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Make set_special_pids() static, the only caller is daemonize().
Signed-off-by: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Fix the locking of signal->tty.
Use ->sighand->siglock to protect ->signal->tty; this lock is already used
by most other members of ->signal/->sighand. And unless we are 'current'
or the tasklist_lock is held we need ->siglock to access ->signal anyway.(NOTE: sys_unshare() is broken wrt ->sighand locking rules)
Note that tty_mutex is held over tty destruction, so while holding
tty_mutex any tty pointer remains valid. Otherwise the lifetime of ttys
are governed by their open file handles. This leaves some holes for tty
access from signal->tty (or any other non file related tty access).It solves the tty SLAB scribbles we were seeing.
(NOTE: the change from group_send_sig_info to __group_send_sig_info needs to
be examined by someone familiar with the security framework, I think
it is safe given the SEND_SIG_PRIV from other __group_send_sig_info
invocations)[schwidefsky@de.ibm.com: 3270 fix]
[akpm@osdl.org: various post-viro fixes]
Signed-off-by: Peter Zijlstra
Acked-by: Alan Cox
Cc: Oleg Nesterov
Cc: Prarit Bhargava
Cc: Chris Wright
Cc: Roland McGrath
Cc: Stephen Smalley
Cc: James Morris
Cc: "David S. Miller"
Cc: Jeff Dike
Cc: Martin Schwidefsky
Cc: Jan Kara
Signed-off-by: Martin Schwidefsky
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
08 Dec, 2006
1 commit
-
do_exit:
taskstats_exit_alloc()
...
taskstats_exit_send()
taskstats_exit_free()I think this is not good, let it be a single function exported to the core
kernel, taskstats_exit(), which does alloc + send + free itself.Signed-off-by: Oleg Nesterov
Cc: Balbir Singh
Cc: Shailabh Nagar
Cc: Jay Lan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
29 Oct, 2006
1 commit
-
taskstats_tgid_free() is called on copy_process's error path. This is wrong.
IF (clone_flags & CLONE_THREAD)
We should not clear ->signal->taskstats, current uses it,
it probably has a valid accumulated info.
ELSE
taskstats_tgid_init() set ->signal->taskstats = NULL,
there is nothing to free.Move the callsite to __exit_signal(). We don't need any locking, entire
thread group is exiting, nobody should have a reference to soon to be
released ->signal.Signed-off-by: Oleg Nesterov
Cc: Shailabh Nagar
Cc: Balbir Singh
Cc: Jay Lan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
02 Oct, 2006
3 commits
-
exit_task_namespaces() has replaced the former exit_namespace(). It
invalidates task->nsproxy and associated namespaces. This is an issue for
the (futur) pid namespace which is required to be valid in exit_notify().This patch moves exit_task_namespaces() after exit_notify() to keep nsproxy
valid.Signed-off-by: Cedric Le Goater
Cc: Serge E. Hallyn
Cc: Kirill Korotaev
Cc: "Eric W. Biederman"
Cc: Herbert Poetzl
Cc: Andrey Savochkin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This moves the mount namespace into the nsproxy. The mount namespace count
now refers to the number of nsproxies point to it, rather than the number of
tasks. As a result, the unshare_namespace() function in kernel/fork.c no
longer checks whether it is being shared.Signed-off-by: Serge Hallyn
Cc: Kirill Korotaev
Cc: "Eric W. Biederman"
Cc: Herbert Poetzl
Cc: Andrey Savochkin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This patch adds a nsproxy structure to the task struct. Later patches will
move the fs namespace pointer into this structure, and introduce a new utsname
namespace into the nsproxy.The vserver and openvz functionality, then, would be implemented in large part
by virtualizing/isolating more and more resources into namespaces, each
contained in the nsproxy.[akpm@osdl.org: build fix]
Signed-off-by: Serge Hallyn
Cc: Kirill Korotaev
Cc: "Eric W. Biederman"
Cc: Herbert Poetzl
Cc: Andrey Savochkin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
01 Oct, 2006
2 commits
-
There were a few accounting data/macros that are used in CSA but are #ifdef'ed
inside CONFIG_BSD_PROCESS_ACCT. This patch is to change those ifdef's from
CONFIG_BSD_PROCESS_ACCT to CONFIG_TASK_XACCT. A few defines are moved from
kernel/acct.c and include/linux/acct.h to kernel/tsacct.c and
include/linux/tsacct_kern.h.Signed-off-by: Jay Lan
Cc: Shailabh Nagar
Cc: Balbir Singh
Cc: Jes Sorensen
Cc: Chris Sturtivant
Cc: Tony Ernst
Cc: Guillaume Thouvenin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Remove the duplicate declaration of exit_io_context() from linux/sched.h.
Signed-Off-By: David Howells
Signed-off-by: Jens Axboe
30 Sep, 2006
5 commits
-
I am not sure about this patch, I am asking Ingo to take a decision.
task_struct->state == EXIT_DEAD is a very special case, to avoid a confusion
it makes sense to introduce a new state, TASK_DEAD, while EXIT_DEAD should
live only in ->exit_state as documented in sched.h.Note that this state is not visible to user-space, get_task_state() masks off
unsuitable states.Signed-off-by: Oleg Nesterov
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
After the previous change (->flags & PF_DEAD) (->state == EXIT_DEAD), we
don't need PF_DEAD any longer.Signed-off-by: Oleg Nesterov
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
schedule() checks PF_DEAD on every context switch and sets ->state = EXIT_DEAD
to ensure that the exiting task will be deactivated. Note that this EXIT_DEAD
is in fact a "random" value, we can use any bit except normal TASK_XXX values.It is better to set this state in do_exit() along with PF_DEAD flag and remove
that check in schedule().We are safe wrt concurrent try_to_wake_up() (for example ptrace, tkill), it
can not change task's ->state: the 'state' argument of try_to_wake_up() can't
have EXIT_DEAD bit. And in case when try_to_wake_up() sees a stale value of
->state == TASK_RUNNING it will do nothing.Signed-off-by: Oleg Nesterov
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Remove open-coded has_rt_policy(), no changes in kernel/exit.o
Signed-off-by: Oleg Nesterov
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: Steven Rostedt
Cc: Nick Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
If we are going to BUG() not panic() here then we should cover the case of
the BUG being compiled outSigned-off-by: Alan Cox
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds