25 Apr, 2008

2 commits

  • * let unshare_files() give caller the displaced files_struct
    * don't bother with grabbing reference only to drop it in the
    caller if it hadn't been shared in the first place
    * in that form unshare_files() is trivially implemented via
    unshare_fd(), so we eliminate the duplicate logics in fork.c
    * reset_files_struct() is not just only called for current;
    it will break the system if somebody ever calls it for anything
    else (we can't modify ->files of somebody else). Lose the
    task_struct * argument.

    Signed-off-by: Al Viro

    Al Viro
     
  • * unshare_files() can fail; doing it after irreversible actions is wrong
    and de_thread() is certainly irreversible.
    * since we do it unconditionally anyway, we might as well do it in do_execve()
    and save ourselves the PITA in binfmt handlers, etc.
    * while we are at it, binfmt_som actually leaked files_struct on failure.

    As a side benefit, unshare_files(), put_files_struct() and reset_files_struct()
    become unexported.

    Signed-off-by: Al Viro

    Al Viro
     

04 Mar, 2008

1 commit

  • The new code that removed the limitation on the execve string size
    (which was historically 32 pages) replaced it with a much softer limit
    based on RLIMIT_STACK which is usually much larger than the traditional
    limit. See commit b6a2fea39318e43fee84fa7b0b90d68bed92d2ba ("mm:
    variable length argument support") for details.

    However, if you have a small stack limit (perhaps because you need lots
    of stacks in a threaded environment), the new heuristic of allowing up
    to 1/4th of RLIMIT_STACK to be used for argument and environment strings
    could actually be smaller than the old limit.

    So just say that it's ok to have up to ARG_MAX strings regardless of the
    value of RLIMIT_STACK, and check the rlimit only when going over that
    traditional limit.

    (Of course, if you actually have a *really* small stack limit, the whole
    stack itself will be limited before you hit ARG_MAX, but that has always
    been true and is clearly the right behaviour anyway).

    Acked-by: Carlos O'Donell
    Cc: Michael Kerrisk
    Cc: Peter Zijlstra
    Cc: Ollie Wild
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

15 Feb, 2008

2 commits

  • * Add path_put() functions for releasing a reference to the dentry and
    vfsmount of a struct path in the right order

    * Switch from path_release(nd) to path_put(&nd->path)

    * Rename dput_path() to path_put_conditional()

    [akpm@linux-foundation.org: fix cifs]
    Signed-off-by: Jan Blunck
    Signed-off-by: Andreas Gruenbacher
    Acked-by: Christoph Hellwig
    Cc:
    Cc: Al Viro
    Cc: Steven French
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Blunck
     
  • This is the central patch of a cleanup series. In most cases there is no good
    reason why someone would want to use a dentry for itself. This series reflects
    that fact and embeds a struct path into nameidata.

    Together with the other patches of this series
    - it enforced the correct order of getting/releasing the reference count on
    pairs
    - it prepares the VFS for stacking support since it is essential to have a
    struct path in every place where the stack can be traversed
    - it reduces the overall code size:

    without patch series:
    text data bss dec hex filename
    5321639 858418 715768 6895825 6938d1 vmlinux

    with patch series:
    text data bss dec hex filename
    5320026 858418 715768 6894212 693284 vmlinux

    This patch:

    Switch from nd->{dentry,mnt} to nd->path.{dentry,mnt} everywhere.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix cifs]
    [akpm@linux-foundation.org: fix smack]
    Signed-off-by: Jan Blunck
    Signed-off-by: Andreas Gruenbacher
    Acked-by: Christoph Hellwig
    Cc: Al Viro
    Cc: Casey Schaufler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Blunck
     

09 Feb, 2008

3 commits

  • This allows us to use executables >2GB.

    Based on a patch by Dave Anderson

    Signed-off-by: Andi Kleen
    Cc: Dave Anderson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Suppress A.OUT library support if CONFIG_ARCH_SUPPORTS_AOUT is not set.

    Not all architectures support the A.OUT binfmt, so the ELF binfmt should not
    be permitted to go looking for A.OUT libraries to load in such a case. Not
    only that, but under such conditions A.OUT core dumps are not produced either.

    To make this work, this patch also does the following:

    (1) Makes the existence of the contents of linux/a.out.h contingent on
    CONFIG_ARCH_SUPPORTS_AOUT.

    (2) Renames dump_thread() to aout_dump_thread() as it's only called by A.OUT
    core dumping code.

    (3) Moves aout_dump_thread() into asm/a.out-core.h and makes it inline. This
    is then included only where needed. This means that this bit of arch
    code will be stored in the appropriate A.OUT binfmt module rather than
    the core kernel.

    (4) Drops A.OUT support for Blackfin (according to Mike Frysinger it's not
    needed) and FRV.

    This patch depends on the previous patch to move STACK_TOP[_MAX] out of
    asm/a.out.h and into asm/processor.h as they're required whether or not A.OUT
    format is available.

    [jdike@addtoit.com: uml: re-remove accidentally restored code]
    Signed-off-by: David Howells
    Cc:
    Signed-off-by: Jeff Dike
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • signal_struct->tsk points to the ->group_leader and thus we have the nasty
    code in de_thread() which has to change it and restart ->real_timer if the
    leader is changed.

    Use "struct pid *leader_pid" instead. This also allows us to kill now
    unneeded send_group_sig_info().

    Signed-off-by: Oleg Nesterov
    Acked-by: "Eric W. Biederman"
    Cc: Davide Libenzi
    Cc: Pavel Emelyanov
    Acked-by: Roland McGrath
    Acked-by: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

06 Feb, 2008

2 commits

  • As Roland pointed out, we have the very old problem with exec. de_thread()
    sets SIGNAL_GROUP_EXIT, kills other threads, changes ->group_leader and then
    clears signal->flags. All signals (even fatal ones) sent in this window
    (which is not too small) will be lost.

    With this patch exec doesn't abuse SIGNAL_GROUP_EXIT. signal_group_exit(),
    the new helper, should be used to detect exit_group() or exec() in progress.
    It can have more users, but this patch does only strictly necessary changes.

    Signed-off-by: Oleg Nesterov
    Cc: Davide Libenzi
    Cc: Ingo Molnar
    Cc: Robin Holt
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • It was dumb to make get_task_comm() return void. Change it to return a
    pointer to the resulting output for caller convenience.

    Cc: Ulrich Drepper
    Cc: Ingo Molnar
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

29 Nov, 2007

1 commit

  • fix: http://bugzilla.kernel.org/show_bug.cgi?id=3043

    only allow coredumping to the same uid that the coredumping
    task runs under.

    Signed-off-by: Ingo Molnar
    Acked-by: Alan Cox
    Acked-by: Christoph Hellwig
    Acked-by: Al Viro
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     

13 Nov, 2007

1 commit

  • The coredump code always calls set_dumpable(0) when it starts (even
    if RLIMIT_CORE prevents any core from being dumped). The effect of
    this (via task_dumpable) is to make /proc/pid/* files owned by root
    instead of the user, so the user can no longer examine his own
    process--in a case where there was never any privileged data to
    protect. This affects e.g. auxv, environ, fd; in Fedora (execshield)
    kernels, also maps. In practice, you can only notice this when a
    debugger has requested PTRACE_EVENT_EXIT tracing.

    set_dumpable was only used in do_coredump for synchronization and not
    intended for any security purpose. (It doesn't secure anything that wasn't
    already unsecured when a process dies by SIGTERM instead of SIGQUIT.)

    This changes do_coredump to check the core_waiters count as the means of
    synchronization, which is sufficient. Now we leave the "dumpable" bits alone.

    Signed-off-by: Roland McGrath
    Signed-off-by: Linus Torvalds

    Roland McGrath
     

20 Oct, 2007

6 commits

  • With pid namespaces this field is now dangerous to use explicitly, so hide
    it behind the helpers.

    Also the pid and pgrp fields o task_struct and signal_struct are to be
    deprecated. Unfortunately this patch cannot be sent right now as this
    leads to tons of warnings, so start isolating them, and deprecate later.

    Actually the p->tgid == pid has to be changed to has_group_leader_pid(),
    but Oleg pointed out that in case of posix cpu timers this is the same, and
    thread_group_leader() is more preferable.

    Signed-off-by: Pavel Emelyanov
    Acked-by: Oleg Nesterov
    Cc: Sukadev Bhattiprolu
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • This is the largest patch in the set. Make all (I hope) the places where
    the pid is shown to or get from user operate on the virtual pids.

    The idea is:
    - all in-kernel data structures must store either struct pid itself
    or the pid's global nr, obtained with pid_nr() call;
    - when seeking the task from kernel code with the stored id one
    should use find_task_by_pid() call that works with global pids;
    - when showing pid's numerical value to the user the virtual one
    should be used, but however when one shows task's pid outside this
    task's namespace the global one is to be used;
    - when getting the pid from userspace one need to consider this as
    the virtual one and use appropriate task/pid-searching functions.

    [akpm@linux-foundation.org: build fix]
    [akpm@linux-foundation.org: nuther build fix]
    [akpm@linux-foundation.org: yet nuther build fix]
    [akpm@linux-foundation.org: remove unneeded casts]
    Signed-off-by: Pavel Emelyanov
    Signed-off-by: Alexey Dobriyan
    Cc: Sukadev Bhattiprolu
    Cc: Oleg Nesterov
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • Use task_pid() to get leader's 'struct pid' and avoid the find_pid().

    Signed-off-by: Sukadev Bhattiprolu
    Acked-by: Pavel Emelianov
    Cc: Eric W. Biederman
    Cc: Cedric Le Goater
    Cc: Dave Hansen
    Cc: Serge Hallyn
    Cc: Herbert Poetzel
    Cc: Kirill Korotaev
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     
  • Rename the child_reaper() function to task_child_reaper() to be similar to
    other task_* functions and to distinguish the function from 'struct
    pid_namspace.child_reaper'.

    Signed-off-by: Sukadev Bhattiprolu
    Cc: Pavel Emelianov
    Cc: Eric W. Biederman
    Cc: Cedric Le Goater
    Cc: Dave Hansen
    Cc: Serge Hallyn
    Cc: Herbert Poetzel
    Cc: Kirill Korotaev
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     
  • With multiple pid namespaces, a process is known by some pid_t in every
    ancestor pid namespace. Every time the process forks, the child process also
    gets a pid_t in every ancestor pid namespace.

    While a process is visible in >=1 pid namespaces, it can see pid_t's in only
    one pid namespace. We call this pid namespace it's "active pid namespace",
    and it is always the youngest pid namespace in which the process is known.

    This patch defines and uses a wrapper to find the active pid namespace of a
    process. The implementation of the wrapper will be changed in when support
    for multiple pid namespaces are added.

    Changelog:
    2.6.22-rc4-mm2-pidns1:
    - [Pavel Emelianov, Alexey Dobriyan] Back out the change to use
    task_active_pid_ns() in child_reaper() since task->nsproxy
    can be NULL during task exit (so child_reaper() continues to
    use init_pid_ns).

    to implement child_reaper() since init_pid_ns.child_reaper to
    implement child_reaper() since tsk->nsproxy can be NULL during exit.

    2.6.21-rc6-mm1:
    - Rename task_pid_ns() to task_active_pid_ns() to reflect that a
    process can have multiple pid namespaces.

    Signed-off-by: Sukadev Bhattiprolu
    Acked-by: Pavel Emelianov
    Cc: Eric W. Biederman
    Cc: Cedric Le Goater
    Cc: Dave Hansen
    Cc: Serge Hallyn
    Cc: Herbert Poetzel
    Cc: Kirill Korotaev
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     
  • This patch uses vm_get_page_prot() to setup vma->vm_page_prot.

    Though inside vm_get_page_prot() the protection flags is AND with
    (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED), it does not hurt correct code.

    Signed-off-by: Coly Li
    Cc: Hugh Dickins
    Cc: Tony Luck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Coly Li
     

17 Oct, 2007

11 commits

  • This patch contains the following cleanups that are now possible:
    - remove the unused security_operations->inode_xattr_getsuffix
    - remove the no longer used security_operations->unregister_security
    - remove some no longer required exit code
    - remove a bunch of no longer used exports

    Signed-off-by: Adrian Bunk
    Acked-by: James Morris
    Cc: Chris Wright
    Cc: Stephen Smalley
    Cc: Serge Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • de_thread() yields waiting for ->group_leader to be a zombie. This deadlocks
    if an rt-prio execer shares the same cpu with ->group_leader. Change the code
    to use ->group_exit_task/notify_count mechanics.

    This patch certainly uglifies the code, perhaps someone can suggest something
    better.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Now that we don't pre-allocate the new ->sighand, we can kill the first fast
    path, it doesn't make sense any longer. At best, it can save one "list_empty()"
    check but leads to the code duplication.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • de_thread() pre-allocates newsighand to make sure that exec() can't fail after
    killing all sub-threads. Imho, this buys nothing, but complicates the code:

    - this is (mostly) needed to handle CLONE_SIGHAND without CLONE_THREAD
    tasks, this is very unlikely (if ever used) case

    - unless we already have some serious problems, GFP_KERNEL allocation
    should not fail

    - ENOMEM still can happen after de_thread(), ->sighand is not the last
    object we have to allocate

    Change the code to allocate the new ->sighand on demand.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • There is no any reason to do recalc_sigpending() after changing ->sighand.
    To begin with, recalc_sigpending() does not take ->sighand into account.

    This means we don't need to take newsighand->siglock while changing sighands.
    rcu_assign_pointer() provides a necessary barrier, and if another process
    reads the new ->sighand it should either take tasklist_lock or it should use
    lock_task_sighand() which has a corresponding smp_read_barrier_depends().

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • vfs_permission(MAY_EXEC) checks if the filesystem is mounted with "noexec", so
    there's no need to repeat this check in sys_uselib() and open_exec().

    Signed-off-by: Miklos Szeredi
    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Fix do_coredump to detect a crash in the user mode helper process and abort
    the attempt to recursively dump core to another copy of the helper process,
    potentially ad-infinitum.

    [akpm@linux-foundation.org: cleanups]
    Signed-off-by: Neil Horman
    Cc:
    Cc:
    Cc: Jeremy Fitzhardinge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Neil Horman
     
  • A rewrite of my previous post for this enhancement. It uses jeremy's
    split_argv/free_argv library functions to translate core_pattern into an argv
    array to be passed to the user mode helper process. It also adds a
    translation to format_corename such that the origional value of RLIMIT_CORE
    can be passed to userspace as an argument.

    Signed-off-by: Neil Horman
    Cc:
    Cc:
    Cc: Jeremy Fitzhardinge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Neil Horman
     
  • For some time /proc/sys/kernel/core_pattern has been able to set its output
    destination as a pipe, allowing a user space helper to receive and
    intellegently process a core. This infrastructure however has some
    shortcommings which can be enhanced. Specifically:

    1) The coredump code in the kernel should ignore RLIMIT_CORE limitation
    when core_pattern is a pipe, since file system resources are not being
    consumed in this case, unless the user application wishes to save the core,
    at which point the app is restricted by usual file system limits and
    restrictions.

    2) The core_pattern code should be able to parse and pass options to the
    user space helper as an argv array. The real core limit of the uid of the
    crashing proces should also be passable to the user space helper (since it
    is overridden to zero when called).

    3) Some miscellaneous bugs need to be cleaned up (specifically the
    recognition of a recursive core dump, should the user mode helper itself
    crash. Also, the core dump code in the kernel should not wait for the user
    mode helper to exit, since the same context is responsible for writing to
    the pipe, and a read of the pipe by the user mode helper will result in a
    deadlock.

    This patch:

    Remove the check of RLIMIT_CORE if core_pattern is a pipe. In the event that
    core_pattern is a pipe, the entire core will be fed to the user mode helper.

    Signed-off-by: Neil Horman
    Cc:
    Cc:
    Cc: Jeremy Fitzhardinge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Neil Horman
     
  • list_del() hardly can fail, so checking for return value is pointless
    (and current code always return 0).

    Nobody really cared that return value anyway.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Switch single-linked binfmt formats list to usual list_head's. This leads
    to one-liners in register_binfmt() and unregister_binfmt(). The downside
    is one pointer more in struct linux_binfmt. This is not a problem, since
    the set of registered binfmts on typical box is very small -- (ELF +
    something distro enabled for you).

    Test-booted, played with executable .txt files, modprobe/rmmod binfmt_misc.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

21 Sep, 2007

1 commit

  • This simplifies signalfd code, by avoiding it to remain attached to the
    sighand during its lifetime.

    In this way, the signalfd remain attached to the sighand only during
    poll(2) (and select and epoll) and read(2). This also allows to remove
    all the custom "tsk == current" checks in kernel/signal.c, since
    dequeue_signal() will only be called by "current".

    I think this is also what Ben was suggesting time ago.

    The external effect of this, is that a thread can extract only its own
    private signals and the group ones. I think this is an acceptable
    behaviour, in that those are the signals the thread would be able to
    fetch w/out signalfd.

    Signed-off-by: Davide Libenzi
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

23 Aug, 2007

2 commits

  • de_thread:

    if (atomic_read(&oldsighand->count) count) != 1);

    This is not safe without the rmb() in between. The results of two
    correctly ordered __exit_signal()->atomic_dec_and_test()'s could be seen
    out of order on our CPU.

    The same is true for the "thread_group_empty()" case, __unhash_process()'s
    changes could be seen before atomic_dec_and_test(&sig->count).

    On some platforms (including i386) atomic_read() doesn't provide even the
    compiler barrier, in that case these checks are simply racy.

    Remove these BUG_ON()'s. Alternatively, we can do something like

    BUG_ON( ({ smp_rmb(); atomic_read(&sig->count) != 1; }) );

    Signed-off-by: Oleg Nesterov
    Acked-by: Paul E. McKenney
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • With this patch any thread can dequeue its own private signals via signalfd,
    even if it was created by another sub-thread.

    To do so, we pass "current" to dequeue_signal() if the caller is from the same
    thread group. This also fixes the scheduling of posix timers broken by the
    previous patch.

    If the caller doesn't belong to this thread group, we can't handle __SI_TIMER
    case properly anyway. Perhaps we should forbid the cross-process signalfd usage
    and convert ctx->tsk to ctx->sighand.

    Signed-off-by: Oleg Nesterov
    Cc: Benjamin Herrenschmidt
    Cc: Davide Libenzi
    Cc: Ingo Molnar
    Cc: Michael Kerrisk
    Cc: Roland McGrath
    Cc: Thomas Gleixner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

19 Aug, 2007

1 commit


20 Jul, 2007

3 commits

  • This patch changes mm_struct.dumpable to a pair of bit flags.

    set_dumpable() converts three-value dumpable to two flags and stores it into
    lower two bits of mm_struct.flags instead of mm_struct.dumpable.
    get_dumpable() behaves in the opposite way.

    [akpm@linux-foundation.org: export set_dumpable]
    Signed-off-by: Hidehiro Kawai
    Cc: Alan Cox
    Cc: David Howells
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kawai, Hidehiro
     
  • Remove the arg+env limit of MAX_ARG_PAGES by copying the strings directly from
    the old mm into the new mm.

    We create the new mm before the binfmt code runs, and place the new stack at
    the very top of the address space. Once the binfmt code runs and figures out
    where the stack should be, we move it downwards.

    It is a bit peculiar in that we have one task with two mm's, one of which is
    inactive.

    [a.p.zijlstra@chello.nl: limit stack size]
    Signed-off-by: Ollie Wild
    Signed-off-by: Peter Zijlstra
    Cc:
    Cc: Hugh Dickins
    [bunk@stusta.de: unexport bprm_mm_init]
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ollie Wild
     
  • The purpose of audit_bprm() is to log the argv array to a userspace daemon at
    the end of the execve system call. Since user-space hasn't had time to run,
    this array is still in pristine state on the process' stack; so no need to
    copy it, we can just grab it from there.

    In order to minimize the damage to audit_log_*() copy each string into a
    temporary kernel buffer first.

    Currently the audit code requires that the full argument vector fits in a
    single packet. So currently it does clip the argv size to a (sysctl) limit,
    but only when execve auditing is enabled.

    If the audit protocol gets extended to allow for multiple packets this check
    can be removed.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ollie Wild
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

24 May, 2007

1 commit


17 May, 2007

1 commit


12 May, 2007

1 commit


11 May, 2007

1 commit

  • This patch series implements the new signalfd() system call.

    I took part of the original Linus code (and you know how badly it can be
    broken :), and I added even more breakage ;) Signals are fetched from the same
    signal queue used by the process, so signalfd will compete with standard
    kernel delivery in dequeue_signal(). If you want to reliably fetch signals on
    the signalfd file, you need to block them with sigprocmask(SIG_BLOCK). This
    seems to be working fine on my Dual Opteron machine. I made a quick test
    program for it:

    http://www.xmailserver.org/signafd-test.c

    The signalfd() system call implements signal delivery into a file descriptor
    receiver. The signalfd file descriptor if created with the following API:

    int signalfd(int ufd, const sigset_t *mask, size_t masksize);

    The "ufd" parameter allows to change an existing signalfd sigmask, w/out going
    to close/create cycle (Linus idea). Use "ufd" == -1 if you want a brand new
    signalfd file.

    The "mask" allows to specify the signal mask of signals that we are interested
    in. The "masksize" parameter is the size of "mask".

    The signalfd fd supports the poll(2) and read(2) system calls. The poll(2)
    will return POLLIN when signals are available to be dequeued. As a direct
    consequence of supporting the Linux poll subsystem, the signalfd fd can use
    used together with epoll(2) too.

    The read(2) system call will return a "struct signalfd_siginfo" structure in
    the userspace supplied buffer. The return value is the number of bytes copied
    in the supplied buffer, or -1 in case of error. The read(2) call can also
    return 0, in case the sighand structure to which the signalfd was attached,
    has been orphaned. The O_NONBLOCK flag is also supported, and read(2) will
    return -EAGAIN in case no signal is available.

    If the size of the buffer passed to read(2) is lower than sizeof(struct
    signalfd_siginfo), -EINVAL is returned. A read from the signalfd can also
    return -ERESTARTSYS in case a signal hits the process. The format of the
    struct signalfd_siginfo is, and the valid fields depends of the (->code &
    __SI_MASK) value, in the same way a struct siginfo would:

    struct signalfd_siginfo {
    __u32 signo; /* si_signo */
    __s32 err; /* si_errno */
    __s32 code; /* si_code */
    __u32 pid; /* si_pid */
    __u32 uid; /* si_uid */
    __s32 fd; /* si_fd */
    __u32 tid; /* si_fd */
    __u32 band; /* si_band */
    __u32 overrun; /* si_overrun */
    __u32 trapno; /* si_trapno */
    __s32 status; /* si_status */
    __s32 svint; /* si_int */
    __u64 svptr; /* si_ptr */
    __u64 utime; /* si_utime */
    __u64 stime; /* si_stime */
    __u64 addr; /* si_addr */
    };

    [akpm@linux-foundation.org: fix signalfd_copyinfo() on i386]
    Signed-off-by: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi