28 May, 2010

2 commits

  • Change zap_other_threads() to return the number of other sub-threads found
    on ->thread_group list.

    Other changes are cosmetic:

    - change the code to use while_each_thread() helper

    - remove the obsolete comment about SIGKILL/SIGSTOP
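
    A minimal userspace sketch of the counting idea, assuming a circular
    singly linked ring in place of the ->thread_group list. The names
    struct ring_thread and zap_others() are hypothetical and only
    illustrate returning the number of other threads visited; the real
    function does more work per thread.

```c
#include <assert.h>

/* Hypothetical stand-in for a task on the ->thread_group ring. */
struct ring_thread {
	struct ring_thread *next;	/* circular: last entry points back to the first */
	int killed;			/* set once the thread has been told to die */
};

/*
 * Mark every thread on the ring except `self` as killed and return how
 * many other threads were found.
 */
int zap_others(struct ring_thread *self)
{
	int count = 0;
	struct ring_thread *t;

	assert(self && self->next);

	/* Walk the ring until we are back at the caller. */
	for (t = self->next; t != self; t = t->next) {
		t->killed = 1;
		count++;
	}
	return count;
}
```

    A caller that needs to wait for the other threads can use the return
    value directly instead of walking the list a second time.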

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Cc: Veaceslav Falico
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Andrew Tridgell reports that aio_read(SIGEV_SIGNAL) can fail if the
    notification from the helper thread races with setresuid(), see
    http://samba.org/~tridge/junkcode/aio_uid.c

    This happens because check_kill_permission() doesn't permit sending a
    signal to a task with different cred->xids. But there is not any
    security reason to check ->cred's when the task sends a signal (private or
    group-wide) to its sub-thread. Whatever we do, any thread can bypass all
    security checks and send SIGKILL to all threads, or it can block a signal
    SIG and do kill(gettid(), SIG) to deliver this signal to another
    sub-thread. Not to mention that CLONE_THREAD implies CLONE_VM.

    Change check_kill_permission() to avoid the credentials check when the
    sender and the target are from the same thread group.

    Also, move "cred = current_cred()" down to avoid calling get_current()
    twice.
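
    The rule can be modelled in userspace roughly as follows. This is a
    simplified sketch, not the kernel code: struct mini_task and
    may_signal() are hypothetical, and capability checks and the SIGCONT
    session case are left out on purpose.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily reduced view of a task: only what the check needs. */
struct mini_task {
	int tgid;	/* thread-group id, shared by all sub-threads */
	unsigned euid;	/* effective uid */
	unsigned uid;	/* real uid */
	unsigned suid;	/* saved uid */
};

static bool same_thread_group(const struct mini_task *a,
			      const struct mini_task *b)
{
	return a->tgid == b->tgid;
}

/* Returns true when `sender` may signal `target` in this simplified model. */
bool may_signal(const struct mini_task *sender, const struct mini_task *target)
{
	assert(sender && target);

	/* Threads of one process skip the credentials comparison entirely. */
	if (same_thread_group(sender, target))
		return true;

	/* Otherwise one of the sender's uids must match one of the target's. */
	return sender->euid == target->suid || sender->euid == target->uid ||
	       sender->uid == target->suid || sender->uid == target->uid;
}
```

    With this ordering, a helper thread whose sibling has just changed
    uids is still allowed to deliver its notification.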

    Note: David Howells pointed out we could relax this even more: the
    CLONE_SIGHAND (without CLONE_THREAD) case probably does not need
    these checks either.

    Roland said:
    : The glibc (libpthread) that does set*id across threads has
    : been in use for a while (2.3.4?), probably in distros using kernels as old
    : or older than any active -stable streams. In the race in question, this
    : kernel bug is breaking valid POSIX application expectations.

    Reported-by: Andrew Tridgell
    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Acked-by: David Howells
    Cc: Eric Paris
    Cc: Jakub Jelinek
    Cc: James Morris
    Cc: Roland McGrath
    Cc: Stephen Smalley
    Cc: [all kernel versions]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

21 May, 2010

1 commit

  • This patch contains the hooks and instrumentation into the kernel that
    live outside the kernel/debug directory and that the kdb core
    will call to run commands such as lsmod, dmesg and bt.

    CC: linux-arch@vger.kernel.org
    Signed-off-by: Jason Wessel
    Signed-off-by: Martin Hicks

    Jason Wessel
     

07 Mar, 2010

1 commit

  • Make sure compiler won't do weird things with limits. E.g. fetching them
    twice may return 2 different values after writable limits are implemented.

    I.e. either use rlimit helpers added in commit 3e10e716abf3 ("resource:
    add helpers for fetching rlimits") or ACCESS_ONCE if not applicable.
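
    A small userspace sketch of the "fetch once" pattern. The macro
    ACCESS_ONCE_ULONG() is a typed imitation of ACCESS_ONCE, and struct
    mini_limit and clamp_to_limit() are hypothetical names used only for
    illustration.

```c
#include <assert.h>

/* Typed imitation of ACCESS_ONCE(): force exactly one load of x. */
#define ACCESS_ONCE_ULONG(x) (*(volatile unsigned long *)&(x))

/* Hypothetical limit pair; another thread may rewrite it at any time. */
struct mini_limit {
	unsigned long cur;
	unsigned long max;
};

/*
 * Clamp `want` to the soft limit. The limit is loaded once into a local,
 * so the comparison and the returned value always use the same number.
 */
unsigned long clamp_to_limit(struct mini_limit *lim, unsigned long want)
{
	unsigned long cur;

	assert(lim);
	cur = ACCESS_ONCE_ULONG(lim->cur);

	return want > cur ? cur : want;
}
```

    Without the local copy, the compiler would be free to read lim->cur
    once for the comparison and again for the result.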

    Signed-off-by: Jiri Slaby
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: john stultz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
     

04 Mar, 2010

1 commit

  • This makes sure that we pick the synchronous signals caused by a
    processor fault over any pending regular asynchronous signals sent to
    us by [t]kill().

    This is not strictly required semantics, but it makes it _much_ easier
    for programs like Wine that expect to find the fault information in the
    signal stack.

    Without this, if a non-synchronous signal gets picked first, the delayed
    synchronous signal will have its signal context pointing to the new
    signal invocation, rather than the instruction that caused the SIGSEGV
    or SIGBUS in the first place.

    This is not all that pretty, and we're discussing making the synchronous
    signals more explicit rather than have these kinds of implicit
    preferences of SIGSEGV and friends. See for example

    http://bugzilla.kernel.org/show_bug.cgi?id=15395

    for some of the discussion. But in the meantime this is a simple and
    fairly straightforward work-around, and the whole

    if (x & Y)
        x &= Y;

    thing can be compiled into (and gcc does do it) just three instructions:

    movq %rdx, %rax
    andl $Y, %eax
    cmovne %rax, %rdx

    so it is at least a simple solution to a subtle issue.
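
    As a rough userspace model of that selection step: the sketch below
    treats the pending set as a plain bitmask and assumes the synchronous
    set is SIGSEGV, SIGBUS, SIGILL, SIGTRAP and SIGFPE. SIGMASK() and
    pick_next_signal() are hypothetical names, not the kernel's.

```c
#include <assert.h>
#include <signal.h>

#define SIGMASK(sig)	(1UL << ((sig) - 1))

/* Signals raised by a faulting instruction; an assumed set for this sketch. */
#define SYNCHRONOUS_MASK \
	(SIGMASK(SIGSEGV) | SIGMASK(SIGBUS) | SIGMASK(SIGILL) | \
	 SIGMASK(SIGTRAP) | SIGMASK(SIGFPE))

/*
 * Return the number of the signal to deliver next from `pending` (bit n-1
 * set means signal n is pending), or 0 when nothing is pending.
 */
int pick_next_signal(unsigned long pending)
{
	unsigned long x = pending;
	int sig = 1;

	if (!x)
		return 0;

	/* The "if (x & Y) x &= Y;" step: narrow to synchronous signals. */
	if (x & SYNCHRONOUS_MASK)
		x &= SYNCHRONOUS_MASK;

	/* The loop below relies on at least one bit still being set. */
	assert(x != 0);

	/* Lowest-numbered remaining signal wins. */
	while (!(x & 1UL)) {
		x >>= 1;
		sig++;
	}
	return sig;
}
```

    With SIGINT and SIGSEGV both pending, this returns SIGSEGV even though
    SIGINT has the lower number.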

    Reported-and-tested-by: Pavel Vilim
    Acked-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

12 Jan, 2010

1 commit

  • When print-fatal-signals is enabled it's possible to dump any memory
    reachable by the kernel to the log by simply jumping to that address from
    user space.

    Or crash the system if there's some hardware with read side effects.

    The fatal signals handler will dump 16 bytes at the execution address,
    which is fully controlled by ring 3.

    In addition, when something jumps to an unmapped address there will be up
    to 16 additional useless page faults, which might be slow (and at
    least is not very efficient).

    Fortunately this option is off by default and only there on i386.

    But fix it by checking for kernel addresses and also stopping when there's
    a page fault.

    Signed-off-by: Andi Kleen
    Cc: Ingo Molnar
    Cc: Oleg Nesterov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

20 Dec, 2009

1 commit


16 Dec, 2009

5 commits

  • Move the call to do_signal_stop() down, after tracehook call. This makes
    ->group_stop_count condition visible to tracers before do_signal_stop()
    will participate in this group-stop.

    Currently the patch has no effect, tracehook_get_signal() always returns 0.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Kill force_sig_specific(), this trivial wrapper has no callers.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Cc: Sukadev Bhattiprolu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Trivial, s/0/SI_USER/ in collect_signal() for grep.

    This is a bit confusing, we don't know the source of this signal.
    But we don't care, and "info->si_code = 0" is imho worse.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Cc: Sukadev Bhattiprolu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Change send_signal() to use si_fromuser(). From now on, SEND_SIG_NOINFO
    triggers the "from_ancestor_ns" check.

    This fixes reparent_thread()->group_send_sig_info(pdeath_signal)
    behaviour: before this patch send_signal() does not detect the
    cross-namespace case when the child of the dying parent belongs to the
    sub-namespace.

    This patch can affect the behaviour of send_sig(), kill_pgrp() and
    kill_pid() when the caller sends the signal to the sub-namespace with
    "priv == 0" but surprisingly all callers seem to use them correctly,
    including disassociate_ctty(on_exit).

    Except: drivers/staging/comedi/drivers/addi-data/*.c incorrectly use
    send_sig(priv => 0). But this is minor and should be fixed anyway.

    Reported-by: Daniel Lezcano
    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Reviewed-by: Sukadev Bhattiprolu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • No changes in compiled code. The patch adds the new helper, si_fromuser()
    and changes check_kill_permission() to use this helper.

    The real effect of this patch is that from now on we "officially" consider
    SEND_SIG_NOINFO signals to be "from user-space" signals. This is already
    true if we look at the code which uses SEND_SIG_NOINFO, except
    __send_signal() has another opinion - see the next patch.

    The naming of these special SEND_SIG_XXX siginfo's is really bad
    imho. From __send_signal()'s pov they mean

    SEND_SIG_NOINFO    from user
    SEND_SIG_PRIV      from kernel
    SEND_SIG_FORCED    no info
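
    A userspace sketch of what such a helper could look like. It assumes
    the three markers are small integers cast to pointers and that a code
    value of zero or below means "from user space"; struct mini_siginfo
    and is_special() are hypothetical stand-ins, not the kernel
    definitions.

```c
#include <stdint.h>

/* Hypothetical stand-in for struct siginfo; only the code field matters. */
struct mini_siginfo {
	int code;	/* <= 0 is treated as "generated by user space" here */
};

/* Special marker values, modelled as small integers cast to pointers. */
#define SEND_SIG_NOINFO	((struct mini_siginfo *)0)	/* from user */
#define SEND_SIG_PRIV	((struct mini_siginfo *)1)	/* from kernel */
#define SEND_SIG_FORCED	((struct mini_siginfo *)2)	/* no info */

/* True for any of the three markers, which must never be dereferenced. */
static int is_special(const struct mini_siginfo *info)
{
	return (uintptr_t)info <= (uintptr_t)SEND_SIG_FORCED;
}

/* True when the signal should be treated as coming from user space. */
int si_fromuser(const struct mini_siginfo *info)
{
	return info == SEND_SIG_NOINFO ||
	       (!is_special(info) && info->code <= 0);
}
```

    The marker test comes first so that a real siginfo is only
    dereferenced once it is known not to be one of the special values.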

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Reviewed-by: Sukadev Bhattiprolu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

11 Dec, 2009

2 commits

  • 1) Remove the misleading comment in __sigqueue_alloc() which claims
    that holding a spinlock is equivalent to rcu_read_lock().

    2) Add a rcu_read_lock/unlock around the __task_cred() access
    in __sigqueue_alloc()

    This needs to be revisited to remove the remaining users of
    read_lock(&tasklist_lock) but that's outside the scope of this patch.

    Signed-off-by: Thomas Gleixner
    LKML-Reference:

    Thomas Gleixner
     
  • kill_pid_info_as_uid() accesses __task_cred() without being in a RCU
    read side critical section. tasklist_lock is not protecting that when
    CONFIG_TREE_PREEMPT_RCU=y.

    Convert the whole tasklist_lock section to rcu and use
    lock_task_sighand to prevent the exit race.

    Signed-off-by: Thomas Gleixner
    LKML-Reference:
    Acked-by: Oleg Nesterov

    Thomas Gleixner
     

06 Dec, 2009

1 commit

  • …git/tip/linux-2.6-tip

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (470 commits)
    x86: Fix comments of register/stack access functions
    perf tools: Replace %m with %a in sscanf
    hw-breakpoints: Keep track of user disabled breakpoints
    tracing/syscalls: Make syscall events print callbacks static
    tracing: Add DEFINE_EVENT(), DEFINE_SINGLE_EVENT() support to docbook
    perf: Don't free perf_mmap_data until work has been done
    perf_event: Fix compile error
    perf tools: Fix _GNU_SOURCE macro related strndup() build error
    trace_syscalls: Remove unused syscall_name_to_nr()
    trace_syscalls: Simplify syscall profile
    trace_syscalls: Remove duplicate init_enter_##sname()
    trace_syscalls: Add syscall_nr field to struct syscall_metadata
    trace_syscalls: Remove enter_id exit_id
    trace_syscalls: Set event_enter_##sname->data to its metadata
    trace_syscalls: Remove unused event_syscall_enter and event_syscall_exit
    perf_event: Initialize data.period in perf_swevent_hrtimer()
    perf probe: Simplify event naming
    perf probe: Add --list option for listing current probe events
    perf probe: Add argv_split() from lib/argv_split.c
    perf probe: Move probe event utility functions to probe-event.c
    ...

    Linus Torvalds
     

26 Nov, 2009

3 commits

  • Add signal_overflow_fail and signal_lose_info tracepoints
    for signal-lost events.

    Changes in v3:
    - Add docbook style comments

    Changes in v2:
    - Use siginfo string macro

    Suggested-by: Roland McGrath
    Reviewed-by: Jason Baron
    Signed-off-by: Masami Hiramatsu
    Acked-by: Roland McGrath
    Cc: systemtap
    Cc: DLE
    Cc: Oleg Nesterov
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Masami Hiramatsu
     
  • Add a tracepoint where a process gets a signal. This tracepoint
    shows signal-number, sa-handler and sa-flag.

    Changes in v3:
    - Add docbook style comments

    Changes in v2:
    - Add siginfo argument
    - Fix comment

    Signed-off-by: Masami Hiramatsu
    Reviewed-by: Jason Baron
    Acked-by: Roland McGrath
    Cc: systemtap
    Cc: DLE
    Cc: Oleg Nesterov
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Masami Hiramatsu
     
  • Move signal sending event to events/signal.h. This patch also
    renames sched_signal_send event to signal_generate.

    Changes in v4:
    - Fix a typo of task_struct pointer.

    Changes in v3:
    - Add docbook style comments

    Changes in v2:
    - Add siginfo argument
    - Add siginfo storing macro

    Signed-off-by: Masami Hiramatsu
    Reviewed-by: Jason Baron
    Acked-by: Roland McGrath
    Cc: systemtap
    Cc: DLE
    Cc: Oleg Nesterov
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Masami Hiramatsu
     

09 Nov, 2009

1 commit

  • When the system has too many timers or too many aggregate
    queued signals, the EAGAIN error is returned to the application
    from the kernel, including from timer_create() [POSIX.1b].

    It means that the app exceeded the limit of pending signals,
    but in general application writers do not expect this
    outcome and the current silent failure can cause rare app
    failures under very high load.

    This patch adds a new message when we reach the limit
    and if print_fatal_signals is enabled:

    task/1234: reached RLIMIT_SIGPENDING, dropping signal

    If you see this message and your system behaved unexpectedly,
    you can run the following command to lift the limit:

    # ulimit -i unlimited
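
    An application can also inspect the limit itself before relying on
    queued signals. The sketch below uses getrlimit() with the
    Linux-specific RLIMIT_SIGPENDING resource; the function name
    sigpending_soft_limit() is hypothetical.

```c
#include <limits.h>
#include <sys/resource.h>

/*
 * Return the soft RLIMIT_SIGPENDING of the calling process, LLONG_MAX
 * when it is unlimited, or -1 if getrlimit() fails.
 */
long long sigpending_soft_limit(void)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_SIGPENDING, &rl) != 0)
		return -1;

	/* RLIM_INFINITY does not fit a signed type, so map it explicitly. */
	if (rl.rlim_cur == RLIM_INFINITY || rl.rlim_cur > (rlim_t)LLONG_MAX)
		return LLONG_MAX;

	return (long long)rl.rlim_cur;
}
```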

    With help from Hiroshi Shimamoto.

    Signed-off-by: Naohiro Ooiwa
    Cc: Andrew Morton
    Cc: Hiroshi Shimamoto
    Cc: Roland McGrath
    Cc: Peter Zijlstra
    Cc: oleg@redhat.com
    LKML-Reference:
    [ Modified a few small details, gave surrounding code some love. ]
    Signed-off-by: Ingo Molnar

    Naohiro Ooiwa
     

24 Sep, 2009

4 commits

  • __fatal_signal_pending inlines to one instruction on x86, probably two
    instructions on other machines. It takes two longer x86 instructions just
    to call it and test its return value, not to mention the function itself.

    On my random x86_64 config, this saved 70 bytes of text (59 of those being
    __fatal_signal_pending itself).

    Signed-off-by: Roland McGrath
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roland McGrath
     
  • Introduce do_send_sig_info() and convert group_send_sig_info(),
    send_sig_info(), do_send_specific() to use this helper.

    Hopefully it will have more users soon; it allows the caller to select
    thread-specific or group-wide behaviour via the "bool group" argument.

    Shaves 80 bytes from .text.
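
    The effect of the "bool group" argument can be pictured with a small
    userspace model, assuming each thread has a private pending mask and
    shares a second mask with its group. All names here (struct sig_group,
    struct sig_thread, send_sig_sketch()) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: one shared mask per group, one private mask per thread. */
struct sig_group {
	unsigned long shared_pending;
};

struct sig_thread {
	unsigned long pending;
	struct sig_group *grp;
};

/*
 * Queue `sig` for thread `t`. With group == true the signal lands in the
 * group-wide mask, otherwise in the thread's private mask. Returns 0 on
 * success and -1 for a signal number this model cannot represent.
 */
int send_sig_sketch(int sig, struct sig_thread *t, bool group)
{
	unsigned long *pending;

	assert(t && t->grp);
	if (sig < 1 || sig > 32)
		return -1;

	pending = group ? &t->grp->shared_pending : &t->pending;
	*pending |= 1UL << (sig - 1);
	return 0;
}
```

    Having one entry point with a flag keeps the thread-specific and
    group-wide callers on the same code path.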

    Signed-off-by: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: stephane eranian
    Cc: Ingo Molnar
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • This changes tracehook_notify_jctl() so it's called with the siglock held,
    and changes its argument and return value definition. These clean-ups
    make it a better fit for what new tracing hooks need to check.

    Tracing needs the siglock here, held from the time TASK_STOPPED was set,
    to avoid potential SIGCONT races if it wants to allow any blocking in its
    tracing hooks.

    This also folds the finish_stop() function into its caller
    do_signal_stop(). The function is short, called only once and only
    unconditionally. It aids readability to fold it in.

    [oleg@redhat.com: do not call tracehook_notify_jctl() in TASK_STOPPED state]
    [oleg@redhat.com: introduce tracehook_finish_jctl() helper]
    Signed-off-by: Roland McGrath
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roland McGrath
     
  • The bug is old, it wasn't caused by recent changes.

    Test case:

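    #include <assert.h>
    #include <pthread.h>
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/ptrace.h>
    #include <sys/wait.h>

    /* Sub-thread: attach to the child, then kill it while it is traced. */
    static void *tfunc(void *arg)
    {
        int pid = (long)arg;

        assert(ptrace(PTRACE_ATTACH, pid, NULL, NULL) == 0);
        kill(pid, SIGKILL);

        sleep(1);
        return NULL;
    }

    int main(void)
    {
        pthread_t th;
        long pid = fork();

        /* Child: wait to be killed. */
        if (!pid)
            pause();

        signal(SIGCHLD, SIG_IGN);
        assert(pthread_create(&th, NULL, tfunc, (void*)pid) == 0);

        /* Main thread: wait only for children of this thread. */
        int r = waitpid(-1, NULL, __WNOTHREAD);
        printf("waitpid: %d %m\n", r);

        return 0;
    }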

    Before the patch this program hangs, after this patch waitpid() correctly
    fails with errno == ECHILD.

    The problem is, __ptrace_detach() reaps the EXIT_ZOMBIE tracee if its
    ->real_parent is our sub-thread and we ignore SIGCHLD. But in this case
    we should wake up other threads which can sleep in do_wait().

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Cc: Vitaly Mayatskikh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

02 Aug, 2009

2 commits

  • The previous commit ("do_sigaltstack: avoid copying 'stack_t' as a
    structure to user space") fixed a real bug. This one just cleans up the
    copy from user space so that gcc can generate better code for it (and so
    that it looks the same as the later copy back to user space).

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Ulrich Drepper correctly points out that there is generally padding in
    the structure on 64-bit hosts, and that copying the structure from
    kernel to user space can leak information from the kernel stack in those
    padding bytes.

    Avoid the whole issue by just copying the three members one by one
    instead, which also means that the function can avoid the need for
    a stack frame. This also happens to match how we copy the new structure
    from user space, so it all even makes sense.

    [ The obvious solution of adding a memset() generates horrid code, gcc
    does really stupid things. ]
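
    The padding and the member-wise copy can be illustrated in userspace.
    This is only a sketch: struct demo_stack is a hypothetical layout
    resembling stack_t (pointer, int, size), and plain assignments stand
    in for the per-member copies to user space.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout resembling stack_t: pointer, int, size. */
struct demo_stack {
	void *sp;
	int flags;	/* on typical 64-bit ABIs, padding follows this member */
	size_t size;
};

/* Bytes of the struct that belong to no member, i.e. padding. */
size_t demo_stack_padding(void)
{
	return sizeof(struct demo_stack) -
	       (sizeof(void *) + sizeof(int) + sizeof(size_t));
}

/*
 * Copy the three members individually rather than as one block, so the
 * result never depends on the contents of the source's padding bytes.
 */
void copy_stack_members(struct demo_stack *dst, const struct demo_stack *src)
{
	assert(dst && src);

	dst->sp = src->sp;
	dst->flags = src->flags;
	dst->size = src->size;
}
```

    On a typical 64-bit ABI demo_stack_padding() is nonzero, and those are
    the bytes a whole-structure copy would carry along.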

    Reported-by: Ulrich Drepper
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

19 Jun, 2009

2 commits

  • If the non-traced sub-thread calls do_notify_parent_cldstop(), we send the
    notification to group_leader->real_parent and we report group_leader's
    pid.

    But, if group_leader is traced we use the wrong ->parent->nsproxy->pid_ns,
    the tracer and parent can live in different namespaces. Change the code
    to use "parent" instead of tsk->parent.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Acked-by: Sukadev Bhattiprolu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • No functional changes.

    - Nobody except ptrace.c & co should use ptrace flags directly, we have
      task_ptrace() for that.

    - No need to specially check PT_PTRACED, we must not have other PT_ bits
      set without PT_PTRACED. And no need to know this flag exists.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

15 Jun, 2009

1 commit

  • This false positive is due to field padding in struct sigqueue. When
    this dynamically allocated structure is copied to the stack (in arch-
    specific delivery code), kmemcheck sees a read from the padding, which
    is, naturally, uninitialized.

    Hide the false positive using the __GFP_NOTRACK_FALSE_POSITIVE flag.
    Also made the rlimit override code a bit clearer by introducing a new
    variable.

    Cc: Oleg Nesterov
    Signed-off-by: Vegard Nossum

    Vegard Nossum
     

12 Jun, 2009

1 commit

  • …s/security-testing-2.6

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (44 commits)
    nommu: Provide mmap_min_addr definition.
    TOMOYO: Add description of lists and structures.
    TOMOYO: Remove unused field.
    integrity: ima audit dentry_open failure
    TOMOYO: Remove unused parameter.
    security: use mmap_min_addr indepedently of security models
    TOMOYO: Simplify policy reader.
    TOMOYO: Remove redundant markers.
    SELinux: define audit permissions for audit tree netlink messages
    TOMOYO: Remove unused mutex.
    tomoyo: avoid get+put of task_struct
    smack: Remove redundant initialization.
    integrity: nfsd imbalance bug fix
    rootplug: Remove redundant initialization.
    smack: do not beyond ARRAY_SIZE of data
    integrity: move ima_counts_get
    integrity: path_check update
    IMA: Add __init notation to ima functions
    IMA: Minimal IMA policy and boot param for TCB IMA policy
    selinux: remove obsolete read buffer limit from sel_read_bool
    ...

    Linus Torvalds
     

11 Jun, 2009

1 commit

  • * 'tracing-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (244 commits)
    Revert "x86, bts: reenable ptrace branch trace support"
    tracing: do not translate event helper macros in print format
    ftrace/documentation: fix typo in function grapher name
    tracing/events: convert block trace points to TRACE_EVENT(), fix !CONFIG_BLOCK
    tracing: add protection around module events unload
    tracing: add trace_seq_vprint interface
    tracing: fix the block trace points print size
    tracing/events: convert block trace points to TRACE_EVENT()
    ring-buffer: fix ret in rb_add_time_stamp
    ring-buffer: pass in lockdep class key for reader_lock
    tracing: add annotation to what type of stack trace is recorded
    tracing: fix multiple use of __print_flags and __print_symbolic
    tracing/events: fix output format of user stack
    tracing/events: fix output format of kernel stack
    tracing/trace_stack: fix the number of entries in the header
    ring-buffer: discard timestamps that are at the start of the buffer
    ring-buffer: try to discard unneeded timestamps
    ring-buffer: fix bug in ring_buffer_discard_commit
    ftrace: do not profile functions when disabled
    tracing: make trace pipe recognize latency format flag
    ...

    Linus Torvalds
     

08 May, 2009

1 commit


01 May, 2009

2 commits

  • sys_kill has the per thread counterpart sys_tgkill. sigqueueinfo is
    missing a thread directed counterpart. Such an interface is important
    for migrating applications from other OSes which have the per thread
    delivery implemented.
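
    From user space, a thread-directed queued signal can be sent through
    the raw system call. The sketch below assumes a Linux system whose
    headers define SYS_rt_tgsigqueueinfo; the helper name queue_to_self()
    is hypothetical. It queues a signal with a payload to the calling
    thread and reads it back synchronously.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Queue `sig` carrying `value` to the calling thread and fetch it again
 * with sigwaitinfo(). Returns the received payload, or -1 on failure.
 */
int queue_to_self(int sig, int value)
{
	sigset_t set, old;
	siginfo_t info, got;
	pid_t tgid = getpid();
	pid_t tid = (pid_t)syscall(SYS_gettid);
	int ret = -1;

	/* Block the signal so it stays pending until we wait for it. */
	sigemptyset(&set);
	sigaddset(&set, sig);
	if (sigprocmask(SIG_BLOCK, &set, &old) != 0)
		return -1;

	memset(&info, 0, sizeof(info));
	info.si_signo = sig;
	info.si_code = SI_QUEUE;	/* negative code: a user-queued signal */
	info.si_pid = tgid;
	info.si_uid = getuid();
	info.si_value.sival_int = value;

	/* Thread-directed send: both the thread group and the thread are named. */
	if (syscall(SYS_rt_tgsigqueueinfo, tgid, tid, sig, &info) == 0 &&
	    sigwaitinfo(&set, &got) == sig)
		ret = got.si_value.sival_int;

	sigprocmask(SIG_SETMASK, &old, NULL);
	return ret;
}
```

    Because the target is a specific thread, only that thread sees the
    signal; other threads in the same process are not candidates for it.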

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Acked-by: Ulrich Drepper

    Thomas Gleixner
     
  • Split out the code from do_tkill to make it reusable by the follow up
    patch which implements sys_rt_tgsigqueueinfo

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Oleg Nesterov

    Thomas Gleixner
     

30 Apr, 2009

1 commit

  • Don't flush inherited SIGKILL during execve() in SELinux's post cred commit
    hook. This isn't really a security problem: if the SIGKILL came before the
    credentials were changed, then we were right to receive it at the time, and
    should honour it; if it came after the creds were changed, then we definitely
    should honour it; and in any case, all that will happen is that the process
    will be scrapped before it ever returns to userspace.

    Signed-off-by: David Howells
    Signed-off-by: Oleg Nesterov
    Signed-off-by: James Morris

    David Howells
     

15 Apr, 2009

2 commits

  • Impact: clean up

    Create a sub directory in include/trace called events to keep the
    trace point headers in their own separate directory. Only headers that
    declare trace points should be defined in this directory.

    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Neil Horman
    Cc: Zhao Lei
    Cc: Eduard - Gabriel Munteanu
    Cc: Pekka Enberg
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • This patch lowers the number of places a developer must modify to add
    new tracepoints. The current method to add a new tracepoint
    into an existing system is to write the trace point macro in the
    trace header with one of the macros TRACE_EVENT, TRACE_FORMAT or
    DECLARE_TRACE, then they must add the same named item into the C file
    with the macro DEFINE_TRACE(name) and then add the trace point.

    This change removes the need to add the DEFINE_TRACE(name).
    Every file that uses the tracepoint must still include the trace/.h
    file, but the one C file must also add a define before the including
    of that file.

    #define CREATE_TRACE_POINTS
    #include <trace/mytrace.h>

    This will cause the trace/mytrace.h file to also produce the C code
    necessary to implement the trace point.

    Note, if more than one trace/.h is used to create the C code
    it is best to list them all together.

    #define CREATE_TRACE_POINTS
    #include
    #include
    #include

    Thanks to Mathieu Desnoyers and Christoph Hellwig for coming up with
    the cleaner solution of the define above the includes over my first
    design to have the C code include a "special" header.

    This patch converts sched, irq and lockdep and skb to use this new
    method.

    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Neil Horman
    Cc: Zhao Lei
    Cc: Eduard - Gabriel Munteanu
    Cc: Pekka Enberg
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     

03 Apr, 2009

4 commits

  • When sending a signal to a descendant namespace, set ->si_pid to 0 since
    the sender does not have a pid in the receiver's namespace.

    Note:
    - If rt_sigqueueinfo() sets si_code to SI_USER when sending a
    signal across a pid namespace boundary, the value in ->si_pid
    will be cleared to 0.

    Signed-off-by: Sukadev Bhattiprolu
    Cc: Oleg Nesterov
    Cc: Roland McGrath
    Cc: "Eric W. Biederman"
    Cc: Daniel Lezcano
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     
  • Normally SIG_DFL signals to global and container-init are dropped early.
    But if a signal is blocked when it is posted, we cannot drop the signal
    since the receiver may install a handler before unblocking the signal.
    Once this signal is queued however, the receiver container-init has no way
    of knowing if the signal was sent from an ancestor or descendant
    namespace. This patch ensures that container-init drops all SIG_DFL
    signals in get_signal_to_deliver() except SIGKILL/SIGSTOP.

    If SIGSTOP/SIGKILL originate from a descendant of container-init they are
    never queued (i.e. dropped in sig_ignored() in an earlier patch).

    If SIGSTOP/SIGKILL originate from parent namespace, the signal is queued
    and container-init processes the signal.

    IOW, if get_signal_to_deliver() sees a sig_kernel_only() signal for global
    or container-init, the signal must have been generated internally or must
    have come from an ancestor ns and we process the signal.

    Further, the signal_group_exit() check was needed to cover the case of a
    multi-threaded init sending SIGKILL to other threads when doing an exit()
    or exec(). But since the new sig_kernel_only() check covers the SIGKILL,
    the signal_group_exit() check is no longer needed and can be removed.

    Finally, now that we have all pieces in place, set SIGNAL_UNKILLABLE for
    container-inits.

    Signed-off-by: Sukadev Bhattiprolu
    Cc: Oleg Nesterov
    Cc: Roland McGrath
    Cc: "Eric W. Biederman"
    Cc: Daniel Lezcano
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     
  • Drop early any SIG_DFL or SIG_IGN signals to container-init from within
    the same container. But queue SIGSTOP and SIGKILL to the container-init
    if they are from an ancestor container.

    Blocked, fatal signals (i.e. when SIG_DFL is to terminate) from within the
    container can still terminate the container-init. That will be addressed
    in the next patch.

    Note: To be bisect-safe, SIGNAL_UNKILLABLE will be set for container-inits
    in a follow-on patch. Until then, this patch is just a preparatory
    step.

    Signed-off-by: Sukadev Bhattiprolu
    Cc: Oleg Nesterov
    Cc: Roland McGrath
    Cc: "Eric W. Biederman"
    Cc: Daniel Lezcano
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     
  • send_signal() (or its helper) needs to determine the pid namespace of the
    sender. But a signal sent via kill_pid_info_as_uid() comes from within
    the kernel and send_signal() does not need to determine the pid namespace
    of the sender. So define a helper for send_signal() which takes an
    additional parameter, 'from_ancestor_ns' and have kill_pid_info_as_uid()
    use that helper directly.

    The 'from_ancestor_ns' parameter will be used in a follow-on patch.

    Signed-off-by: Sukadev Bhattiprolu
    Cc: Oleg Nesterov
    Cc: Roland McGrath
    Cc: "Eric W. Biederman"
    Cc: Daniel Lezcano
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu