19 Aug, 2019

1 commit

  • My recent change to only use force_sig for synchronous events
    wound up breaking signal reception in cifs and drbd. I had overlooked
    the fact that by default kthreads start out with all signals set to
    SIG_IGN. So a change I thought was safe turned out to have made it
    impossible for those kernel threads to catch their signals.

    Reverting the work on force_sig is a bad idea because what the code
    was doing was very much a misuse of force_sig: the way force_sig
    ultimately allowed the signal to happen was to change the signal
    handler to SIG_DFL, which after the first signal will allow userspace
    to send signals to these kernel threads. At least for
    wake_ack_receiver in drbd that does not appear actively wrong.

    So correct this problem by adding allow_kernel_signal that will allow
    signals whose siginfo reports they were sent by the kernel through,
    but will not allow userspace generated signals, and update cifs and
    drbd to call allow_kernel_signal in an appropriate place so that their
    thread can receive this signal.

    Fixing things this way ensures that userspace won't be able to send
    signals and cause problems, that it is clear which signals the
    threads are expecting to receive, and it guarantees that nothing
    else in the system will be affected.

    This change was partly inspired by similar cifs and drbd patches that
    added allow_signal.

    Reported-by: ronnie sahlberg
    Reported-by: Christoph Böhmwalder
    Tested-by: Christoph Böhmwalder
    Cc: Steve French
    Cc: Philipp Reisner
    Cc: David Laight
    Fixes: 247bc9470b1e ("cifs: fix rmmod regression in cifs.ko caused by force_sig changes")
    Fixes: 72abe3bcf091 ("signal/cifs: Fix cifs_put_tcp_session to call send_sig instead of force_sig")
    Fixes: fee109901f39 ("signal/drbd: Use send_sig not force_sig")
    Fixes: 3cf5d076fb4d ("signal: Remove task parameter from force_sig")
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
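
The delivery rule described in this message can be sketched as a small userspace model. This is not kernel code: the enum, its values, and the function name are hypothetical, and the model only covers the three states the message mentions (signals ignored by default, kernel-only delivery after allow_kernel_signal, and delivery from any sender after allow_signal).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical handler states for one signal of a kernel thread. */
enum kthread_sig_state {
    KSIG_IGNORED,     /* default: kthreads start with signals set to SIG_IGN */
    KSIG_KERNEL_ONLY, /* models the effect of allow_kernel_signal() */
    KSIG_ANY_SENDER   /* models the effect of allow_signal() */
};

/* Returns true if the modeled kthread would receive the signal. */
static bool kthread_signal_delivered(enum kthread_sig_state state,
                                     bool sent_by_kernel)
{
    switch (state) {
    case KSIG_IGNORED:
        return false;
    case KSIG_KERNEL_ONLY:
        /* Only signals whose siginfo says "sent by the kernel" pass. */
        return sent_by_kernel;
    case KSIG_ANY_SENDER:
        return true;
    }
    assert(!"unknown state");
    return false;
}
```

In this model a userspace-generated signal reaches the thread only in the KSIG_ANY_SENDER state, which is the exposure the commit avoids for cifs and drbd.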
     

03 Aug, 2019

1 commit

  • The kernel-doc parser doesn't handle expressions with %foo*. Instead,
    when an asterisk should be part of a constant, it uses an alternative
    notation: `foo*`.

    Link: http://lkml.kernel.org/r/7f18c2e0b5e39e6b7eb55ddeb043b8b260b49f2d.1563361575.git.mchehab+samsung@kernel.org
    Signed-off-by: Mauro Carvalho Chehab
    Cc: Deepa Dinamani
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mauro Carvalho Chehab
     

29 Jul, 2019

1 commit

  • Previously a condition got missed where the pidfd waiters are awakened
    before the exit_state gets set. This can result in a missed notification
    [1] and the polling thread waiting forever.

    It is fixed now, but it would be nice to keep this kind of issue from
    going unnoticed again. So just add a warning to catch it in the
    future.

    /* References */
    [1]: https://lore.kernel.org/lkml/20190717172100.261204-1-joel@joelfernandes.org/

    Signed-off-by: Joel Fernandes (Google)
    Link: https://lore.kernel.org/r/20190724164816.201099-1-joel@joelfernandes.org
    Signed-off-by: Christian Brauner

    Joel Fernandes (Google)
     

17 Jul, 2019

1 commit

  • task->saved_sigmask and ->restore_sigmask are only used in the ret-from-
    syscall paths. This means that set_user_sigmask() can save ->blocked in
    ->saved_sigmask and do set_restore_sigmask() to indicate that ->blocked
    was modified.

    This way the callers do not need 2 sigset_t's passed to set/restore and
    restore_user_sigmask() renamed to restore_saved_sigmask_unless() turns
    into the trivial helper which just calls restore_saved_sigmask().

    Link: http://lkml.kernel.org/r/20190606113206.GA9464@redhat.com
    Signed-off-by: Oleg Nesterov
    Cc: Deepa Dinamani
    Cc: Arnd Bergmann
    Cc: Jens Axboe
    Cc: Davidlohr Bueso
    Cc: Eric Wong
    Cc: Jason Baron
    Cc: Thomas Gleixner
    Cc: Al Viro
    Cc: Eric W. Biederman
    Cc: David Laight
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
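
A minimal userspace model of the bookkeeping described in this message, assuming a signal mask fits in a 64-bit word. The struct and the model_ function names are hypothetical stand-ins for the task fields and helpers named above, not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-task signal mask state. */
struct task_model {
    uint64_t blocked;       /* currently blocked signals */
    uint64_t saved_sigmask; /* mask to put back later */
    bool restore_sigmask;   /* set when ->blocked was replaced */
};

/* Install a temporary mask, remembering the old one in the task itself,
 * so callers no longer pass a second sigset around. */
static void model_set_user_sigmask(struct task_model *t, uint64_t newmask)
{
    assert(t);
    t->saved_sigmask = t->blocked;
    t->blocked = newmask;
    t->restore_sigmask = true;
}

/* Put the saved mask back unless the call was interrupted by a signal.
 * Assumption of this model: when interrupted, the saved mask stays in
 * place so the signal-delivery path can deal with it. */
static void model_restore_saved_sigmask_unless(struct task_model *t,
                                               bool interrupted)
{
    assert(t);
    if (interrupted || !t->restore_sigmask)
        return;
    t->restore_sigmask = false;
    t->blocked = t->saved_sigmask;
}
```

A caller in this model sets the temporary mask, performs its wait, and then calls the restore helper with a flag saying whether the wait was interrupted.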
     

11 Jul, 2019

1 commit

  • Pull pidfd updates from Christian Brauner:
    "This adds two main features.

    - First, it adds polling support for pidfds. This allows process
    managers to know when a (non-parent) process dies in a race-free
    way.

    The notification mechanism used follows the same logic that is
    currently used when the parent of a task is notified of a child's
    death. With this patchset it is possible to put pidfds in an
    {e}poll loop and get reliable notifications for process (i.e.
    thread-group) exit.

    - The second feature complements the first one by making it possible
    to retrieve pollable pidfds for processes that were not created
    using CLONE_PIDFD.

    A lot of processes get created with traditional PID-based calls
    such as fork() or clone() (without CLONE_PIDFD). For these
    processes a caller can currently not create a pollable pidfd. This
    is a problem for Android's low memory killer (LMK) and service
    managers such as systemd.

    Both patchsets are accompanied by selftests.

    It's perhaps worth noting that the work done so far and the work done
    in this branch for pidfd_open() and polling support do already see
    some adoption:

    - Android is in the process of backporting this work to all their LTS
    kernels [1]

    - Service managers make use of pidfd_send_signal but will need to
    wait until we enable waiting on pidfds for full adoption.

    - And projects I maintain make use of both pidfd_send_signal and
    CLONE_PIDFD [2] and will use polling support and pidfd_open() too"

    [1] https://android-review.googlesource.com/q/topic:%22pidfd+polling+support+4.9+backport%22
    https://android-review.googlesource.com/q/topic:%22pidfd+polling+support+4.14+backport%22
    https://android-review.googlesource.com/q/topic:%22pidfd+polling+support+4.19+backport%22

    [2] https://github.com/lxc/lxc/blob/aab6e3eb73c343231cdde775db938994fc6f2803/src/lxc/start.c#L1753

    * tag 'pidfd-updates-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    tests: add pidfd_open() tests
    arch: wire-up pidfd_open()
    pid: add pidfd_open()
    pidfd: add polling selftests
    pidfd: add polling support

    Linus Torvalds
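
As an illustration of the two features in this pull, the sketch below obtains a pidfd for an existing process with pidfd_open(), signals it through the pidfd, and polls the pidfd for exit. It assumes a kernel that provides both syscalls; the fallback syscall numbers are the x86-64 and asm-generic values and are only used when the libc headers lack them. The helper name is hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <poll.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumption: x86-64 / asm-generic syscall numbers, for older headers. */
#ifndef SYS_pidfd_send_signal
#define SYS_pidfd_send_signal 424
#endif
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif

/*
 * Kill `pid` and wait until its exit is reported through a pidfd.
 * Returns 1 if the exit was observed, 0 if the pidfd syscalls are not
 * usable here (old kernel, sandbox), -1 on timeout or poll() failure.
 */
static int kill_and_wait_via_pidfd(pid_t pid, int timeout_ms)
{
    assert(pid > 0);

    int pidfd = (int)syscall(SYS_pidfd_open, (long)pid, 0L);
    if (pidfd < 0)
        return 0;

    /* Signal through the pidfd so a recycled PID cannot be hit. */
    if (syscall(SYS_pidfd_send_signal, (long)pidfd, (long)SIGKILL,
                NULL, 0L) < 0) {
        close(pidfd);
        return 0;
    }

    /* The pidfd becomes readable once the process has exited. */
    struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
    int n = poll(&pfd, 1, timeout_ms);
    close(pidfd);

    return (n == 1 && (pfd.revents & POLLIN)) ? 1 : -1;
}
```

The same pollfd entry could sit in a larger poll or epoll set next to other descriptors, which is the use case the message gives for process managers.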
     

09 Jul, 2019

2 commits

  • …iederm/user-namespace

    Pull force_sig() argument change from Eric Biederman:
    "A source of error over the years has been that force_sig has taken a
    task parameter when it is only safe to use force_sig with the current
    task.

    The force_sig function is built for delivering synchronous signals
    such as SIGSEGV where the userspace application caused a synchronous
    fault (such as a page fault) and the kernel responded with a signal.

    Because the name force_sig does not make this clear, and because the
    force_sig takes a task parameter the function force_sig has been
    abused for sending other kinds of signals over the years. Slowly those
    have been fixed when the oopses have been tracked down.

    This set of changes fixes the remaining abusers of force_sig and
    carefully rips out the task parameter from force_sig and friends
    making this kind of error almost impossible in the future"

    * 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (27 commits)
    signal/x86: Move tsk inside of CONFIG_MEMORY_FAILURE in do_sigbus
    signal: Remove the signal number and task parameters from force_sig_info
    signal: Factor force_sig_info_to_task out of force_sig_info
    signal: Generate the siginfo in force_sig
    signal: Move the computation of force into send_signal and correct it.
    signal: Properly set TRACE_SIGNAL_LOSE_INFO in __send_signal
    signal: Remove the task parameter from force_sig_fault
    signal: Use force_sig_fault_to_task for the two calls that don't deliver to current
    signal: Explicitly call force_sig_fault on current
    signal/unicore32: Remove tsk parameter from __do_user_fault
    signal/arm: Remove tsk parameter from __do_user_fault
    signal/arm: Remove tsk parameter from ptrace_break
    signal/nds32: Remove tsk parameter from send_sigtrap
    signal/riscv: Remove tsk parameter from do_trap
    signal/sh: Remove tsk parameter from force_sig_info_fault
    signal/um: Remove task parameter from send_sigtrap
    signal/x86: Remove task parameter from send_sigtrap
    signal: Remove task parameter from force_sig_mceerr
    signal: Remove task parameter from force_sig
    signal: Remove task parameter from force_sigsegv
    ...

    Linus Torvalds
     
  • Pull audit updates from Paul Moore:
    "This pull request is a bit early, but with some vacation time coming
    up I wanted to send this out now just in case the remote Internet Gods
    decide not to smile on me once the merge window opens. The patchset
    for v5.3 is pretty minor this time, the highlights include:

    - When the audit daemon is sent a signal, ensure we deliver
    information about the sender even when syscall auditing is not
    enabled/supported.

    - Add the ability to filter audit records based on network address
    family.

    - Tighten the audit field filtering restrictions on string based
    fields.

    - Cleanup the audit field filtering verification code.

    - Remove a few BUG() calls from the audit code"

    * tag 'audit-pr-20190702' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
    audit: remove the BUG() calls in the audit rule comparison functions
    audit: enforce op for string fields
    audit: add saddr_fam filter field
    audit: re-structure audit field valid checks
    audit: deliver signal_info regarless of syscall

    Linus Torvalds
     

29 Jun, 2019

1 commit

  • This is the minimal fix for stable, I'll send cleanups later.

    Commit 854a6ed56839 ("signal: Add restore_user_sigmask()") introduced
    the visible change which breaks user-space: a signal temporarily unblocked
    by set_user_sigmask() can be delivered even if the caller returns
    success or timeout.

    Change restore_user_sigmask() to accept the additional "interrupted"
    argument which should be used instead of signal_pending() check, and
    update the callers.

    Eric said:

    : For clarity. I don't think this is required by posix, or fundamentally to
    : remove the races in select. It is what linux has always done and we have
    : applications who care so I agree this fix is needed.
    :
    : Further in any case where the semantic change that this patch rolls back
    : (aka where allowing a signal to be delivered and the select like call to
    : complete) would be advantage we can do as well if not better by using
    : signalfd.
    :
    : Michael is there any chance we can get this guarantee of the linux
    : implementation of pselect and friends clearly documented. The guarantee
    : that if the system call completes successfully we are guaranteed that no
    : signal that is unblocked by using sigmask will be delivered?

    Link: http://lkml.kernel.org/r/20190604134117.GA29963@redhat.com
    Fixes: 854a6ed56839a40f6b5d02a2962f48841482eec4 ("signal: Add restore_user_sigmask()")
    Signed-off-by: Oleg Nesterov
    Reported-by: Eric Wong
    Tested-by: Eric Wong
    Acked-by: "Eric W. Biederman"
    Acked-by: Arnd Bergmann
    Acked-by: Deepa Dinamani
    Cc: Michael Kerrisk
    Cc: Jens Axboe
    Cc: Davidlohr Bueso
    Cc: Jason Baron
    Cc: Thomas Gleixner
    Cc: Al Viro
    Cc: David Laight
    Cc: [5.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

28 Jun, 2019

1 commit

  • This patch adds polling support to pidfd.

    Android low memory killer (LMK) needs to know when a process dies once
    it is sent the kill signal. It does so by checking for the existence of
    /proc/pid which is both racy and slow. For example, if a PID is reused
    between when LMK sends a kill signal and when it checks for existence of
    the PID, the wrong process is possibly checked for existence.
    Using the polling support, LMK will be able to get notified when a process
    exits in a race-free and fast way, and it allows the LMK to do other things
    (such as polling on other fds) while awaiting the process being killed
    to die.

    For notification to polling processes, we follow the same existing
    mechanism in the kernel used when the parent of the task group is to be
    notified of a child's death (do_notify_parent). This is precisely when the
    tasks waiting on a poll of pidfd are also awakened in this patch.

    We have decided to include the waitqueue in struct pid for the following
    reasons:
    1. The wait queue has to survive for the lifetime of the poll. Including
    it in task_struct would not be an option in this case because the task can
    be reaped and destroyed before the poll returns.

    2. Including the waitqueue in struct pid means that during
    de_thread(), the new thread group leader automatically gets the new
    waitqueue/pid even though its task_struct is different.

    Appropriate test cases are added in the second patch to provide coverage of
    all the cases the patch is handling.

    Cc: Andy Lutomirski
    Cc: Steven Rostedt
    Cc: Daniel Colascione
    Cc: Jann Horn
    Cc: Tim Murray
    Cc: Jonathan Kowalski
    Cc: Linus Torvalds
    Cc: Al Viro
    Cc: Kees Cook
    Cc: David Howells
    Cc: Oleg Nesterov
    Cc: kernel-team@android.com
    Reviewed-by: Oleg Nesterov
    Co-developed-by: Daniel Colascione
    Signed-off-by: Daniel Colascione
    Signed-off-by: Joel Fernandes (Google)
    Signed-off-by: Christian Brauner

    Joel Fernandes (Google)
     

05 Jun, 2019

1 commit

  • Improve the comments for pidfd_send_signal().
    First, the comment still referred to a file descriptor for a process as a
    "task file descriptor" which stems from way back at the beginning of the
    discussion. Replace this with "pidfd" for consistency.
    Second, the wording for the explanation of the arguments to the syscall
    was a bit inconsistent, e.g. some used the past tense, some used the
    present tense. Make the wording more consistent.

    Signed-off-by: Christian Brauner

    Christian Brauner
     

02 Jun, 2019

1 commit

  • In the fixes commit, removing SIGKILL from each thread signal mask and
    executing "goto fatal" directly will skip the call to
    "trace_signal_deliver". At this point, the delivery tracking of the
    SIGKILL signal will be inaccurate.

    Therefore, we need to add trace_signal_deliver before "goto fatal" after
    executing sigdelset.

    Note: SEND_SIG_NOINFO matches the fact that SIGKILL doesn't have any info.

    Link: http://lkml.kernel.org/r/20190425025812.91424-1-weizhenliang@huawei.com
    Fixes: cf43a757fd4944 ("signal: Restore the stop PTRACE_EVENT_EXIT")
    Signed-off-by: Zhenliang Wei
    Reviewed-by: Christian Brauner
    Reviewed-by: Oleg Nesterov
    Cc: Eric W. Biederman
    Cc: Ivan Delalande
    Cc: Arnd Bergmann
    Cc: Thomas Gleixner
    Cc: Deepa Dinamani
    Cc: Greg Kroah-Hartman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhenliang Wei
     

29 May, 2019

7 commits

  • force_sig_info always delivers to the current task and the signal
    parameter always matches info.si_signo. So remove those parameters to
    make it a simpler less error prone interface, and to make it clear
    that none of the callers are doing anything clever.

    This guarantees that force_sig_info will not grow any new buggy
    callers that attempt to call force_sig on a non-current task, or that
    pass a signal number that does not match info.si_signo.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • All callers of force_sig_info pass info.si_signo in for the signal
    by definition as well as in practice.

    Further all callers of force_sig_info except force_sig_fault_to_task
    pass current as the target task to force_sig_info.

    Factor out a static force_sig_info_to_task that
    force_sig_fault_to_task can call.

    This prepares the way for force_sig_info to have its task and signal
    parameters removed.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • In preparation for removing the special case in force_sig_info for
    only having a signal number, generate an appropriate siginfo in
    force_sig, the last caller of force_sig_info that does not
    pass a filled out siginfo.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Forcing a signal or not allowing a pid namespace init to ignore
    SIGKILL or SIGSTOP is more cleanly computed in send_signal.

    There are two cases where we don't allow a pid namespace init
    to ignore SIGKILL or SIGSTOP. If the sending process is
    from an ancestor pid namespace and as such is effectively
    the god to the target process, and if it is the kernel
    that is sending the signal, not another application.

    It is known that a process is from an ancestor pid namespace if
    it can see its target but its target does not have a pid for
    the sender in its pid namespace.

    It is known that a signal is sent from the kernel if si_code is set to
    SI_KERNEL or info is SEND_SIG_PRIV (which ultimately generates
    a signal with si_code == SI_KERNEL).

    The only signals that matter are SIGKILL and SIGSTOP neither of
    which can really be caught, and both of which always have a siginfo
    layout that includes si_uid and si_pid. Therefore we never need
    to worry about forcing a signal when si_pid and si_uid are absent.

    So handle the two special cases of info and the case when si_pid and
    si_uid are present.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Any time siginfo is not stored in the signal queue information is
    lost. Therefore set TRACE_SIGNAL_LOSE_INFO every time the code does
    not allocate a signal queue entry, and a queue overflow abort is not
    triggered.

    Fixes: ba005e1f4172 ("tracepoint: Add signal loss events")
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • As synchronous exceptions really only make sense against the current
    task (otherwise how are you synchronous) remove the task parameter
    from force_sig_fault to make it explicit that is what is going
    on.

    The two known exceptions that deliver a synchronous exception to a
    stopped ptraced task have already been changed to
    force_sig_fault_to_task.

    The callers have been changed with the following emacs regular expression
    (with obvious variations on the architectures that take more arguments)
    to avoid typos:

    force_sig_fault[(]\([^,]+\)[,]\([^,]+\)[,]\([^,]+\)[,]\W+current[)]
    ->
    force_sig_fault(\1,\2,\3)

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • In preparation for removing the task parameter from force_sig_fault
    introduce force_sig_fault_to_task and use it for the two cases where
    it matters.

    On mips force_fcr31_sig calls force_sig_fault and is called on either
    the current task, or a task that is suspended and is being switched to
    by the scheduler. This is safe because the task being switched to by
    the scheduler is guaranteed to be suspended. This ensures that
    task->sighand is stable while the signal is delivered to it.

    On parisc user_enable_single_step calls force_sig_fault and is in turn
    called by ptrace_request. The function ptrace_request always calls
    user_enable_single_step on a child that is stopped for tracing. The
    child being traced and not reaped ensures that child->sighand is not
    NULL, and that the child will not change child->sighand.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
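
The commit in this group that moves the computation of force into send_signal describes a small decision rule. A userspace model of that rule is sketched below; the struct, enum, and function names are hypothetical, and the booleans stand in for checks the kernel performs on the real siginfo and pid namespaces.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical description of what was passed as "info". */
enum info_kind {
    INFO_NOINFO, /* models SEND_SIG_NOINFO */
    INFO_PRIV,   /* models SEND_SIG_PRIV */
    INFO_FULL    /* a filled-in siginfo */
};

struct send_request {
    enum info_kind kind;
    bool has_pid_and_uid;             /* layout carries si_pid and si_uid */
    bool si_code_is_kernel;           /* si_code == SI_KERNEL */
    bool sender_has_pid_in_target_ns; /* false: sender is in an ancestor pid ns */
};

/* Should a pid namespace init be prevented from ignoring SIGKILL/SIGSTOP? */
static bool model_compute_force(const struct send_request *r)
{
    assert(r);
    switch (r->kind) {
    case INFO_NOINFO:
        /* Forced only when sent from an ancestor pid namespace. */
        return !r->sender_has_pid_in_target_ns;
    case INFO_PRIV:
        /* Kernel-generated signals are always forced. */
        return true;
    case INFO_FULL:
        if (!r->has_pid_and_uid)
            return false;
        return r->si_code_is_kernel || !r->sender_has_pid_in_target_ns;
    }
    return false;
}
```

The three branches correspond to the two special values of info and the case where si_pid and si_uid are present, as listed at the end of that commit message.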
     

27 May, 2019

3 commits


23 May, 2019

2 commits

  • The function send_signal was split from __send_signal so that it would
    be possible to bypass the namespace logic based upon current[1]. As it
    turns out the si_pid and the si_uid fixup are both inappropriate in
    the case of kill_pid_usb_asyncio so move that logic into send_signal.

    It is difficult to arrange but possible for a signal with an si_code
    of SI_TIMER or SI_SIGIO to be sent across namespace boundaries. In
    which case tests for when it is ok to change si_pid and si_uid based
    on SI_FROMUSER are incorrect. Replace the use of SI_FROMUSER with a
    new test has_si_pid_and_uid based on siginfo_layout.

    Now that the uid fixup is no longer present after expanding
    SEND_SIG_NOINFO properly calculate the si_uid that the target
    task needs to read.

    [1] 7978b567d315 ("signals: add from_ancestor_ns parameter to send_signal()")
    Cc: stable@vger.kernel.org
    Fixes: 6588c1e3ff01 ("signals: SI_USER: Masquerade si_pid when crossing pid ns boundary")
    Fixes: 6b550f949594 ("user namespace: make signal.c respect user namespaces")
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • The usb support for asyncio encoded one of its values in the wrong
    field. It should have used si_value but instead used si_addr which is
    not present in the _rt union member of struct siginfo.

    The practical result of this is that on a 64bit big endian kernel
    when delivering a signal to a 32bit process the si_addr field
    is set to NULL, instead of the expected pointer value.

    This issue can not be fixed in copy_siginfo_to_user32 as the usb
    usage of the _sigfault (aka si_addr) member of the siginfo
    union when SI_ASYNCIO is set is incompatible with the POSIX and
    glibc usage of the _rt member of the siginfo union.

    Therefore replace kill_pid_info_as_cred with kill_pid_usb_asyncio a
    dedicated function for this one specific case. There are no other
    users of kill_pid_info_as_cred so this specialization should have no
    impact on the amount of code in the kernel. Have kill_pid_usb_asyncio
    take, instead of a siginfo_t which is difficult and error prone, 3
    arguments: a signal number, an errno value, and an address encoded as
    a sigval_t. The encoding of the address as a sigval_t allows the
    code that reads the userspace request for a signal to handle this
    compat issue along with all of the other compat issues.

    Add BUILD_BUG_ONs in kernel/signal.c to ensure that we can now place
    the pointer value in si_pid (instead of si_addr). That is, the
    code now verifies that si_pid and si_addr always occur at the same
    location. Further, the code verifies that for native structures a value
    placed in si_pid and spilling into si_uid will appear in userspace in
    si_addr (on a byte by byte copy of siginfo or a field by field copy of
    siginfo). The code also verifies that for a 64bit kernel and a 32bit
    userspace the 32bit pointer will fit in si_pid.

    I have used the usbsig.c program below written by Alan Stern and
    slightly tweaked by me to run on a big endian machine to verify the
    issue exists (on sparc64) and to confirm the patch below fixes the issue.

    /* usbsig.c -- test USB async signal delivery */

    #define _GNU_SOURCE
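    #include <stdio.h>
    #include <fcntl.h>
    #include <signal.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <endian.h>
    #include <linux/usb/ch9.h>
    #include <linux/usbdevice_fs.h>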

    static struct usbdevfs_urb urb;
    static struct usbdevfs_disconnectsignal ds;
    static volatile sig_atomic_t done = 0;

    void urb_handler(int sig, siginfo_t *info , void *ucontext)
    {
        printf("Got signal %d, signo %d errno %d code %d addr: %p urb: %p\n",
               sig, info->si_signo, info->si_errno, info->si_code,
               info->si_addr, &urb);

        printf("%s\n", (info->si_addr == &urb) ? "Good" : "Bad");
    }

    void ds_handler(int sig, siginfo_t *info , void *ucontext)
    {
        printf("Got signal %d, signo %d errno %d code %d addr: %p ds: %p\n",
               sig, info->si_signo, info->si_errno, info->si_code,
               info->si_addr, &ds);

        printf("%s\n", (info->si_addr == &ds) ? "Good" : "Bad");
        done = 1;
    }

    int main(int argc, char **argv)
    {
        char *devfilename;
        int fd;
        int rc;
        struct sigaction act;
        struct usb_ctrlrequest *req;
        void *ptr;
        char buf[80];

        if (argc != 2) {
            fprintf(stderr, "Usage: usbsig device-file-name\n");
            return 1;
        }

        devfilename = argv[1];
        fd = open(devfilename, O_RDWR);
        if (fd == -1) {
            perror("Error opening device file");
            return 1;
        }

        act.sa_sigaction = urb_handler;
        sigemptyset(&act.sa_mask);
        act.sa_flags = SA_SIGINFO;

        rc = sigaction(SIGUSR1, &act, NULL);
        if (rc == -1) {
            perror("Error in sigaction");
            return 1;
        }

        act.sa_sigaction = ds_handler;
        sigemptyset(&act.sa_mask);
        act.sa_flags = SA_SIGINFO;

        rc = sigaction(SIGUSR2, &act, NULL);
        if (rc == -1) {
            perror("Error in sigaction");
            return 1;
        }

        memset(&urb, 0, sizeof(urb));
        urb.type = USBDEVFS_URB_TYPE_CONTROL;
        urb.endpoint = USB_DIR_IN | 0;
        urb.buffer = buf;
        urb.buffer_length = sizeof(buf);
        urb.signr = SIGUSR1;

        req = (struct usb_ctrlrequest *) buf;
        req->bRequestType = USB_DIR_IN | USB_TYPE_STANDARD | USB_RECIP_DEVICE;
        req->bRequest = USB_REQ_GET_DESCRIPTOR;
        req->wValue = htole16(USB_DT_DEVICE << 8);
        req->wIndex = htole16(0);
        req->wLength = htole16(sizeof(buf) - sizeof(*req));

        rc = ioctl(fd, USBDEVFS_SUBMITURB, &urb);
        if (rc == -1) {
            perror("Error in SUBMITURB ioctl");
            return 1;
        }

        rc = ioctl(fd, USBDEVFS_REAPURB, &ptr);
        if (rc == -1) {
            perror("Error in REAPURB ioctl");
            return 1;
        }

        memset(&ds, 0, sizeof(ds));
        ds.signr = SIGUSR2;
        ds.context = &ds;
        rc = ioctl(fd, USBDEVFS_DISCSIGNAL, &ds);
        if (rc == -1) {
            perror("Error in DISCSIGNAL ioctl");
            return 1;
        }

        printf("Waiting for usb disconnect\n");
        while (!done) {
            sleep(1);
        }

        close(fd);
        return 0;
    }

    Cc: Greg Kroah-Hartman
    Cc: linux-usb@vger.kernel.org
    Cc: Alan Stern
    Cc: Oliver Neukum
    Fixes: v2.3.39
    Cc: stable@vger.kernel.org
    Acked-by: Alan Stern
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

22 May, 2019

1 commit

  • When a process signals the audit daemon (shutdown, rotate, resume,
    reconfig) but syscall auditing is not enabled, we still want to know the
    identity of the process sending the signal to the audit daemon.

    Move audit_signal_info() out of syscall auditing to general auditing but
    create a new function audit_signal_info_syscall() to take care of the
    syscall dependent parts for when syscall auditing is enabled.

    Please see the github kernel audit issue
    https://github.com/linux-audit/audit-kernel/issues/111

    Signed-off-by: Richard Guy Briggs
    Signed-off-by: Paul Moore

    Richard Guy Briggs
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

17 May, 2019

1 commit

  • Alex Xu reported a regression in strace, caused by the introduction of
    the cgroup v2 freezer. The regression can be reproduced by stracing
    the following simple program:

    #include <unistd.h>

    int main() {
        write(1, "a", 1);
        return 0;
    }

    An attempt to run strace ./a.out leads to the infinite loop:
    [ pre-main omitted ]
    write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
    write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
    write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
    write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
    write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
    write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
    [ repeats forever ]

    The problem occurs because the traced task leaves ptrace_stop()
    (and the signal handling loop) with the frozen bit set. So let's
    call cgroup_leave_frozen(true) unconditionally after sleeping
    in ptrace_stop().

    With this patch applied, strace works as expected:
    [ pre-main omitted ]
    write(1, "a", 1) = 1
    exit_group(0) = ?
    +++ exited with 0 +++

    Reported-by: Alex Xu
    Fixes: 76f969e8948d ("cgroup: cgroup v2 freezer")
    Signed-off-by: Roman Gushchin
    Acked-by: Oleg Nesterov
    Cc: Tejun Heo
    Signed-off-by: Tejun Heo

    Roman Gushchin
     

15 May, 2019

1 commit

  • There is a plan to build the kernel with -Wimplicit-fallthrough and this
    place in the code produced a warning (W=1).

    This commit removes the following warning:

    kernel/signal.c:795:13: warning: this statement may fall through [-Wimplicit-fallthrough=]

    Link: http://lkml.kernel.org/r/20190114203505.17875-1-malat@debian.org
    Signed-off-by: Mathieu Malaterre
    Acked-by: Gustavo A. R. Silva
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mathieu Malaterre
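
For readers unfamiliar with the warning, the sketch below shows the kind of intentional fall-through that -Wimplicit-fallthrough flags and one way to annotate it. The function is a hypothetical example, not the code at kernel/signal.c:795, and it assumes GCC's default behaviour of accepting a "fall through" comment placed directly before the next label.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdbool.h>

/* Hypothetical permission check: SIGCONT is allowed within the same
 * session; every other case is refused. */
static int model_check_permission(int sig, bool same_session)
{
    assert(sig > 0);

    switch (sig) {
    case SIGCONT:
        if (same_session)
            break;
        /* fall through */
    default:
        return -EPERM;
    }
    return 0;
}
```

Without the comment, a compiler run with -Wimplicit-fallthrough would report that the SIGCONT case may fall through into the default case.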
     

10 May, 2019

1 commit

  • Pull cgroup updates from Tejun Heo:
    "This includes Roman's cgroup2 freezer implementation.

    It's a separate mechanism from cgroup1 freezer. Instead of blocking
    user tasks in arbitrary uninterruptible sleeps, the new implementation
    extends jobctl stop - frozen tasks are trapped in jobctl stop until
    thawed and can be killed and ptraced. Lots of thanks to Oleg for
    shepherding the effort.

    Other than that, there are a few trivial changes"

    * 'for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: never call do_group_exit() with task->frozen bit set
    kernel: cgroup: fix misuse of %x
    cgroup: get rid of cgroup_freezer_frozen_exit()
    cgroup: prevent spurious transition into non-frozen state
    cgroup: Remove unused cgrp variable
    cgroup: document cgroup v2 freezer interface
    cgroup: add tracing points for cgroup v2 freezer
    cgroup: make TRACE_CGROUP_PATH irq-safe
    kselftests: cgroup: add freezer controller self-tests
    kselftests: cgroup: don't fail on cg_kill_all() error in cg_destroy()
    cgroup: cgroup v2 freezer
    cgroup: protect cgroup->nr_(dying_)descendants by css_set_lock
    cgroup: implement __cgroup_task_count() helper
    cgroup: rename freezer.c into legacy_freezer.c
    cgroup: remove extra cgroup_migrate_finish() call

    Linus Torvalds
     

09 May, 2019

1 commit

  • I've got two independent reports that cgroup_task_frozen() check
    in cgroup_exit() has been triggered by lkp libhugetlbfs-test and
    LTP ptrace01 tests.

    For example:
    [ 44.576072] WARNING: CPU: 1 PID: 3028 at kernel/cgroup/cgroup.c:5932 cgroup_exit+0x148/0x160
    [ 44.577724] Modules linked in: crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel sr_mod cdrom
    bochs_drm sg ttm ata_generic pata_acpi ppdev drm_kms_helper snd_pcm syscopyarea aesni_intel snd_timer
    sysfillrect sysimgblt snd crypto_simd cryptd glue_helper soundcore fb_sys_fops joydev drm serio_raw pcspkr
    ata_piix libata i2c_piix4 floppy parport_pc parport ip_tables
    [ 44.583106] CPU: 1 PID: 3028 Comm: ptrace-write-hu Not tainted 5.1.0-rc3-00053-g9262503 #5
    [ 44.584600] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
    [ 44.586116] RIP: 0010:cgroup_exit+0x148/0x160
    [ 44.587135] Code: 0f 84 50 ff ff ff 48 8b 85 c8 0c 00 00 48 8b 78 70 e8 ec 2e 00 00 e9 3b ff ff ff f0 ff 43 60
    0f 88 72 21 89 00 e9 48 ff ff ff 0b e9 1b ff ff ff e8 3c 73 f4 ff 66 90 66 2e 0f 1f 84 00 00 00
    [ 44.590113] RSP: 0018:ffffb25702dcfd30 EFLAGS: 00010002
    [ 44.591167] RAX: ffff96a7fee32410 RBX: ffff96a7ff1d6000 RCX: dead000000000200
    [ 44.592446] RDX: ffff96a7ff1d6080 RSI: ffff96a7fec75290 RDI: ffff96a7fec75290
    [ 44.593715] RBP: ffff96a7fec745c0 R08: ffff96a7fec74658 R09: 0000000000000000
    [ 44.594985] R10: 0000000000000000 R11: 0000000000000001 R12: ffff96a7fec75101
    [ 44.596266] R13: ffff96a7fec745c0 R14: ffff96a7ff3bde30 R15: ffff96a7fec75130
    [ 44.597550] FS: 0000000000000000(0000) GS:ffff96a7dd700000(0000) knlGS:0000000000000000
    [ 44.598950] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
    [ 44.600098] CR2: 00000000f7a00000 CR3: 000000000d20e000 CR4: 00000000000406e0
    [ 44.601417] Call Trace:
    [ 44.602777] do_exit+0x337/0xc40
    [ 44.603677] do_group_exit+0x3a/0xa0
    [ 44.604610] get_signal+0x12e/0x8d0
    [ 44.605533] ? __switch_to_asm+0x40/0x70
    [ 44.606503] do_signal+0x36/0x650
    [ 44.607409] ? __switch_to_asm+0x40/0x70
    [ 44.608383] ? __schedule+0x267/0x860
    [ 44.609329] exit_to_usermode_loop+0x89/0xf0
    [ 44.610349] do_fast_syscall_32+0x251/0x2e3
    [ 44.611357] entry_SYSENTER_compat+0x7f/0x91
    [ 44.612376] ---[ end trace e4ca5cfc4b7f7964 ]---

    The problem is caused by the ptrace_signal() call in the for loop
    in get_signal(). There is a cgroup_enter_frozen() call inside
    ptrace_signal(), so after exit from ptrace_signal() the task->frozen
    bit might be set. In this case do_group_exit() can be called with the
    task->frozen bit set and trigger the warning. This is the only place where
    we can leave the loop with the task->frozen bit set and without
    setting JOBCTL_TRAP_FREEZE and TIF_SIGPENDING.

    To resolve this problem, let's move cgroup_leave_frozen(true) call to
    just after the fatal label. If the task is going to die, the frozen
    bit must be cleared no matter how we get into this point.

    Reported-by: kernel test robot
    Reported-by: Qian Cai
    Cc: Oleg Nesterov
    Cc: Tejun Heo
    Signed-off-by: Roman Gushchin
    Signed-off-by: Tejun Heo

    Roman Gushchin
     

08 May, 2019

1 commit

  • Pull pidfd updates from Christian Brauner:
    "This patchset makes it possible to retrieve pidfds at process creation
    time by introducing the new flag CLONE_PIDFD to the clone() system
    call. Linus originally suggested to implement this as a new flag to
    clone() instead of making it a separate system call.

    After a thorough review from Oleg, CLONE_PIDFD returns pidfds in the
    parent_tidptr argument. This means we can give back the associated pid
    and the pidfd at the same time. Access to process metadata information
    thus becomes rather trivial.

    As has been agreed, CLONE_PIDFD creates file descriptors based on
    anonymous inodes similar to the new mount api. They are made
    unconditional by this patchset as they are now needed by core kernel
    code (vfs, pidfd) even more than they already were before (timerfd,
    signalfd, io_uring, epoll etc.). The core patchset is rather small.
    The bulky looking changelist is caused by David's very simple changes
    to Kconfig to make anon inodes unconditional.

    A pidfd comes with additional information in fdinfo if the kernel
    supports procfs. The fdinfo file contains the pid of the process in
    the caller's pid namespace in the same format as the procfs status
    file, i.e. "Pid:\t%d".

    To remove worries about missing metadata access this patchset comes
    with a sample/test program that illustrates how a combination of
    CLONE_PIDFD and pidfd_send_signal() can be used to gain race-free
    access to process metadata through /proc/.

    Further work based on this patchset has been done by Joel. His work
    makes pidfds pollable. It finished too late for this merge window. I
    would prefer to have it sitting in linux-next for a while and send it
    for inclusion during the 5.3 merge window"
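
    The fdinfo format mentioned above ("Pid:\t%d") can be read back from
    userspace with a few lines of C. The following is a minimal sketch, not
    code from the patchset: parse_fdinfo_pid() is a hypothetical helper name,
    and it works on text that the caller has already read from the fdinfo
    file of a pidfd.

```c
#include <stdio.h>
#include <string.h>

/*
 * Scan fdinfo-style text line by line for "Pid:\t<number>" and return
 * the number, or -1 if no such line is present.
 * (Hypothetical helper for illustration only.)
 */
int parse_fdinfo_pid(const char *text)
{
    const char *line = text;

    while (line && *line) {
        int pid;

        /* The match is anchored at the start of the current line. */
        if (sscanf(line, "Pid:\t%d", &pid) == 1)
            return pid;

        line = strchr(line, '\n');
        if (line)
            line++; /* step past the newline to the next line */
    }
    return -1;
}
```

    A caller would read the whole fdinfo file of the pidfd into a buffer and
    pass that buffer to the helper.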

    * tag 'pidfd-v5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    samples: show race-free pidfd metadata access
    signal: support CLONE_PIDFD with pidfd_send_signal
    clone: add CLONE_PIDFD
    Make anon_inodes unconditional

    Linus Torvalds
     

07 May, 2019

1 commit

  • Let pidfd_send_signal() use pidfds retrieved via CLONE_PIDFD. With this
    patch pidfd_send_signal() becomes independent of procfs. This fulfills
    the request made when we merged the pidfd_send_signal() patchset. The
    pidfd_send_signal() syscall is now always available allowing for it to
    be used by users without procfs mounted or even users without procfs
    support compiled into the kernel.

    Signed-off-by: Christian Brauner
    Co-developed-by: Jann Horn
    Signed-off-by: Jann Horn
    Acked-by: Oleg Nesterov
    Cc: Arnd Bergmann
    Cc: "Eric W. Biederman"
    Cc: Kees Cook
    Cc: Thomas Gleixner
    Cc: David Howells
    Cc: "Michael Kerrisk (man-pages)"
    Cc: Andy Lutomirsky
    Cc: Andrew Morton
    Cc: Aleksa Sarai
    Cc: Linus Torvalds
    Cc: Al Viro

    Christian Brauner
     

06 May, 2019

1 commit

  • If freezing of a cgroup races with waking of a task from
    the frozen state (like waiting in vfork() or in do_signal_stop()),
    a spurious transition of the cgroup state can happen.

    The task enters cgroup_leave_frozen(true), the cgroup->nr_frozen_tasks
    counter decrements, and the cgroup is switched to the unfrozen state.

    To prevent it, let's reserve cgroup_leave_frozen(true) for
    terminating processes and use cgroup_leave_frozen(false) otherwise.

    To avoid busy-looping in the signal handling loop waiting
    for JOBCTL_TRAP_FREEZE set from the cgroup freezing path,
    let's do it explicitly in cgroup_leave_frozen(), if the task
    is going to stay frozen.

    Suggested-by: Oleg Nesterov
    Signed-off-by: Roman Gushchin
    Signed-off-by: Tejun Heo

    Roman Gushchin
     

20 Apr, 2019

1 commit

  • Cgroup v1 implements the freezer controller, which provides an ability
    to stop the workload in a cgroup and temporarily free up some
    resources (cpu, io, network bandwidth and, potentially, memory)
    for some other tasks. Cgroup v2 lacks this functionality.

    This patch implements freezer for cgroup v2.

    Cgroup v2 freezer tries to put tasks into a state similar to jobctl
    stop. This means that tasks can be killed, ptraced (using
    PTRACE_SEIZE*), and interrupted. It is possible to attach to
    a frozen task, get some information (e.g. read registers) and detach.
    It's also possible to migrate a frozen task to another cgroup.

    This distinguishes the cgroup v2 freezer from the cgroup v1 freezer, which
    mostly tried to imitate the system-wide freezer. However, while
    uninterruptible sleep is fine when all tasks are going to be frozen
    (the hibernation case), it is not an acceptable state for only a subset
    of the system.

    Cgroup v2 freezer does not support freezing kthreads.
    If a non-root cgroup contains a kthread, the cgroup can still be frozen,
    but the kthread will remain running, the cgroup will be shown
    as non-frozen, and the notification will not be delivered.

    * PTRACE_ATTACH is not working because non-fatal signal delivery
    is blocked in frozen state.

    There are some interface differences between cgroup v1 and cgroup v2
    freezer too, which are required to conform to the cgroup v2 interface
    design principles:
    1) There is no separate controller, which has to be turned on:
    the functionality is always available and is represented by
    cgroup.freeze and cgroup.events cgroup control files.
    2) The desired state is defined by the cgroup.freeze control file.
    Any hierarchical configuration is allowed.
    3) The interface is asynchronous. The actual state is available
    using cgroup.events control file ("frozen" field). There are no
    dedicated transitional states.
    4) It's allowed to make any changes with the cgroup hierarchy
    (create new cgroups, remove old cgroups, move tasks between cgroups)
    no matter if some cgroups are frozen.
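
    Because the interface is asynchronous, userspace has to look at the
    "frozen" field of cgroup.events to learn the actual state. A minimal
    sketch of that check, assuming the file content is a list of
    "key value" lines (for example "populated 1" followed by "frozen 0");
    cgroup_events_frozen() is a hypothetical helper name, not part of this
    patch:

```c
#include <stdio.h>
#include <string.h>

/*
 * Walk "key value" pairs in cgroup.events-style text.
 * Returns 1 if "frozen" is non-zero, 0 if it is zero, and -1 if the
 * key is not present. (Hypothetical helper for illustration only.)
 */
int cgroup_events_frozen(const char *events)
{
    char key[32];
    int value, used;

    /* %n records how many characters were consumed by this pair. */
    while (sscanf(events, " %31s %d%n", key, &value, &used) == 2) {
        if (strcmp(key, "frozen") == 0)
            return value != 0;
        events += used;
    }
    return -1;
}
```

    After writing "1" to cgroup.freeze, a manager would re-read
    cgroup.events (or wait for a change notification on it) and call the
    helper until it reports 1.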

    Signed-off-by: Roman Gushchin
    Signed-off-by: Tejun Heo
    No-objection-from-me-by: Oleg Nesterov
    Cc: kernel-team@fb.com

    Roman Gushchin
     

18 Apr, 2019

1 commit

  • As stated in the original commit for pidfd_send_signal(), we don't allow
    signaling processes through O_PATH file descriptors since it is
    semantically equivalent to a write on the pidfd.

    We already correctly error out right now and return EBADF if an O_PATH
    fd is passed. This is because we use file->f_op to detect whether a
    pidfd is passed and O_PATH fds have their file->f_op set to empty_fops
    in do_dentry_open() and thus fail the test.

    Thus, there is no regression. It's just semantically correct to use
    fdget() and return an error right from there instead of taking a
    reference and returning an error later.

    Signed-off-by: Christian Brauner
    Acked-by: Oleg Nesterov
    Cc: Arnd Bergmann
    Cc: "Eric W. Biederman"
    Cc: Kees Cook
    Cc: Thomas Gleixner
    Cc: Jann Horn
    Cc: David Howells
    Cc: "Michael Kerrisk (man-pages)"
    Cc: Andy Lutomirsky
    Cc: Andrew Morton
    Cc: Oleg Nesterov
    Cc: Aleksa Sarai
    Cc: Al Viro
    Signed-off-by: Linus Torvalds

    Christian Brauner
     

02 Apr, 2019

1 commit

  • The current sys_pidfd_send_signal() silently turns signals with explicit
    SI_USER context that are sent to non-current tasks into signals with
    kernel-generated siginfo.
    This is unlike do_rt_sigqueueinfo(), which returns -EPERM in this case.
    If a user actually wants to send a signal with kernel-provided siginfo,
    they can do that with pidfd_send_signal(pidfd, sig, NULL, 0); so allowing
    this case is unnecessary.

    Instead of silently replacing the siginfo, just bail out with an error;
    this is consistent with other interfaces and avoids special-casing behavior
    based on security checks.

    Fixes: 3eb39f47934f ("signal: add pidfd_send_signal() syscall")
    Signed-off-by: Jann Horn
    Signed-off-by: Christian Brauner

    Jann Horn
     

17 Mar, 2019

1 commit

  • Pull pidfd system call from Christian Brauner:
    "This introduces the ability to use file descriptors from /proc//
    as stable handles on struct pid. Even if a pid is recycled the handle
    will not change. For a start these fds can be used to send signals to
    the processes they refer to.

    With the ability to use /proc/ fds as stable handles on struct
    pid we can fix a long-standing issue where after a process has exited
    its pid can be reused by another process. If a caller sends a signal
    to a reused pid it will end up signaling the wrong process.

    With this patchset we enable a variety of use cases. One obvious
    example is that we can now safely delegate an important part of
    process management - sending signals - to processes other than the
    parent of a given process by sending file descriptors around via scm
    rights and not fearing that the given process will have been recycled
    in the meantime. It also allows for easy testing whether a given
    process is still alive or not by sending signal 0 to a pidfd which is
    quite handy.

    There has been some interest in this feature e.g. from systems
    management (systemd, glibc) and container managers. I have requested
    and gotten comments from glibc to make sure that this syscall is
    suitable for their needs as well. In the future I expect it to take on
    most other pid-based signal syscalls. But such features are left for
    the future once they are needed.

    This has been sitting in linux-next for quite a while and has not
    caused any issues. It comes with selftests which verify basic
    functionality and also test that a recycled pid cannot be signaled via
    a pidfd.

    Jon has written about a prior version of this patchset. It should
    cover the basic functionality since not a lot has changed since then:

    https://lwn.net/Articles/773459/

    The commit message for the syscall itself extensively documents
    the syscall, including its functionality and extensibility"
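
    The liveness test described in the message above (sending signal 0 to a
    pidfd) could look roughly like this from userspace. This is a minimal
    sketch under stated assumptions: pidfd_process_alive() is a hypothetical
    helper name, and the code relies on the installed headers defining
    __NR_pidfd_send_signal.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Probe the process behind a pidfd by sending signal 0, which performs
 * the usual checks without delivering anything.
 * Returns 1 if the process can be signaled, 0 if it is gone (ESRCH),
 * and -1 on any other error (errno is left set, e.g. EBADF, EPERM or
 * ENOSYS). (Hypothetical helper for illustration only.)
 */
int pidfd_process_alive(int pidfd)
{
#ifdef __NR_pidfd_send_signal
    if (syscall(__NR_pidfd_send_signal, pidfd, 0, NULL, 0) == 0)
        return 1;
    return errno == ESRCH ? 0 : -1;
#else
    /* Headers predate the syscall. */
    errno = ENOSYS;
    return -1;
#endif
}
```

    Note that -1 does not mean the process is dead: a caller without
    permission to signal the target also ends up there and has to inspect
    errno.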

    * tag 'pidfd-v5.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    selftests: add tests for pidfd_send_signal()
    signal: add pidfd_send_signal() syscall

    Linus Torvalds
     

06 Mar, 2019

2 commits

  • Pull year 2038 updates from Thomas Gleixner:
    "Another round of changes to make the kernel ready for 2038. After lots
    of preparatory work this is the first set of syscalls which are 2038
    safe:

    403 clock_gettime64
    404 clock_settime64
    405 clock_adjtime64
    406 clock_getres_time64
    407 clock_nanosleep_time64
    408 timer_gettime64
    409 timer_settime64
    410 timerfd_gettime64
    411 timerfd_settime64
    412 utimensat_time64
    413 pselect6_time64
    414 ppoll_time64
    416 io_pgetevents_time64
    417 recvmmsg_time64
    418 mq_timedsend_time64
    419 mq_timedreceive_time64
    420 semtimedop_time64
    421 rt_sigtimedwait_time64
    422 futex_time64
    423 sched_rr_get_interval_time64

    The syscall numbers are identical all over the architectures"
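
    For readers wondering why these calls need a 64-bit time type at all,
    the arithmetic behind "2038" can be sketched in a few lines of C. This
    is an illustration only, not kernel code: utc_year_of() and
    wrap_to_time32() are hypothetical helper names, and the sketch assumes
    it is built on a host whose time_t is 64 bits wide.

```c
#include <stdint.h>
#include <time.h>

/* UTC year for a seconds-since-epoch value, or -1 on conversion failure. */
int utc_year_of(int64_t secs)
{
    time_t t = (time_t)secs;
    struct tm tm;

    if (gmtime_r(&t, &tm) == NULL)
        return -1;
    return tm.tm_year + 1900;
}

/*
 * Model what a signed 32-bit seconds counter would hold for the given
 * value by wrapping it into the range [-2^31, 2^31 - 1].
 */
int64_t wrap_to_time32(int64_t secs)
{
    const int64_t span = INT64_C(1) << 32;
    int64_t m = ((secs % span) + span) % span; /* 0 .. 2^32 - 1 */

    return m >= (INT64_C(1) << 31) ? m - span : m;
}
```

    The largest value a signed 32-bit counter can hold, INT32_MAX, falls in
    January 2038. One second later the wrapped value is INT32_MIN, which
    converts to a date in 1901.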

    * 'timers-2038-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
    riscv: Use latest system call ABI
    checksyscalls: fix up mq_timedreceive and stat exceptions
    unicore32: Fix __ARCH_WANT_STAT64 definition
    asm-generic: Make time32 syscall numbers optional
    asm-generic: Drop getrlimit and setrlimit syscalls from default list
    32-bit userspace ABI: introduce ARCH_32BIT_OFF_T config option
    compat ABI: use non-compat openat and open_by_handle_at variants
    y2038: add 64-bit time_t syscalls to all 32-bit architectures
    y2038: rename old time and utime syscalls
    y2038: remove struct definition redirects
    y2038: use time32 syscall names on 32-bit
    syscalls: remove obsolete __IGNORE_ macros
    y2038: syscalls: rename y2038 compat syscalls
    x86/x32: use time64 versions of sigtimedwait and recvmmsg
    timex: change syscalls to use struct __kernel_timex
    timex: use __kernel_timex internally
    sparc64: add custom adjtimex/clock_adjtime functions
    time: fix sys_timer_settime prototype
    time: Add struct __kernel_timex
    time: make adjtime compat handling available for 32 bit
    ...

    Linus Torvalds
     
  • The kill() syscall operates on process identifiers (pid). After a process
    has exited its pid can be reused by another process. If a caller sends a
    signal to a reused pid it will end up signaling the wrong process. This
    issue has often surfaced and there has been a push to address this problem [1].

    This patch uses file descriptors (fd) from proc/ as stable handles on
    struct pid. Even if a pid is recycled the handle will not change. The fd
    can be used to send signals to the process it refers to.
    Thus, the new syscall pidfd_send_signal() is introduced to solve this
    problem. Instead of pids it operates on process fds (pidfd).

    /* prototype and argument */
    long pidfd_send_signal(int pidfd, int sig, siginfo_t *info, unsigned int flags);

    /* syscall number 424 */
    The syscall number was chosen to be 424 to align with Arnd's y2038
    rework and to minimize merge conflicts (cf. [25]).

    In addition to the pidfd and signal argument it takes an additional
    siginfo_t and flags argument. If the siginfo_t argument is NULL then
    pidfd_send_signal() is equivalent to kill(, ). If it
    is not NULL pidfd_send_signal() is equivalent to rt_sigqueueinfo().
    The flags argument is added to allow for future extensions of this syscall.
    It currently needs to be passed as 0. Failing to do so will cause EINVAL.

    /* pidfd_send_signal() replaces multiple pid-based syscalls */
    The pidfd_send_signal() syscall currently takes on the job of
    rt_sigqueueinfo(2) and parts of the functionality of kill(2), namely when a
    positive pid is passed to kill(2). It will, however, be possible to also
    replace tgkill(2) and rt_tgsigqueueinfo(2) if this syscall is extended.

    /* sending signals to threads (tid) and process groups (pgid) */
    Specifically, the pidfd_send_signal() syscall does currently not operate on
    process groups or threads. This is left for future extensions.
    In order to extend the syscall to allow sending signals to threads and
    process groups, appropriately named flags (e.g. PIDFD_TYPE_PGID and
    PIDFD_TYPE_TID) should be added. This implies that the flags argument will
    determine what is signaled and not the file descriptor itself. Put in other
    words, grouping in this api is a property of the flags argument not a
    property of the file descriptor (cf. [13]). Clarification for this has been
    requested by Eric (cf. [19]).
    When appropriate extensions through the flags argument are added then
    pidfd_send_signal() can additionally replace the part of kill(2) which
    operates on process groups as well as the tgkill(2) and
    rt_tgsigqueueinfo(2) syscalls.
    How such an extension could be implemented has been very roughly sketched
    in [14], [15], and [16]. However, this should not be taken as a commitment
    to a particular implementation. There might be better ways to do it.
    Right now this is intentionally left out to keep this patchset as simple as
    possible (cf. [4]).

    /* naming */
    The syscall had various names throughout iterations of this patchset:
    - procfd_signal()
    - procfd_send_signal()
    - taskfd_send_signal()
    In the last round of reviews it was pointed out that, given that the
    flags argument decides the scope of the signal instead of different types
    of fds, it might make sense to settle for either "procfd_" or "pidfd_" as
    prefix. The community was willing to accept either (cf. [17] and [18]).
    Given that one developer expressed strong preference for the "pidfd_"
    prefix (cf. [13]) and with other developers less opinionated about the name
    we should settle for "pidfd_" to avoid further bikeshedding.

    The "_send_signal" suffix was chosen to reflect the fact that the syscall
    takes on the job of multiple syscalls. It is therefore intentional that the
    name is reminiscent of neither kill(2) nor rt_sigqueueinfo(2). Not the
    former because it might imply that pidfd_send_signal() is a replacement for
    kill(2), and not the latter because it is a hassle to remember the correct
    spelling - especially for non-native speakers - and because it is not
    descriptive enough of what the syscall actually does. The name
    "pidfd_send_signal" makes it very clear that its job is to send signals.

    /* zombies */
    Zombies can be signaled just as any other process. No special error will be
    reported since a zombie state is an unreliable state (cf. [3]). However,
    this can be added as an extension through the @flags argument if the need
    ever arises.

    /* cross-namespace signals */
    The patch currently enforces that the signaler and signalee either are in
    the same pid namespace or that the signaler's pid namespace is an ancestor
    of the signalee's pid namespace. This is done for the sake of simplicity
    and because it is unclear to what values certain members of struct
    siginfo_t would need to be set to (cf. [5], [6]).
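
    The "same namespace or ancestor" rule can be pictured with a small toy
    model. The sketch below is not the kernel implementation: struct
    toy_pidns and toy_may_signal_across() are hypothetical names that only
    keep a parent link and a nesting level, which is enough to show the
    walk from the signalee's namespace up towards the signaler's.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a pid namespace (illustration only). */
struct toy_pidns {
    const struct toy_pidns *parent; /* NULL for the initial namespace */
    unsigned int level;             /* 0 for the initial namespace */
};

/*
 * True if the signaler's namespace is the signalee's namespace or one
 * of its ancestors.
 */
bool toy_may_signal_across(const struct toy_pidns *signaler,
                           const struct toy_pidns *signalee)
{
    const struct toy_pidns *ns = signalee;

    /* Walk up from the target until we reach the signaler's depth. */
    while (ns && ns->level > signaler->level)
        ns = ns->parent;

    return ns == signaler;
}
```

    In this model a process in the initial namespace may signal into any
    nested namespace, while a process in a nested namespace may not signal
    its parent namespace or a sibling's descendants.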

    /* compat syscalls */
    It became clear that we would like to avoid adding compat syscalls
    (cf. [7]). The compat syscall handling is now done in kernel/signal.c
    itself by adding __copy_siginfo_from_user_any() which lets us avoid
    compat syscalls (cf. [8]). It should be noted that the addition of
    __copy_siginfo_from_user_any() is caused by a bug in the original
    implementation of rt_sigqueueinfo(2) (cf. [12]).
    With upcoming rework for syscall handling things might improve
    significantly (cf. [11]) and __copy_siginfo_from_user_any() will not gain
    any additional callers.

    /* testing */
    This patch was tested on x64 and x86.

    /* userspace usage */
    An asciinema recording for the basic functionality can be found under [9].
    With this patch a process can be killed via:

    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <sys/syscall.h>
    #include <sys/types.h>
    #include <unistd.h>

    static inline int do_pidfd_send_signal(int pidfd, int sig, siginfo_t *info,
                                           unsigned int flags)
    {
    #ifdef __NR_pidfd_send_signal
        return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags);
    #else
        /* Headers predate the syscall: report ENOSYS through errno. */
        errno = ENOSYS;
        return -1;
    #endif
    }

    int main(int argc, char *argv[])
    {
        int fd, ret, saved_errno, sig;

        /* Usage: <prog> <procfs directory of the target> <signal number> */
        if (argc < 3)
            exit(EXIT_FAILURE);

        /* The directory fd acts as the stable handle on the process. */
        fd = open(argv[1], O_DIRECTORY | O_CLOEXEC);
        if (fd < 0) {
            printf("%s - Failed to open \"%s\"\n", strerror(errno), argv[1]);
            exit(EXIT_FAILURE);
        }

        sig = atoi(argv[2]);

        printf("Sending signal %d to process %s\n", sig, argv[1]);
        ret = do_pidfd_send_signal(fd, sig, NULL, 0);

        /* close() may clobber errno, so preserve it for the report below. */
        saved_errno = errno;
        close(fd);
        errno = saved_errno;

        if (ret < 0) {
            printf("%s - Failed to send signal %d to process %s\n",
                   strerror(errno), sig, argv[1]);
            exit(EXIT_FAILURE);
        }

        exit(EXIT_SUCCESS);
    }

    /* Q&A
    * Given that it seems the same questions get asked again by people who are
    * late to the party it makes sense to add a Q&A section to the commit
    * message so it's hopefully easier to avoid duplicate threads.
    *
    * For the sake of progress please consider these arguments settled unless
    * there is a new point that desperately needs to be addressed. Please make
    * sure to check the links to the threads in this commit message whether
    * this has not already been covered.
    */
    Q-01: (Florian Weimer [20], Andrew Morton [21])
    What happens when the target process has exited?
    A-01: Sending the signal will fail with ESRCH (cf. [22]).

    Q-02: (Andrew Morton [21])
    Is the task_struct pinned by the fd?
    A-02: No. A reference to struct pid is kept. struct pid - as far as I
    understand - was created exactly so that struct task_struct does not
    have to be pinned (cf. [22]).

    Q-03: (Andrew Morton [21])
    Does the entire procfs directory remain visible? Just one entry
    within it?
    A-03: The same thing that happens right now when you hold a file descriptor
    to /proc/ open (cf. [22]).

    Q-04: (Andrew Morton [21])
    Does the pid remain reserved?
    A-04: No. This patchset guarantees a stable handle, not that pids are not
    recycled (cf. [22]).

    Q-05: (Andrew Morton [21])
    Do attempts to signal that fd return errors?
    A-05: See {Q,A}-01.

    Q-06: (Andrew Morton [21])
    Is there a cleaner way of obtaining the fd? Another syscall perhaps.
    A-06: Userspace can already trivially retrieve file descriptors from procfs
    so this is something that we will need to support anyway. Hence,
    there's no immediate need to add another syscall just to make
    pidfd_send_signal() not dependent on the presence of procfs. However,
    adding a syscall to get such file descriptors is planned for a
    future patchset (cf. [22]).

    Q-07: (Andrew Morton [21] and others)
    This fd-for-a-process sounds like a handy thing and people may well
    think up other uses for it in the future, probably unrelated to
    signals. Are the code and the interface designed to permit such
    future applications?
    A-07: Yes (cf. [22]).

    Q-08: (Andrew Morton [21] and others)
    Now I think about it, why a new syscall? This thing is looking
    rather like an ioctl?
    A-08: This has been extensively discussed. It was agreed that a syscall is
    preferred for a variety of reasons. Here are just a few taken from
    prior threads. Syscalls are safer than ioctl()s especially when
    signaling to fds. Processes are a core kernel concept so a syscall
    seems more appropriate. The layout of the syscall with its four
    arguments would require the addition of a custom struct for the
    ioctl() thereby causing at least the same amount or even more
    complexity for userspace than a simple syscall. The new syscall will
    replace multiple other pid-based syscalls (see description above).
    The file-descriptors-for-processes concept introduced with this
    syscall will be extended with other syscalls in the future. See also
    [22], [23] and various other threads already linked in here.

    Q-09: (Florian Weimer [24])
    What happens if you use the new interface with an O_PATH descriptor?
    A-09:
    pidfds opened as O_PATH fds cannot be used to send signals to a
    process (cf. [2]). Signaling processes through pidfds is the
    equivalent of writing to a file. Thus, this is not an operation that
    operates "purely at the file descriptor level" as required by the
    open(2) manpage. See also [4].

    /* References */
    [1]: https://lore.kernel.org/lkml/20181029221037.87724-1-dancol@google.com/
    [2]: https://lore.kernel.org/lkml/874lbtjvtd.fsf@oldenburg2.str.redhat.com/
    [3]: https://lore.kernel.org/lkml/20181204132604.aspfupwjgjx6fhva@brauner.io/
    [4]: https://lore.kernel.org/lkml/20181203180224.fkvw4kajtbvru2ku@brauner.io/
    [5]: https://lore.kernel.org/lkml/20181121213946.GA10795@mail.hallyn.com/
    [6]: https://lore.kernel.org/lkml/20181120103111.etlqp7zop34v6nv4@brauner.io/
    [7]: https://lore.kernel.org/lkml/36323361-90BD-41AF-AB5B-EE0D7BA02C21@amacapital.net/
    [8]: https://lore.kernel.org/lkml/87tvjxp8pc.fsf@xmission.com/
    [9]: https://asciinema.org/a/IQjuCHew6bnq1cr78yuMv16cy
    [11]: https://lore.kernel.org/lkml/F53D6D38-3521-4C20-9034-5AF447DF62FF@amacapital.net/
    [12]: https://lore.kernel.org/lkml/87zhtjn8ck.fsf@xmission.com/
    [13]: https://lore.kernel.org/lkml/871s6u9z6u.fsf@xmission.com/
    [14]: https://lore.kernel.org/lkml/20181206231742.xxi4ghn24z4h2qki@brauner.io/
    [15]: https://lore.kernel.org/lkml/20181207003124.GA11160@mail.hallyn.com/
    [16]: https://lore.kernel.org/lkml/20181207015423.4miorx43l3qhppfz@brauner.io/
    [17]: https://lore.kernel.org/lkml/CAGXu5jL8PciZAXvOvCeCU3wKUEB_dU-O3q0tDw4uB_ojMvDEew@mail.gmail.com/
    [18]: https://lore.kernel.org/lkml/20181206222746.GB9224@mail.hallyn.com/
    [19]: https://lore.kernel.org/lkml/20181208054059.19813-1-christian@brauner.io/
    [20]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/
    [21]: https://lore.kernel.org/lkml/20181228152012.dbf0508c2508138efc5f2bbe@linux-foundation.org/
    [22]: https://lore.kernel.org/lkml/20181228233725.722tdfgijxcssg76@brauner.io/
    [23]: https://lwn.net/Articles/773459/
    [24]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/
    [25]: https://lore.kernel.org/lkml/CAK8P3a0ej9NcJM8wXNPbcGUyOUZYX+VLoDFdbenW3s3114oQZw@mail.gmail.com/

    Cc: "Eric W. Biederman"
    Cc: Jann Horn
    Cc: Andy Lutomirsky
    Cc: Andrew Morton
    Cc: Oleg Nesterov
    Cc: Al Viro
    Cc: Florian Weimer
    Signed-off-by: Christian Brauner
    Reviewed-by: Tycho Andersen
    Reviewed-by: Kees Cook
    Reviewed-by: David Howells
    Acked-by: Arnd Bergmann
    Acked-by: Thomas Gleixner
    Acked-by: Serge Hallyn
    Acked-by: Aleksa Sarai

    Christian Brauner
     

13 Feb, 2019

1 commit

  • In the middle of do_exit() there is a call
    "ptrace_event(PTRACE_EVENT_EXIT, code);" That call places the process
    in TASK_TRACED aka "(TASK_WAKEKILL | __TASK_TRACED)" and waits for
    the debugger to release the task or SIGKILL to be delivered.

    Skipping past dequeue_signal when we know a fatal signal has already
    been delivered resulted in SIGKILL remaining pending and
    TIF_SIGPENDING remaining set. This in turn caused the
    scheduler to not sleep in PTRACE_EVENT_EXIT as it figured
    a fatal signal was pending. This also caused ptrace_freeze_traced
    in ptrace_check_attach to fail because it left a per thread
    SIGKILL pending which is what fatal_signal_pending tests for.

    This difference in signal state caused strace to report
    strace: Exit of unknown pid NNNNN ignored

    Therefore update the signal handling state like dequeue_signal
    would when removing a per thread SIGKILL, by removing SIGKILL
    from the per thread signal mask and clearing TIF_SIGPENDING.

    Acked-by: Oleg Nesterov
    Reported-by: Oleg Nesterov
    Reported-by: Ivan Delalande
    Cc: stable@vger.kernel.org
    Fixes: 35634ffa1751 ("signal: Always notice exiting tasks")
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

07 Feb, 2019

1 commit

  • Recently syzkaller was able to create unkillable processes by
    creating a timer that is delivered as a thread local signal on SIGHUP,
    and receiving SIGHUP with SA_NODEFER, ultimately causing a loop that
    keeps failing to deliver SIGHUP but always tries again.

    When the stack overflows, delivery of SIGHUP fails and force_sigsegv is
    called. Unfortunately, because SIGSEGV is numerically higher than
    SIGHUP, next_signal tries again to deliver a SIGHUP.

    From a quality of implementation standpoint attempting to deliver the
    timer SIGHUP signal is wrong. We should attempt to deliver the
    synchronous SIGSEGV signal we just forced.

    We can make that happen in a fairly straightforward manner: instead
    of just looking at the signal number, we also look at the si_code.
    In particular, for exceptions (aka synchronous signals) the si_code
    is always greater than 0.

    That still has the potential to pick up a number of asynchronous
    signals as in a few cases the same si_codes that are used
    for synchronous signals are also used for asynchronous signals,
    and SI_KERNEL is also included in the list of possible si_codes.

    Still, the heuristic is much better and timer signals are definitely
    excluded, which is enough to close all known ways for someone to
    send a process signals fast enough to cause unexpected and
    arguably incorrect behavior.
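
    The heuristic can be illustrated from userspace with the si_code
    constants from <signal.h>. This is only a sketch of the idea, not the
    kernel code changed by this patch: looks_synchronous() is a
    hypothetical name, and the set of signal numbers treated as exception
    candidates below is an assumption made for the example.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdbool.h>

/*
 * Return true if a queued signal looks like a synchronous exception:
 * its si_code is greater than 0 and its number is one of the signals
 * that hardware faults are reported with.
 * (Hypothetical helper; the signal set is an assumption.)
 */
bool looks_synchronous(int signo, int si_code)
{
    /* SI_USER is 0; SI_QUEUE, SI_TIMER and friends are negative. */
    if (si_code <= 0)
        return false;

    switch (signo) {
    case SIGSEGV:
    case SIGBUS:
    case SIGILL:
    case SIGTRAP:
    case SIGFPE:
    case SIGSYS:
        return true;
    default:
        return false;
    }
}
```

    With this check a timer-generated SIGHUP (si_code SI_TIMER) is never
    picked as synchronous, while a forced SIGSEGV (si_code SI_KERNEL) or a
    real fault (for example SEGV_MAPERR) is, even though SIGHUP has the
    lower signal number.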

    Cc: stable@vger.kernel.org
    Fixes: a27341cd5fcb ("Prioritize synchronous signals over 'normal' signals")
    Tested-by: Dmitry Vyukov
    Reported-by: Dmitry Vyukov
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman