11 Oct, 2019

1 commit


17 Sep, 2019

1 commit

  • Pull pidfd/waitid updates from Christian Brauner:
    "This contains two features and various tests.

    First, it adds support for waiting on processes through pidfds by adding
    the P_PIDFD type to the waitid() syscall. This completes the basic
    functionality of the pidfd api (cf. [1]). In the meantime we also have
    a new addition to the userspace projects that make use of the pidfd
    api. The qt project was nice enough to send a mail pointing out that
    they have a pr up to switch to the pidfd api (cf. [2]).

    Second, this tag contains an extension to the waitid() syscall to make
    it possible to wait on the current process group in a race free manner
    (even though the actual problem is very unlikely) by specifying 0
    together with the P_PGID type. This extension traces back to a
    discussion on the glibc development mailing list.

    There are also a range of tests for the features above. Additionally,
    the test-suite which detected the pidfd-polling race we fixed in [3]
    is included in this tag"

    [1] https://lwn.net/Articles/794707/
    [2] https://codereview.qt-project.org/c/qt/qtbase/+/108456
    [3] commit b191d6491be6 ("pidfd: fix a poll race when setting exit_state")

    * tag 'core-process-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    waitid: Add support for waiting for the current process group
    tests: add pidfd poll tests
    tests: move common definitions and functions into pidfd.h
    pidfd: add pidfd_wait tests
    pidfd: add P_PIDFD to waitid()

    Linus Torvalds
     

19 Aug, 2019

1 commit

  • My recent change to only use force_sig for synchronous events
    wound up breaking signal reception in cifs and drbd. I had overlooked
    the fact that by default kthreads start out with all signals set to
    SIG_IGN. So a change I thought was safe turned out to have made it
    impossible for those kernel threads to catch their signals.

    Reverting the work on force_sig is a bad idea because what the code
    was doing was very much a misuse of force_sig: the way force_sig
    ultimately allowed the signal to happen was to change the signal
    handler to SIG_DFL, which after the first signal will allow userspace
    to send signals to these kernel threads. At least for
    wake_ack_receiver in drbd that does not appear actively wrong.

    So correct this problem by adding allow_kernel_signal that will allow
    signals whose siginfo reports they were sent by the kernel through,
    but will not allow userspace generated signals, and update cifs and
    drbd to call allow_kernel_signal in an appropriate place so that their
    thread can receive this signal.

    Fixing things this way ensures that userspace won't be able to send
    signals and cause problems, that it is clear which signals the
    threads are expecting to receive, and it guarantees that nothing
    else in the system will be affected.

    This change was partly inspired by similar cifs and drbd patches that
    added allow_signal.

    Reported-by: ronnie sahlberg
    Reported-by: Christoph Böhmwalder
    Tested-by: Christoph Böhmwalder
    Cc: Steve French
    Cc: Philipp Reisner
    Cc: David Laight
    Fixes: 247bc9470b1e ("cifs: fix rmmod regression in cifs.ko caused by force_sig changes")
    Fixes: 72abe3bcf091 ("signal/cifs: Fix cifs_put_tcp_session to call send_sig instead of force_sig")
    Fixes: fee109901f39 ("signal/drbd: Use send_sig not force_sig")
    Fixes: 3cf5d076fb4d ("signal: Remove task parameter from force_sig")
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

03 Aug, 2019

1 commit

  • The kernel-doc parser doesn't handle expressions with %foo*. Instead,
    when an asterisk should be part of a constant, it uses an alternative
    notation: `foo*`.

    Link: http://lkml.kernel.org/r/7f18c2e0b5e39e6b7eb55ddeb043b8b260b49f2d.1563361575.git.mchehab+samsung@kernel.org
    Signed-off-by: Mauro Carvalho Chehab
    Cc: Deepa Dinamani
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mauro Carvalho Chehab
     

02 Aug, 2019

1 commit

  • This adds the P_PIDFD type to waitid().
    One of the last remaining bits for the pidfd api is to make it possible
    to wait on pidfds. With P_PIDFD added to waitid() the parts of userspace
    that want to use the pidfd api to exclusively manage processes can do so
    now.

    One of the things this will unblock in the future is the ability to make
    it possible to retrieve the exit status via waitid(P_PIDFD) for
    non-parent processes if handed a _suitable_ pidfd that has this feature
    set. This is similar to what you can do on FreeBSD with kqueue(). It
    might even end up being possible to wait on a process as a non-parent if
    an appropriate property is enabled on the pidfd.

    With P_PIDFD no scoping of the process identified by the pidfd is
    possible, i.e. it explicitly blocks things such as wait4(-1), wait4(0),
    waitid(P_ALL), waitid(P_PGID) etc. It only allows for semantics
    equivalent to wait4(pid), waitid(P_PID). Users that need scoping should
    rely on pid-based wait*() syscalls for now.
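
    As an illustration of the semantics described above, here is a minimal
    userspace sketch, not taken from the patch itself. It assumes a kernel
    with both pidfd_open() and P_PIDFD support; the helper name
    reap_child_via_pidfd, the fallback syscall number 434 and the constant
    EXAMPLE_P_PIDFD (3) are assumptions added for the example, with the
    numbers matching the kernel uapi headers on most architectures.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fallback for older userspace headers (assumed generic syscall number). */
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif
/* Assumed value of P_PIDFD, as in include/uapi/linux/wait.h. */
#define EXAMPLE_P_PIDFD 3

/*
 * Hypothetical helper: fork a child that exits with `code`, then reap it
 * through a pidfd. Returns the child's exit status, -2 if the pidfd path
 * is not usable on the running system, and -1 if fork() fails or the
 * child did not exit normally.
 */
int reap_child_via_pidfd(int code)
{
    siginfo_t info;
    int pidfd;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);

    pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
    if (pidfd < 0) {
        waitpid(pid, NULL, 0);   /* pid-based fallback, avoids a zombie */
        return -2;
    }

    memset(&info, 0, sizeof(info));
    /* Equivalent in scope to waitid(P_PID, pid, ...): exactly one process. */
    if (waitid((idtype_t)EXAMPLE_P_PIDFD, (id_t)pidfd, &info, WEXITED) < 0) {
        close(pidfd);
        waitpid(pid, NULL, 0);
        return -2;
    }
    close(pidfd);
    return info.si_code == CLD_EXITED ? info.si_status : -1;
}
```

    Calling reap_child_via_pidfd(7) should yield 7 where the feature is
    available; there is no pidfd equivalent of wait4(-1) or waitid(P_ALL)
    in this sketch, in line with the scoping restriction above.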

    Signed-off-by: Christian Brauner
    Reviewed-by: Kees Cook
    Reviewed-by: Oleg Nesterov
    Cc: Arnd Bergmann
    Cc: "Eric W. Biederman"
    Cc: Joel Fernandes (Google)
    Cc: Thomas Gleixner
    Cc: David Howells
    Cc: Jann Horn
    Cc: Andy Lutomirski
    Cc: Andrew Morton
    Cc: Aleksa Sarai
    Cc: Linus Torvalds
    Cc: Al Viro
    Link: https://lore.kernel.org/r/20190727222229.6516-2-christian@brauner.io

    Christian Brauner
     

29 Jul, 2019

1 commit

  • Previously a condition got missed where the pidfd waiters are awakened
    before the exit_state gets set. This can result in a missed notification
    [1] and the polling thread waiting forever.

    It is fixed now, however it would be nice to avoid this kind of issue
    going unnoticed in the future. So just add a warning to catch it in the
    future.

    /* References */
    [1]: https://lore.kernel.org/lkml/20190717172100.261204-1-joel@joelfernandes.org/

    Signed-off-by: Joel Fernandes (Google)
    Link: https://lore.kernel.org/r/20190724164816.201099-1-joel@joelfernandes.org
    Signed-off-by: Christian Brauner

    Joel Fernandes (Google)
     

17 Jul, 2019

1 commit

  • task->saved_sigmask and ->restore_sigmask are only used in the ret-from-
    syscall paths. This means that set_user_sigmask() can save ->blocked in
    ->saved_sigmask and do set_restore_sigmask() to indicate that ->blocked
    was modified.

    This way the callers do not need 2 sigset_t's passed to set/restore, and
    restore_user_sigmask(), renamed to restore_saved_sigmask_unless(), turns
    into a trivial helper which just calls restore_saved_sigmask().

    Link: http://lkml.kernel.org/r/20190606113206.GA9464@redhat.com
    Signed-off-by: Oleg Nesterov
    Cc: Deepa Dinamani
    Cc: Arnd Bergmann
    Cc: Jens Axboe
    Cc: Davidlohr Bueso
    Cc: Eric Wong
    Cc: Jason Baron
    Cc: Thomas Gleixner
    Cc: Al Viro
    Cc: Eric W. Biederman
    Cc: David Laight
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

11 Jul, 2019

1 commit

  • Pull pidfd updates from Christian Brauner:
    "This adds two main features.

    - First, it adds polling support for pidfds. This allows process
    managers to know when a (non-parent) process dies in a race-free
    way.

    The notification mechanism used follows the same logic that is
    currently used when the parent of a task is notified of a child's
    death. With this patchset it is possible to put pidfds in an
    {e}poll loop and get reliable notifications for process (i.e.
    thread-group) exit.

    - The second feature complements the first one by making it possible
    to retrieve pollable pidfds for processes that were not created
    using CLONE_PIDFD.

    A lot of processes get created with traditional PID-based calls
    such as fork() or clone() (without CLONE_PIDFD). For these
    processes a caller can currently not create a pollable pidfd. This
    is a problem for Android's low memory killer (LMK) and service
    managers such as systemd.

    Both patchsets are accompanied by selftests.

    It's perhaps worth noting that the work done so far and the work done
    in this branch for pidfd_open() and polling support do already see
    some adoption:

    - Android is in the process of backporting this work to all their LTS
    kernels [1]

    - Service managers make use of pidfd_send_signal but will need to
    wait until we enable waiting on pidfds for full adoption.

    - And projects I maintain make use of both pidfd_send_signal and
    CLONE_PIDFD [2] and will use polling support and pidfd_open() too"

    [1] https://android-review.googlesource.com/q/topic:%22pidfd+polling+support+4.9+backport%22
    https://android-review.googlesource.com/q/topic:%22pidfd+polling+support+4.14+backport%22
    https://android-review.googlesource.com/q/topic:%22pidfd+polling+support+4.19+backport%22

    [2] https://github.com/lxc/lxc/blob/aab6e3eb73c343231cdde775db938994fc6f2803/src/lxc/start.c#L1753

    * tag 'pidfd-updates-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    tests: add pidfd_open() tests
    arch: wire-up pidfd_open()
    pid: add pidfd_open()
    pidfd: add polling selftests
    pidfd: add polling support

    Linus Torvalds
     

09 Jul, 2019

2 commits

  • …iederm/user-namespace

    Pull force_sig() argument change from Eric Biederman:
    "A source of error over the years has been that force_sig has taken a
    task parameter when it is only safe to use force_sig with the current
    task.

    The force_sig function is built for delivering synchronous signals
    such as SIGSEGV where the userspace application caused a synchronous
    fault (such as a page fault) and the kernel responded with a signal.

    Because the name force_sig does not make this clear, and because the
    force_sig takes a task parameter the function force_sig has been
    abused for sending other kinds of signals over the years. Slowly those
    have been fixed when the oopses have been tracked down.

    This set of changes fixes the remaining abusers of force_sig and
    carefully rips out the task parameter from force_sig and friends
    making this kind of error almost impossible in the future"

    * 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (27 commits)
    signal/x86: Move tsk inside of CONFIG_MEMORY_FAILURE in do_sigbus
    signal: Remove the signal number and task parameters from force_sig_info
    signal: Factor force_sig_info_to_task out of force_sig_info
    signal: Generate the siginfo in force_sig
    signal: Move the computation of force into send_signal and correct it.
    signal: Properly set TRACE_SIGNAL_LOSE_INFO in __send_signal
    signal: Remove the task parameter from force_sig_fault
    signal: Use force_sig_fault_to_task for the two calls that don't deliver to current
    signal: Explicitly call force_sig_fault on current
    signal/unicore32: Remove tsk parameter from __do_user_fault
    signal/arm: Remove tsk parameter from __do_user_fault
    signal/arm: Remove tsk parameter from ptrace_break
    signal/nds32: Remove tsk parameter from send_sigtrap
    signal/riscv: Remove tsk parameter from do_trap
    signal/sh: Remove tsk parameter from force_sig_info_fault
    signal/um: Remove task parameter from send_sigtrap
    signal/x86: Remove task parameter from send_sigtrap
    signal: Remove task parameter from force_sig_mceerr
    signal: Remove task parameter from force_sig
    signal: Remove task parameter from force_sigsegv
    ...

    Linus Torvalds
     
  • Pull audit updates from Paul Moore:
    "This pull request is a bit early, but with some vacation time coming
    up I wanted to send this out now just in case the remote Internet Gods
    decide not to smile on me once the merge window opens. The patchset
    for v5.3 is pretty minor this time, the highlights include:

    - When the audit daemon is sent a signal, ensure we deliver
    information about the sender even when syscall auditing is not
    enabled/supported.

    - Add the ability to filter audit records based on network address
    family.

    - Tighten the audit field filtering restrictions on string based
    fields.

    - Cleanup the audit field filtering verification code.

    - Remove a few BUG() calls from the audit code"

    * tag 'audit-pr-20190702' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
    audit: remove the BUG() calls in the audit rule comparison functions
    audit: enforce op for string fields
    audit: add saddr_fam filter field
    audit: re-structure audit field valid checks
    audit: deliver signal_info regarless of syscall

    Linus Torvalds
     

29 Jun, 2019

1 commit

  • This is the minimal fix for stable, I'll send cleanups later.

    Commit 854a6ed56839 ("signal: Add restore_user_sigmask()") introduced
    the visible change which breaks user-space: a signal temporarily unblocked
    by set_user_sigmask() can be delivered even if the caller returns
    success or timeout.

    Change restore_user_sigmask() to accept the additional "interrupted"
    argument which should be used instead of signal_pending() check, and
    update the callers.

    Eric said:

    : For clarity. I don't think this is required by posix, or fundamentally to
    : remove the races in select. It is what linux has always done and we have
    : applications who care so I agree this fix is needed.
    :
    : Further in any case where the semantic change that this patch rolls back
    : (aka where allowing a signal to be delivered and the select like call to
    : complete) would be advantage we can do as well if not better by using
    : signalfd.
    :
    : Michael is there any chance we can get this guarantee of the linux
    : implementation of pselect and friends clearly documented. The guarantee
    : that if the system call completes successfully we are guaranteed that no
    : signal that is unblocked by using sigmask will be delivered?
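
    The guarantee discussed above can be probed from userspace. The
    following is only a sketch under stated assumptions: the names
    pselect_demo, pselect_result and on_usr1 are made up for the example,
    and the one-second timeout is just a safety bound. It blocks SIGUSR1,
    makes it pending, and then calls pselect() with an empty temporary
    mask, either with a readable pipe or with nothing ready.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/select.h>
#include <time.h>
#include <unistd.h>

struct pselect_result {
    int ret;          /* return value of pselect() */
    int err;          /* errno right after the call */
    int handler_ran;  /* 1 if the SIGUSR1 handler ran during the call */
};

static volatile sig_atomic_t got_usr1;

static void on_usr1(int sig)
{
    (void)sig;
    got_usr1 = 1;
}

/*
 * Hypothetical demo. With fd_ready != 0 the pipe already holds data, so
 * pselect() can complete successfully; otherwise only the pending signal
 * (or the timeout) can end the call.
 */
struct pselect_result pselect_demo(int fd_ready)
{
    struct pselect_result res = { -1, 0, 0 };
    struct sigaction sa;
    sigset_t block, empty, old;
    struct timespec timeout = { 1, 0 };   /* safety bound of one second */
    fd_set rfds;
    int p[2];

    if (pipe(p) < 0)
        return res;
    if (fd_ready && write(p[1], "x", 1) != 1)
        goto out;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_usr1;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGUSR1, &sa, NULL);

    sigemptyset(&empty);
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    sigprocmask(SIG_BLOCK, &block, &old);

    got_usr1 = 0;
    raise(SIGUSR1);               /* stays pending while blocked */

    FD_ZERO(&rfds);
    FD_SET(p[0], &rfds);
    errno = 0;
    res.ret = pselect(p[0] + 1, &rfds, NULL, NULL, &timeout, &empty);
    res.err = errno;
    res.handler_ran = got_usr1;   /* sampled before the old mask returns */

    sigprocmask(SIG_SETMASK, &old, NULL);
out:
    close(p[0]);
    close(p[1]);
    return res;
}
```

    With nothing ready, pselect_demo(0) is expected to report -1 with
    EINTR and handler_ran set. On a kernel carrying this fix,
    pselect_demo(1) should return 1 with handler_ran still 0, since the
    call completed successfully and the temporary mask must not let the
    signal through.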

    Link: http://lkml.kernel.org/r/20190604134117.GA29963@redhat.com
    Fixes: 854a6ed56839a40f6b5d02a2962f48841482eec4 ("signal: Add restore_user_sigmask()")
    Signed-off-by: Oleg Nesterov
    Reported-by: Eric Wong
    Tested-by: Eric Wong
    Acked-by: "Eric W. Biederman"
    Acked-by: Arnd Bergmann
    Acked-by: Deepa Dinamani
    Cc: Michael Kerrisk
    Cc: Jens Axboe
    Cc: Davidlohr Bueso
    Cc: Jason Baron
    Cc: Thomas Gleixner
    Cc: Al Viro
    Cc: David Laight
    Cc: [5.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

28 Jun, 2019

1 commit

  • This patch adds polling support to pidfd.

    Android low memory killer (LMK) needs to know when a process dies once
    it is sent the kill signal. It does so by checking for the existence of
    /proc/pid which is both racy and slow. For example, if a PID is reused
    between when LMK sends a kill signal and when it checks for existence
    of the PID, the wrong process is possibly checked for existence.
    Using the polling support, LMK will be able to get notified when a process
    exits in a race-free and fast way, and it allows the LMK to do other things
    (such as polling on other fds) while awaiting the process being killed
    to die.

    For notification to polling processes, we follow the same existing
    mechanism in the kernel used when the parent of the task group is to be
    notified of a child's death (do_notify_parent). This is precisely when the
    tasks waiting on a poll of pidfd are also awakened in this patch.

    We have decided to include the waitqueue in struct pid for the following
    reasons:
    1. The wait queue has to survive for the lifetime of the poll. Including
    it in task_struct would not be an option in this case because the task can
    be reaped and destroyed before the poll returns.

    2. Including the waitqueue in struct pid means that during
    de_thread(), the new thread group leader automatically gets the new
    waitqueue/pid even though its task_struct is different.

    Appropriate test cases are added in the second patch to provide coverage of
    all the cases the patch is handling.
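
    For readers who want to see the polling interface from the userspace
    side, here is a minimal sketch. It is not part of this patch: it
    obtains the pidfd with pidfd_open(), which is a separate syscall, and
    for self-containment it polls a child rather than an unrelated process
    as LMK would. The helper name poll_child_exit and the fallback syscall
    number 434 are assumptions for the example.

```c
#define _GNU_SOURCE
#include <poll.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fallback for older userspace headers (assumed generic syscall number). */
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif

/*
 * Hypothetical helper: fork a short-lived child and wait for its exit by
 * polling a pidfd. Returns 1 if poll() reported the pidfd readable, 0 on
 * timeout, -2 if no pidfd could be obtained, and -1 on other errors.
 */
int poll_child_exit(int timeout_ms)
{
    int result;
    int pidfd;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(0);

    /* The unreaped child keeps its PID, so this cannot hit a reused PID. */
    pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
    if (pidfd < 0) {
        result = -2;
    } else {
        struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
        int n = poll(&pfd, 1, timeout_ms);

        if (n < 0)
            result = -1;
        else if (n == 0)
            result = 0;
        else
            result = (pfd.revents & POLLIN) ? 1 : -1;
        close(pidfd);
    }

    waitpid(pid, NULL, 0);   /* polling only notifies, it does not reap */
    return result;
}
```

    The same pollfd entry could sit next to other file descriptors in one
    poll() or epoll loop, which is the "do other things while waiting"
    property described above.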

    Cc: Andy Lutomirski
    Cc: Steven Rostedt
    Cc: Daniel Colascione
    Cc: Jann Horn
    Cc: Tim Murray
    Cc: Jonathan Kowalski
    Cc: Linus Torvalds
    Cc: Al Viro
    Cc: Kees Cook
    Cc: David Howells
    Cc: Oleg Nesterov
    Cc: kernel-team@android.com
    Reviewed-by: Oleg Nesterov
    Co-developed-by: Daniel Colascione
    Signed-off-by: Daniel Colascione
    Signed-off-by: Joel Fernandes (Google)
    Signed-off-by: Christian Brauner

    Joel Fernandes (Google)
     

05 Jun, 2019

1 commit

  • Improve the comments for pidfd_send_signal().
    First, the comment still referred to a file descriptor for a process as a
    "task file descriptor" which stems from way back at the beginning of the
    discussion. Replace this with "pidfd" for consistency.
    Second, the wording for the explanation of the arguments to the syscall
    was a bit inconsistent, e.g. some used the past tense, some used the
    present tense. Make the wording more consistent.

    Signed-off-by: Christian Brauner

    Christian Brauner
     

02 Jun, 2019

1 commit

  • In the fixes commit, removing SIGKILL from each thread signal mask and
    executing "goto fatal" directly will skip the call to
    "trace_signal_deliver". At this point, the delivery tracking of the
    SIGKILL signal will be inaccurate.

    Therefore, we need to add trace_signal_deliver before "goto fatal" after
    executing sigdelset.

    Note: SEND_SIG_NOINFO matches the fact that SIGKILL doesn't have any info.

    Link: http://lkml.kernel.org/r/20190425025812.91424-1-weizhenliang@huawei.com
    Fixes: cf43a757fd4944 ("signal: Restore the stop PTRACE_EVENT_EXIT")
    Signed-off-by: Zhenliang Wei
    Reviewed-by: Christian Brauner
    Reviewed-by: Oleg Nesterov
    Cc: Eric W. Biederman
    Cc: Ivan Delalande
    Cc: Arnd Bergmann
    Cc: Thomas Gleixner
    Cc: Deepa Dinamani
    Cc: Greg Kroah-Hartman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhenliang Wei
     

29 May, 2019

7 commits

  • force_sig_info always delivers to the current task and the signal
    parameter always matches info.si_signo. So remove those parameters to
    make it a simpler less error prone interface, and to make it clear
    that none of the callers are doing anything clever.

    This guarantees that force_sig_info will not grow any new buggy
    callers that attempt to call force_sig on a non-current task, or that
    pass a signal number that does not match info.si_signo.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • All callers of force_sig_info pass info.si_signo in for the signal
    by definition as well as in practice.

    Further all callers of force_sig_info except force_sig_fault_to_task
    pass current as the target task to force_sig_info.

    Factor out a static force_sig_info_to_task that
    force_sig_fault_to_task can call.

    This prepares the way for force_sig_info to have its task and signal
    parameters removed.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • In preparation for removing the special case in force_sig_info for
    only having a signal number, generate an appropriate siginfo in
    force_sig, the last caller of force_sig_info that does not
    pass a filled out siginfo.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Forcing a signal or not allowing a pid namespace init to ignore
    SIGKILL or SIGSTOP is more cleanly computed in send_signal.

    There are two cases where we don't allow a pid namespace init
    to ignore SIGKILL or SIGSTOP. If the sending process is
    from an ancestor pid namespace and as such is effectively
    the god to the target process, and if it is the kernel
    that is sending the signal, not another application.

    It is known that a process is from an ancestor pid namespace if
    it can see its target but its target does not have a pid for
    the sender in its pid namespace.

    It is known that a signal is sent from the kernel if si_code is set to
    SI_KERNEL or info is SEND_SIG_PRIV (which ultimately generates
    a signal with si_code == SI_KERNEL).

    The only signals that matter are SIGKILL and SIGSTOP neither of
    which can really be caught, and both of which always have a siginfo
    layout that includes si_uid and si_pid. Therefore we never need
    to worry about forcing a signal when si_pid and si_uid are absent.

    So handle the two special cases of info and the case when si_pid and
    si_uid are present.
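
    The decision described above can be modeled in a few lines of plain C.
    This is a userspace sketch of the reasoning only, not the kernel code:
    the enum, the struct and model_compute_force are hypothetical names,
    and MODEL_SI_KERNEL mirrors the SI_KERNEL value (0x80) from the uapi
    headers.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kinds of `info` the text distinguishes. */
enum info_kind {
    INFO_NOINFO,   /* models SEND_SIG_NOINFO: sent by a userspace process */
    INFO_PRIV,     /* models SEND_SIG_PRIV: sent by the kernel */
    INFO_FULL      /* a filled-in siginfo */
};

#define MODEL_SI_KERNEL 0x80   /* same value as SI_KERNEL */

struct model_siginfo {
    enum info_kind kind;
    int si_code;
    bool has_pid_and_uid;      /* layout carries si_pid and si_uid */
};

/*
 * sender_pid_in_target_ns is the sender's pid as seen from the target's
 * pid namespace; 0 means the target has no pid for the sender, i.e. the
 * sender lives in an ancestor pid namespace.
 */
bool model_compute_force(const struct model_siginfo *info,
                         int sender_pid_in_target_ns)
{
    bool from_ancestor_ns = (sender_pid_in_target_ns == 0);

    switch (info->kind) {
    case INFO_NOINFO:
        /* Userspace sender: force only across the namespace boundary. */
        return from_ancestor_ns;
    case INFO_PRIV:
        /* Kernel generated signals are never ignored. */
        return true;
    case INFO_FULL:
        if (!info->has_pid_and_uid)
            return false;
        return info->si_code == MODEL_SI_KERNEL || from_ancestor_ns;
    }
    return false;
}
```

    In this model a SIGKILL from a sibling process inside the namespace
    (nonzero pid, user si_code) is not forced, while the same signal from
    the kernel or from an ancestor namespace is.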

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Any time siginfo is not stored in the signal queue information is
    lost. Therefore set TRACE_SIGNAL_LOSE_INFO every time the code does
    not allocate a signal queue entry, and a queue overflow abort is not
    triggered.

    Fixes: ba005e1f4172 ("tracepoint: Add signal loss events")
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • As synchronous exceptions really only make sense against the current
    task (otherwise how are you synchronous) remove the task parameter
    from force_sig_fault to make it explicit that is what is going
    on.

    The two known exceptions that deliver a synchronous exception to a
    stopped ptraced task have already been changed to
    force_sig_fault_to_task.

    The callers have been changed with the following emacs regular expression
    (with obvious variations on the architectures that take more arguments)
    to avoid typos:

    force_sig_fault[(]\([^,]+\)[,]\([^,]+\)[,]\([^,]+\)[,]\W+current[)]
    ->
    force_sig_fault(\1,\2,\3)
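
    For readers without emacs at hand, the same rewrite can be sketched
    with POSIX regcomp()/regexec(). This is an approximation, not the tool
    used for the conversion: the helper name drop_current_arg is made up,
    and [^[:alnum:]_]+ stands in for the emacs \W+ class.

```c
#include <regex.h>
#include <stdio.h>
#include <string.h>

/*
 * Rewrite the first "force_sig_fault(a, b, c, current)" call in `src` to
 * "force_sig_fault(a, b, c)". Returns 1 if a call was rewritten, 0 if
 * nothing matched (src is copied unchanged), and -1 on error or when
 * `dst` is too small.
 */
int drop_current_arg(const char *src, char *dst, size_t dst_size)
{
    /* POSIX ERE approximation of the emacs pattern quoted above. */
    static const char pattern[] =
        "force_sig_fault\\(([^,]+),([^,]+),([^,]+),[^[:alnum:]_]+current\\)";
    regex_t re;
    regmatch_t m[4];
    int n;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    if (regexec(&re, src, 4, m, 0) != 0) {
        regfree(&re);
        n = snprintf(dst, dst_size, "%s", src);
        return (n < 0 || (size_t)n >= dst_size) ? -1 : 0;
    }
    regfree(&re);

    /* prefix + call with groups 1..3 + everything after the match */
    n = snprintf(dst, dst_size, "%.*sforce_sig_fault(%.*s,%.*s,%.*s)%s",
                 (int)m[0].rm_so, src,
                 (int)(m[1].rm_eo - m[1].rm_so), src + m[1].rm_so,
                 (int)(m[2].rm_eo - m[2].rm_so), src + m[2].rm_so,
                 (int)(m[3].rm_eo - m[3].rm_so), src + m[3].rm_so,
                 src + m[0].rm_eo);
    return (n < 0 || (size_t)n >= dst_size) ? -1 : 1;
}
```

    Given "force_sig_fault(SIGSEGV, SEGV_MAPERR, addr, current);" the
    helper produces "force_sig_fault(SIGSEGV, SEGV_MAPERR, addr);" and
    leaves calls without a trailing current argument untouched.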

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • In preparation for removing the task parameter from force_sig_fault
    introduce force_sig_fault_to_task and use it for the two cases where
    it matters.

    On mips force_fcr31_sig calls force_sig_fault and is called on either
    the current task, or a task that is suspended and is being switched to
    by the scheduler. This is safe because the task being switched to by
    the scheduler is guaranteed to be suspended. This ensures that
    task->sighand is stable while the signal is delivered to it.

    On parisc user_enable_single_step calls force_sig_fault and is in turn
    called by ptrace_request. The function ptrace_request always calls
    user_enable_single_step on a child that is stopped for tracing. The
    child being traced and not reaped ensures that child->sighand is not
    NULL, and that the child will not change child->sighand.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

27 May, 2019

3 commits


23 May, 2019

2 commits

  • The function send_signal was split from __send_signal so that it would
    be possible to bypass the namespace logic based upon current[1]. As it
    turns out the si_pid and the si_uid fixup are both inappropriate in
    the case of kill_pid_usb_asyncio so move that logic into send_signal.

    It is difficult to arrange but possible for a signal with an si_code
    of SI_TIMER or SI_SIGIO to be sent across namespace boundaries. In
    which case tests for when it is ok to change si_pid and si_uid based
    on SI_FROMUSER are incorrect. Replace the use of SI_FROMUSER with a
    new test has_si_pid_and_uid based on siginfo_layout.

    Now that the uid fixup is no longer present after expanding
    SEND_SIG_NOINFO, properly calculate the si_uid that the target
    task needs to read.

    [1] 7978b567d315 ("signals: add from_ancestor_ns parameter to send_signal()")
    Cc: stable@vger.kernel.org
    Fixes: 6588c1e3ff01 ("signals: SI_USER: Masquerade si_pid when crossing pid ns boundary")
    Fixes: 6b550f949594 ("user namespace: make signal.c respect user namespaces")
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • The usb support for asyncio encoded one of its values in the wrong
    field. It should have used si_value but instead used si_addr which is
    not present in the _rt union member of struct siginfo.

    The practical result of this is that on a 64bit big endian kernel
    when delivering a signal to a 32bit process the si_addr field
    is set to NULL, instead of the expected pointer value.

    This issue can not be fixed in copy_siginfo_to_user32 as the usb
    usage of the _sigfault (aka si_addr) member of the siginfo
    union when SI_ASYNCIO is set is incompatible with the POSIX and
    glibc usage of the _rt member of the siginfo union.

    Therefore replace kill_pid_info_as_cred with kill_pid_usb_asyncio, a
    dedicated function for this one specific case. There are no other
    users of kill_pid_info_as_cred so this specialization should have no
    impact on the amount of code in the kernel. Have kill_pid_usb_asyncio
    take, instead of a siginfo_t which is difficult and error prone, 3
    arguments: a signal number, an errno value, and an address encoded as
    a sigval_t. The encoding of the address as a sigval_t allows the
    code that reads the userspace request for a signal to handle this
    compat issue along with all of the other compat issues.

    Add BUILD_BUG_ONs in kernel/signal.c to ensure that we can now place
    the pointer value in si_pid (instead of si_addr). That is, the
    code now verifies that si_pid and si_addr always occur at the same
    location. Further the code verifies that for native structures a value
    placed in si_pid and spilling into si_uid will appear in userspace in
    si_addr (on a byte by byte copy of siginfo or a field by field copy of
    siginfo). The code also verifies that for a 64bit kernel and a 32bit
    userspace the 32bit pointer will fit in si_pid.
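
    The layout property those checks rely on can also be observed from
    userspace. The sketch below is an analogue against the libc siginfo_t,
    not the kernel's BUILD_BUG_ON code; both function names are made up
    for the example.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>

/* Returns 1 when si_pid and si_addr start at the same offset in siginfo_t. */
int si_pid_aliases_si_addr(void)
{
    return offsetof(siginfo_t, si_pid) == offsetof(siginfo_t, si_addr);
}

/*
 * Returns 1 when a native pointer fits in the bytes covered by si_pid
 * followed by si_uid, i.e. a pointer stored there can be read back
 * through si_addr without truncation.
 */
int pointer_fits_in_pid_and_uid(void)
{
    siginfo_t info;

    return sizeof(info.si_addr) <= sizeof(info.si_pid) + sizeof(info.si_uid);
}
```

    Both helpers are expected to return 1 on Linux; a signal handler such
    as the ones in the test program can therefore keep reading the pointer
    through info->si_addr.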

    I have used the usbsig.c program below written by Alan Stern and
    slightly tweaked by me to run on a big endian machine to verify the
    issue exists (on sparc64) and to confirm the patch below fixes the issue.

    /* usbsig.c -- test USB async signal delivery */

    #define _GNU_SOURCE
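    #include <stdio.h>
    #include <fcntl.h>
    #include <signal.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <endian.h>
    #include <linux/usb/ch9.h>
    #include <linux/usbdevice_fs.h>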

    static struct usbdevfs_urb urb;
    static struct usbdevfs_disconnectsignal ds;
    static volatile sig_atomic_t done = 0;

    void urb_handler(int sig, siginfo_t *info , void *ucontext)
    {
        printf("Got signal %d, signo %d errno %d code %d addr: %p urb: %p\n",
               sig, info->si_signo, info->si_errno, info->si_code,
               info->si_addr, &urb);

        printf("%s\n", (info->si_addr == &urb) ? "Good" : "Bad");
    }

    void ds_handler(int sig, siginfo_t *info , void *ucontext)
    {
        printf("Got signal %d, signo %d errno %d code %d addr: %p ds: %p\n",
               sig, info->si_signo, info->si_errno, info->si_code,
               info->si_addr, &ds);

        printf("%s\n", (info->si_addr == &ds) ? "Good" : "Bad");
        done = 1;
    }

    int main(int argc, char **argv)
    {
        char *devfilename;
        int fd;
        int rc;
        struct sigaction act;
        struct usb_ctrlrequest *req;
        void *ptr;
        char buf[80];

        if (argc != 2) {
            fprintf(stderr, "Usage: usbsig device-file-name\n");
            return 1;
        }

        devfilename = argv[1];
        fd = open(devfilename, O_RDWR);
        if (fd == -1) {
            perror("Error opening device file");
            return 1;
        }

        act.sa_sigaction = urb_handler;
        sigemptyset(&act.sa_mask);
        act.sa_flags = SA_SIGINFO;

        rc = sigaction(SIGUSR1, &act, NULL);
        if (rc == -1) {
            perror("Error in sigaction");
            return 1;
        }

        act.sa_sigaction = ds_handler;
        sigemptyset(&act.sa_mask);
        act.sa_flags = SA_SIGINFO;

        rc = sigaction(SIGUSR2, &act, NULL);
        if (rc == -1) {
            perror("Error in sigaction");
            return 1;
        }

        memset(&urb, 0, sizeof(urb));
        urb.type = USBDEVFS_URB_TYPE_CONTROL;
        urb.endpoint = USB_DIR_IN | 0;
        urb.buffer = buf;
        urb.buffer_length = sizeof(buf);
        urb.signr = SIGUSR1;

        req = (struct usb_ctrlrequest *) buf;
        req->bRequestType = USB_DIR_IN | USB_TYPE_STANDARD | USB_RECIP_DEVICE;
        req->bRequest = USB_REQ_GET_DESCRIPTOR;
        req->wValue = htole16(USB_DT_DEVICE << 8);
        req->wIndex = htole16(0);
        req->wLength = htole16(sizeof(buf) - sizeof(*req));

        rc = ioctl(fd, USBDEVFS_SUBMITURB, &urb);
        if (rc == -1) {
            perror("Error in SUBMITURB ioctl");
            return 1;
        }

        rc = ioctl(fd, USBDEVFS_REAPURB, &ptr);
        if (rc == -1) {
            perror("Error in REAPURB ioctl");
            return 1;
        }

        memset(&ds, 0, sizeof(ds));
        ds.signr = SIGUSR2;
        ds.context = &ds;
        rc = ioctl(fd, USBDEVFS_DISCSIGNAL, &ds);
        if (rc == -1) {
            perror("Error in DISCSIGNAL ioctl");
            return 1;
        }

        printf("Waiting for usb disconnect\n");
        while (!done) {
            sleep(1);
        }

        close(fd);
        return 0;
    }

    Cc: Greg Kroah-Hartman
    Cc: linux-usb@vger.kernel.org
    Cc: Alan Stern
    Cc: Oliver Neukum
    Fixes: v2.3.39
    Cc: stable@vger.kernel.org
    Acked-by: Alan Stern
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

22 May, 2019

1 commit

  • When a process signals the audit daemon (shutdown, rotate, resume,
    reconfig) but syscall auditing is not enabled, we still want to know the
    identity of the process sending the signal to the audit daemon.

    Move audit_signal_info() out of syscall auditing to general auditing but
    create a new function audit_signal_info_syscall() to take care of the
    syscall dependent parts for when syscall auditing is enabled.

    Please see the github kernel audit issue
    https://github.com/linux-audit/audit-kernel/issues/111

    Signed-off-by: Richard Guy Briggs
    Signed-off-by: Paul Moore

    Richard Guy Briggs
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

17 May, 2019

1 commit

  • Alex Xu reported a regression in strace, caused by the introduction of
    the cgroup v2 freezer. The regression can be reproduced by stracing
    the following simple program:

    #include <unistd.h>

    int main() {
        write(1, "a", 1);
        return 0;
    }

    An attempt to run strace ./a.out leads to the infinite loop:
    [ pre-main omitted ]
    write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
    write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
    write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
    write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
    write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
    write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
    [ repeats forever ]

    The problem occurs because the traced task leaves ptrace_stop()
    (and the signal handling loop) with the frozen bit set. So let's
    call cgroup_leave_frozen(true) unconditionally after sleeping
    in ptrace_stop().

    With this patch applied, strace works as expected:
    [ pre-main omitted ]
    write(1, "a", 1) = 1
    exit_group(0) = ?
    +++ exited with 0 +++

    Reported-by: Alex Xu
    Fixes: 76f969e8948d ("cgroup: cgroup v2 freezer")
    Signed-off-by: Roman Gushchin
    Acked-by: Oleg Nesterov
    Cc: Tejun Heo
    Signed-off-by: Tejun Heo

    Roman Gushchin
     

15 May, 2019

1 commit

  • There is a plan to build the kernel with -Wimplicit-fallthrough and this
    place in the code produced a warning (W=1).

    This commit removes the following warning:

    kernel/signal.c:795:13: warning: this statement may fall through [-Wimplicit-fallthrough=]

    Link: http://lkml.kernel.org/r/20190114203505.17875-1-malat@debian.org
    Signed-off-by: Mathieu Malaterre
    Acked-by: Gustavo A. R. Silva
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mathieu Malaterre
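
As a generic illustration of the kind of warning -Wimplicit-fallthrough reports (this is not the code from kernel/signal.c, and the commit's actual change is not shown here), the sketch below marks intentional fall-through explicitly. The FALLTHROUGH macro and the steps_remaining() function are hypothetical names made up for this example.

```c
/* Hypothetical helper macro: use the compiler attribute where available. */
#if defined(__has_attribute)
# if __has_attribute(fallthrough)
#  define FALLTHROUGH __attribute__((fallthrough))
# endif
#endif
#ifndef FALLTHROUGH
# define FALLTHROUGH do { } while (0) /* fallback: no annotation available */
#endif

/* Made-up example: each state also needs the work of all lower states. */
int steps_remaining(int state)
{
    int steps = 0;

    switch (state) {
    case 2:
        steps++;
        FALLTHROUGH; /* intentional: state 2 continues into state 1 */
    case 1:
        steps++;
        FALLTHROUGH; /* intentional: state 1 continues into state 0 */
    case 0:
        steps++;
        break;
    default:
        steps = -1; /* unknown state */
        break;
    }
    return steps;
}
```

Without the annotations, a compiler run with -Wimplicit-fallthrough would flag the first two case bodies, because it cannot tell a deliberate fall-through from a forgotten break.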
     

10 May, 2019

1 commit

  • Pull cgroup updates from Tejun Heo:
    "This includes Roman's cgroup2 freezer implementation.

    It's a separate mechanism from cgroup1 freezer. Instead of blocking
    user tasks in arbitrary uninterruptible sleeps, the new implementation
    extends jobctl stop - frozen tasks are trapped in jobctl stop until
    thawed and can be killed and ptraced. Lots of thanks to Oleg for
    shepherding the effort.

    Other than that, there are a few trivial changes"

    * 'for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: never call do_group_exit() with task->frozen bit set
    kernel: cgroup: fix misuse of %x
    cgroup: get rid of cgroup_freezer_frozen_exit()
    cgroup: prevent spurious transition into non-frozen state
    cgroup: Remove unused cgrp variable
    cgroup: document cgroup v2 freezer interface
    cgroup: add tracing points for cgroup v2 freezer
    cgroup: make TRACE_CGROUP_PATH irq-safe
    kselftests: cgroup: add freezer controller self-tests
    kselftests: cgroup: don't fail on cg_kill_all() error in cg_destroy()
    cgroup: cgroup v2 freezer
    cgroup: protect cgroup->nr_(dying_)descendants by css_set_lock
    cgroup: implement __cgroup_task_count() helper
    cgroup: rename freezer.c into legacy_freezer.c
    cgroup: remove extra cgroup_migrate_finish() call

    Linus Torvalds
     

09 May, 2019

1 commit

  • I've got two independent reports that cgroup_task_frozen() check
    in cgroup_exit() has been triggered by lkp libhugetlbfs-test and
    LTP ptrace01 tests.

    For example:
    [ 44.576072] WARNING: CPU: 1 PID: 3028 at kernel/cgroup/cgroup.c:5932 cgroup_exit+0x148/0x160
    [ 44.577724] Modules linked in: crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel sr_mod cdrom
    bochs_drm sg ttm ata_generic pata_acpi ppdev drm_kms_helper snd_pcm syscopyarea aesni_intel snd_timer
    sysfillrect sysimgblt snd crypto_simd cryptd glue_helper soundcore fb_sys_fops joydev drm serio_raw pcspkr
    ata_piix libata i2c_piix4 floppy parport_pc parport ip_tables
    [ 44.583106] CPU: 1 PID: 3028 Comm: ptrace-write-hu Not tainted 5.1.0-rc3-00053-g9262503 #5
    [ 44.584600] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
    [ 44.586116] RIP: 0010:cgroup_exit+0x148/0x160
    [ 44.587135] Code: 0f 84 50 ff ff ff 48 8b 85 c8 0c 00 00 48 8b 78 70 e8 ec 2e 00 00 e9 3b ff ff ff f0 ff 43 60
    0f 88 72 21 89 00 e9 48 ff ff ff 0b e9 1b ff ff ff e8 3c 73 f4 ff 66 90 66 2e 0f 1f 84 00 00 00
    [ 44.590113] RSP: 0018:ffffb25702dcfd30 EFLAGS: 00010002
    [ 44.591167] RAX: ffff96a7fee32410 RBX: ffff96a7ff1d6000 RCX: dead000000000200
    [ 44.592446] RDX: ffff96a7ff1d6080 RSI: ffff96a7fec75290 RDI: ffff96a7fec75290
    [ 44.593715] RBP: ffff96a7fec745c0 R08: ffff96a7fec74658 R09: 0000000000000000
    [ 44.594985] R10: 0000000000000000 R11: 0000000000000001 R12: ffff96a7fec75101
    [ 44.596266] R13: ffff96a7fec745c0 R14: ffff96a7ff3bde30 R15: ffff96a7fec75130
    [ 44.597550] FS: 0000000000000000(0000) GS:ffff96a7dd700000(0000) knlGS:0000000000000000
    [ 44.598950] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
    [ 44.600098] CR2: 00000000f7a00000 CR3: 000000000d20e000 CR4: 00000000000406e0
    [ 44.601417] Call Trace:
    [ 44.602777] do_exit+0x337/0xc40
    [ 44.603677] do_group_exit+0x3a/0xa0
    [ 44.604610] get_signal+0x12e/0x8d0
    [ 44.605533] ? __switch_to_asm+0x40/0x70
    [ 44.606503] do_signal+0x36/0x650
    [ 44.607409] ? __switch_to_asm+0x40/0x70
    [ 44.608383] ? __schedule+0x267/0x860
    [ 44.609329] exit_to_usermode_loop+0x89/0xf0
    [ 44.610349] do_fast_syscall_32+0x251/0x2e3
    [ 44.611357] entry_SYSENTER_compat+0x7f/0x91
    [ 44.612376] ---[ end trace e4ca5cfc4b7f7964 ]---

    The problem is caused by the ptrace_signal() call in the for loop
    in get_signal(). There is a cgroup_enter_frozen() call inside
    ptrace_signal(), so after exit from ptrace_signal() the task->frozen
    bit might be set. In this case do_group_exit() can be called with the
    task->frozen bit set and trigger the warning. This is the only place where
    we can leave the loop with the task->frozen bit set and without
    setting JOBCTL_TRAP_FREEZE and TIF_SIGPENDING.

    To resolve this problem, let's move cgroup_leave_frozen(true) call to
    just after the fatal label. If the task is going to die, the frozen
    bit must be cleared no matter how we get into this point.

    Reported-by: kernel test robot
    Reported-by: Qian Cai
    Cc: Oleg Nesterov
    Cc: Tejun Heo
    Signed-off-by: Roman Gushchin
    Signed-off-by: Tejun Heo

    Roman Gushchin
     

08 May, 2019

1 commit

  • Pull pidfd updates from Christian Brauner:
    "This patchset makes it possible to retrieve pidfds at process creation
    time by introducing the new flag CLONE_PIDFD to the clone() system
    call. Linus originally suggested to implement this as a new flag to
    clone() instead of making it a separate system call.

    After a thorough review from Oleg, CLONE_PIDFD returns pidfds in the
    parent_tidptr argument. This means we can give back the associated pid
    and the pidfd at the same time. Access to process metadata information
    thus becomes rather trivial.

    As has been agreed, CLONE_PIDFD creates file descriptors based on
    anonymous inodes similar to the new mount api. They are made
    unconditional by this patchset as they are now needed by core kernel
    code (vfs, pidfd) even more than they already were before (timerfd,
    signalfd, io_uring, epoll etc.). The core patchset is rather small.
    The bulky looking changelist is caused by David's very simple changes
    to Kconfig to make anon inodes unconditional.

    A pidfd comes with additional information in fdinfo if the kernel
    supports procfs. The fdinfo file contains the pid of the process in
    the caller's pid namespace in the same format as the procfs status
    file, i.e. "Pid:\t%d".

    To remove worries about missing metadata access this patchset comes
    with a sample/test program that illustrates how a combination of
    CLONE_PIDFD and pidfd_send_signal() can be used to gain race-free
    access to process metadata through /proc/.

    Further work based on this patchset has been done by Joel. His work
    makes pidfds pollable. It finished too late for this merge window. I
    would prefer to have it sitting in linux-next for a while and send it
    for inclusion during the 5.3 merge window"

    * tag 'pidfd-v5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    samples: show race-free pidfd metadata access
    signal: support CLONE_PIDFD with pidfd_send_signal
    clone: add CLONE_PIDFD
    Make anon_inodes unconditional

    Linus Torvalds
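
The description above can be sketched from userspace roughly as follows. This is a minimal sketch, assuming a glibc system on a kernel that supports CLONE_PIDFD: the glibc clone() wrapper passes its fifth argument as parent_tidptr, which is where the pull request says the pidfd is returned. The function names spawn_with_pidfd() and parse_fdinfo_pid() are hypothetical, and the fallback value for CLONE_PIDFD is taken from the kernel UAPI headers.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

#ifndef CLONE_PIDFD
#define CLONE_PIDFD 0x00001000 /* value from the kernel UAPI headers */
#endif

/* Trivial child body: exit immediately with status 0. */
static int child_fn(void *arg)
{
    (void)arg;
    return 0;
}

/* Create a child and receive its pidfd through the parent_tidptr slot.
 * Returns the child's pid (or -1) and stores the pidfd in *pidfd_out. */
pid_t spawn_with_pidfd(int *pidfd_out)
{
    const size_t stack_size = 64 * 1024;
    char *stack = malloc(stack_size);
    int pidfd = -1;
    pid_t pid;

    if (!stack)
        return -1;
    /* No CLONE_VM: the child runs on its own copy of this stack. */
    pid = clone(child_fn, stack + stack_size, CLONE_PIDFD | SIGCHLD,
                NULL, &pidfd);
    free(stack);
    if (pid > 0)
        *pidfd_out = pidfd;
    return pid;
}

/* Extract the pid from fdinfo text containing a "Pid:\t%d" line.
 * Returns -1 if no such line is present. */
int parse_fdinfo_pid(const char *fdinfo_text)
{
    const char *line = fdinfo_text;
    int pid;

    while (line && *line) {
        if (sscanf(line, "Pid:\t%d", &pid) == 1)
            return pid;
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return -1;
}
```

A caller would typically read /proc/self/fdinfo/<pidfd> into a buffer and hand it to parse_fdinfo_pid(), then compare the result with the pid returned by spawn_with_pidfd() before reaping the child.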
     

07 May, 2019

1 commit

  • Let pidfd_send_signal() use pidfds retrieved via CLONE_PIDFD. With this
    patch pidfd_send_signal() becomes independent of procfs. This fulfills
    the request made when we merged the pidfd_send_signal() patchset. The
    pidfd_send_signal() syscall is now always available allowing for it to
    be used by users without procfs mounted or even users without procfs
    support compiled into the kernel.

    Signed-off-by: Christian Brauner
    Co-developed-by: Jann Horn
    Signed-off-by: Jann Horn
    Acked-by: Oleg Nesterov
    Cc: Arnd Bergmann
    Cc: "Eric W. Biederman"
    Cc: Kees Cook
    Cc: Thomas Gleixner
    Cc: David Howells
    Cc: "Michael Kerrisk (man-pages)"
    Cc: Andy Lutomirsky
    Cc: Andrew Morton
    Cc: Aleksa Sarai
    Cc: Linus Torvalds
    Cc: Al Viro

    Christian Brauner
     

06 May, 2019

1 commit

  • If freezing of a cgroup races with waking of a task from
    the frozen state (like waiting in vfork() or in do_signal_stop()),
    a spurious transition of the cgroup state can happen.

    The task enters cgroup_leave_frozen(true), the cgroup->nr_frozen_tasks
    counter decrements, and the cgroup is switched to the unfrozen state.

    To prevent it, let's reserve cgroup_leave_frozen(true) for
    terminating processes and use cgroup_leave_frozen(false) otherwise.

    To avoid busy-looping in the signal handling loop waiting
    for JOBCTL_TRAP_FREEZE set from the cgroup freezing path,
    let's do it explicitly in cgroup_leave_frozen(), if the task
    is going to stay frozen.

    Suggested-by: Oleg Nesterov
    Signed-off-by: Roman Gushchin
    Signed-off-by: Tejun Heo

    Roman Gushchin
     

20 Apr, 2019

1 commit

  • Cgroup v1 implements the freezer controller, which provides an ability
    to stop the workload in a cgroup and temporarily free up some
    resources (cpu, io, network bandwidth and, potentially, memory)
    for some other tasks. Cgroup v2 lacks this functionality.

    This patch implements freezer for cgroup v2.

    Cgroup v2 freezer tries to put tasks into a state similar to jobctl
    stop. This means that tasks can be killed, ptraced (using
    PTRACE_SEIZE*), and interrupted. It is possible to attach to
    a frozen task, get some information (e.g. read registers) and detach.
    It's also possible to migrate a frozen task to another cgroup.

    This distinguishes the cgroup v2 freezer from the cgroup v1 freezer, which
    mostly tried to imitate the system-wide freezer. While uninterruptible
    sleep is fine when all tasks are going to be frozen (the hibernation case),
    it is not an acceptable state for a subset of the system.

    The cgroup v2 freezer does not support freezing kthreads.
    If a non-root cgroup contains a kthread, the cgroup can still be frozen,
    but the kthread will remain running, the cgroup will be shown
    as non-frozen, and the notification will not be delivered.

    * PTRACE_ATTACH is not working because non-fatal signal delivery
    is blocked in frozen state.

    There are some interface differences between cgroup v1 and cgroup v2
    freezer too, which are required to conform to the cgroup v2 interface
    design principles:
    1) There is no separate controller, which has to be turned on:
    the functionality is always available and is represented by
    cgroup.freeze and cgroup.events cgroup control files.
    2) The desired state is defined by the cgroup.freeze control file.
    Any hierarchical configuration is allowed.
    3) The interface is asynchronous. The actual state is available
    using cgroup.events control file ("frozen" field). There are no
    dedicated transitional states.
    4) It's allowed to make any changes to the cgroup hierarchy
    (create new cgroups, remove old cgroups, move tasks between cgroups)
    no matter if some cgroups are frozen.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Tejun Heo
    No-objection-from-me-by: Oleg Nesterov
    Cc: kernel-team@fb.com

    Roman Gushchin
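
The asynchronous interface described in points 2) and 3) can be sketched as below: the desired state is written to cgroup.freeze, and the actual state is read back from the "frozen" field of cgroup.events. This is only an illustrative sketch. The function names are hypothetical, the cgroup directory is whatever path the caller supplies, and the "key value" per-line layout of cgroup.events is assumed.

```c
#include <stdio.h>
#include <string.h>

/* Request the desired state by writing "1" or "0" to <dir>/cgroup.freeze.
 * Returns 0 on success, -1 on failure. */
int cgroup_set_freeze(const char *cgroup_dir, int freeze)
{
    char path[4096];
    FILE *f;
    int ok;

    snprintf(path, sizeof(path), "%s/cgroup.freeze", cgroup_dir);
    f = fopen(path, "w");
    if (!f)
        return -1;
    ok = fputs(freeze ? "1" : "0", f) >= 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Inspect the text of cgroup.events for the "frozen" field.
 * Returns 1 if frozen, 0 if not frozen, -1 if the field is missing. */
int cgroup_events_frozen(const char *events_text)
{
    const char *line = events_text;
    int value;

    while (line && *line) {
        if (sscanf(line, "frozen %d", &value) == 1)
            return value != 0;
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return -1;
}
```

Because there are no dedicated transitional states, a caller that needs to know when freezing has completed would re-read cgroup.events (for example after a poll notification) until cgroup_events_frozen() reports 1.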
     

18 Apr, 2019

1 commit

    As stated in the original commit for pidfd_send_signal(), we don't allow
    signaling processes through O_PATH file descriptors, since doing so is
    semantically equivalent to a write on the pidfd.

    We already correctly error out right now and return EBADF if an O_PATH
    fd is passed. This is because we use file->f_op to detect whether a
    pidfd is passed and O_PATH fds have their file->f_op set to empty_fops
    in do_dentry_open() and thus fail the test.

    Thus, there is no regression. It's just semantically correct to use
    fdget() and return an error right from there instead of taking a
    reference and returning an error later.

    Signed-off-by: Christian Brauner
    Acked-by: Oleg Nesterov
    Cc: Arnd Bergmann
    Cc: "Eric W. Biederman"
    Cc: Kees Cook
    Cc: Thomas Gleixner
    Cc: Jann Horn
    Cc: David Howells
    Cc: "Michael Kerrisk (man-pages)"
    Cc: Andy Lutomirsky
    Cc: Andrew Morton
    Cc: Oleg Nesterov
    Cc: Aleksa Sarai
    Cc: Al Viro
    Signed-off-by: Linus Torvalds

    Christian Brauner
     

02 Apr, 2019

1 commit

  • The current sys_pidfd_send_signal() silently turns signals with explicit
    SI_USER context that are sent to non-current tasks into signals with
    kernel-generated siginfo.
    This is unlike do_rt_sigqueueinfo(), which returns -EPERM in this case.
    If a user actually wants to send a signal with kernel-provided siginfo,
    they can do that with pidfd_send_signal(pidfd, sig, NULL, 0); so allowing
    this case is unnecessary.

    Instead of silently replacing the siginfo, just bail out with an error;
    this is consistent with other interfaces and avoids special-casing behavior
    based on security checks.

    Fixes: 3eb39f47934f ("signal: add pidfd_send_signal() syscall")
    Signed-off-by: Jann Horn
    Signed-off-by: Christian Brauner

    Jann Horn
     

17 Mar, 2019

1 commit

  • Pull pidfd system call from Christian Brauner:
    "This introduces the ability to use file descriptors from /proc//
    as stable handles on struct pid. Even if a pid is recycled the handle
    will not change. For a start these fds can be used to send signals to
    the processes they refer to.

    With the ability to use /proc/ fds as stable handles on struct
    pid we can fix a long-standing issue where after a process has exited
    its pid can be reused by another process. If a caller sends a signal
    to a reused pid it will end up signaling the wrong process.

    With this patchset we enable a variety of use cases. One obvious
    example is that we can now safely delegate an important part of
    process management - sending signals - to processes other than the
    parent of a given process by sending file descriptors around via scm
    rights and not fearing that the given process will have been recycled
    in the meantime. It also allows for easy testing whether a given
    process is still alive or not by sending signal 0 to a pidfd which is
    quite handy.

    There has been some interest in this feature e.g. from systems
    management (systemd, glibc) and container managers. I have requested
    and gotten comments from glibc to make sure that this syscall is
    suitable for their needs as well. In the future I expect it to take on
    most other pid-based signal syscalls. But such features are left for
    the future once they are needed.

    This has been sitting in linux-next for quite a while and has not
    caused any issues. It comes with selftests which verify basic
    functionality and also test that a recycled pid cannot be signaled via
    a pidfd.

    Jon has written about a prior version of this patchset. It should
    cover the basic functionality since not a lot has changed since then:

    https://lwn.net/Articles/773459/

    The commit message for the syscall itself is extensively documenting
    the syscall, including its functionality and extensibility"

    * tag 'pidfd-v5.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    selftests: add tests for pidfd_send_signal()
    signal: add pidfd_send_signal() syscall

    Linus Torvalds
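
The liveness check mentioned in the pull request (sending signal 0 to a pidfd) might look like the sketch below. It assumes the libc may not provide a wrapper, so it goes through syscall(); the fallback syscall number 424 is an assumption that holds for the generic syscall table but not necessarily for every architecture. The helper names are hypothetical.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SYS_pidfd_send_signal
#define SYS_pidfd_send_signal 424 /* assumption: generic syscall number */
#endif

/* Thin wrapper: NULL siginfo and flags 0, i.e. kernel-provided siginfo. */
int pidfd_signal(int pidfd, int sig)
{
    return (int)syscall(SYS_pidfd_send_signal, pidfd, sig, NULL, 0U);
}

/* Obtain a stable handle by opening the /proc/<pid> directory. */
int open_proc_pidfd(pid_t pid)
{
    char path[64];

    snprintf(path, sizeof(path), "/proc/%d", (int)pid);
    return open(path, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
}

/* Signal 0 performs the permission and existence checks only.
 * Returns 1 if the process exists, 0 if it is gone, -1 on other errors. */
int pidfd_is_alive(int pidfd)
{
    if (pidfd_signal(pidfd, 0) == 0)
        return 1;
    return errno == ESRCH ? 0 : -1;
}
```

A caller could use open_proc_pidfd(getpid()) followed by pidfd_is_alive() to check itself. Because the handle refers to the struct pid rather than the numeric pid, a recycled pid would yield ESRCH instead of signaling an unrelated process.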
     

06 Mar, 2019

1 commit

  • Pull year 2038 updates from Thomas Gleixner:
    "Another round of changes to make the kernel ready for 2038. After lots
    of preparatory work this is the first set of syscalls which are 2038
    safe:

    403 clock_gettime64
    404 clock_settime64
    405 clock_adjtime64
    406 clock_getres_time64
    407 clock_nanosleep_time64
    408 timer_gettime64
    409 timer_settime64
    410 timerfd_gettime64
    411 timerfd_settime64
    412 utimensat_time64
    413 pselect6_time64
    414 ppoll_time64
    416 io_pgetevents_time64
    417 recvmmsg_time64
    418 mq_timedsend_time64
    419 mq_timedreceive_time64
    420 semtimedop_time64
    421 rt_sigtimedwait_time64
    422 futex_time64
    423 sched_rr_get_interval_time64

    The syscall numbers are identical all over the architectures"

    * 'timers-2038-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
    riscv: Use latest system call ABI
    checksyscalls: fix up mq_timedreceive and stat exceptions
    unicore32: Fix __ARCH_WANT_STAT64 definition
    asm-generic: Make time32 syscall numbers optional
    asm-generic: Drop getrlimit and setrlimit syscalls from default list
    32-bit userspace ABI: introduce ARCH_32BIT_OFF_T config option
    compat ABI: use non-compat openat and open_by_handle_at variants
    y2038: add 64-bit time_t syscalls to all 32-bit architectures
    y2038: rename old time and utime syscalls
    y2038: remove struct definition redirects
    y2038: use time32 syscall names on 32-bit
    syscalls: remove obsolete __IGNORE_ macros
    y2038: syscalls: rename y2038 compat syscalls
    x86/x32: use time64 versions of sigtimedwait and recvmmsg
    timex: change syscalls to use struct __kernel_timex
    timex: use __kernel_timex internally
    sparc64: add custom adjtimex/clock_adjtime functions
    time: fix sys_timer_settime prototype
    time: Add struct __kernel_timex
    time: make adjtime compat handling available for 32 bit
    ...

    Linus Torvalds
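
For background on why these 64-bit time syscalls are needed, the sketch below shows where a signed 32-bit seconds counter runs out. It is a userspace illustration only and does not call any of the syscalls listed above; utc_year_of() is a hypothetical helper.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Return the UTC calendar year for a count of seconds since the epoch,
 * or -1 if the conversion fails. */
int utc_year_of(int64_t seconds_since_epoch)
{
    time_t t = (time_t)seconds_since_epoch;
    struct tm tm;

    if (!gmtime_r(&t, &tm))
        return -1;
    return tm.tm_year + 1900;
}
```

Here utc_year_of(INT32_MAX) gives 2038, the last year a signed 32-bit time_t can represent, while utc_year_of(INT32_MIN), the value such a counter wraps to one second later, gives 1901.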