02 May, 2020

1 commit

  • [ Upstream commit eaec2b0bd30690575c581eebffae64bfb7f684ac ]

    In kill_pid_usb_asyncio, if the signal is not valid, we do not need to
    set the info struct.

    Signed-off-by: Zhiqiang Liu
    Acked-by: Christian Brauner
    Link: https://lore.kernel.org/r/f525fd08-1cf7-fb09-d20c-4359145eb940@huawei.com
    Signed-off-by: Christian Brauner
    Signed-off-by: Sasha Levin

    Zhiqiang Liu
     

29 Apr, 2020

1 commit

  • commit 61e713bdca3678e84815f2427f7a063fc353a1fc upstream.

    Christof Meerwald writes:
    > Hi,
    >
    > this is probably related to commit
    > 7a0cf094944e2540758b7f957eb6846d5126f535 (signal: Correct namespace
    > fixups of si_pid and si_uid).
    >
    > With a 5.6.5 kernel I am seeing SIGCHLD signals that don't include a
    > properly set si_pid field - this seems to happen for multi-threaded
    > child processes.
    >
    > A simple test program (based on the sample from the signalfd man page):
    >
    > #include <sys/signalfd.h>
    > #include <signal.h>
    > #include <unistd.h>
    > #include <stdlib.h>
    > #include <stdio.h>
    > #include <spawn.h>
    >
    > #define handle_error(msg) \
    >     do { perror(msg); exit(EXIT_FAILURE); } while (0)
    >
    > int main(int argc, char *argv[])
    > {
    >     sigset_t mask;
    >     int sfd;
    >     struct signalfd_siginfo fdsi;
    >     ssize_t s;
    >
    >     sigemptyset(&mask);
    >     sigaddset(&mask, SIGCHLD);
    >
    >     if (sigprocmask(SIG_BLOCK, &mask, NULL) == -1)
    >         handle_error("sigprocmask");
    >
    >     pid_t chldpid;
    >     char *chldargv[] = { "./sfdclient", NULL };
    >     posix_spawn(&chldpid, "./sfdclient", NULL, NULL, chldargv, NULL);
    >
    >     sfd = signalfd(-1, &mask, 0);
    >     if (sfd == -1)
    >         handle_error("signalfd");
    >
    >     for (;;) {
    >         s = read(sfd, &fdsi, sizeof(struct signalfd_siginfo));
    >         if (s != sizeof(struct signalfd_siginfo))
    >             handle_error("read");
    >
    >         if (fdsi.ssi_signo == SIGCHLD) {
    >             printf("Got SIGCHLD %d %d %d %d\n",
    >                    fdsi.ssi_status, fdsi.ssi_code,
    >                    fdsi.ssi_uid, fdsi.ssi_pid);
    >             return 0;
    >         } else {
    >             printf("Read unexpected signal\n");
    >         }
    >     }
    > }
    >
    >
    > and a multi-threaded client to test with:
    >
    > #include <pthread.h>
    > #include <unistd.h>
    >
    > void *f(void *arg)
    > {
    >     sleep(100);
    >     return NULL;
    > }
    >
    > int main()
    > {
    >     pthread_t t[8];
    >
    >     for (int i = 0; i != 8; ++i)
    >     {
    >         pthread_create(&t[i], NULL, f, NULL);
    >     }
    > }
    >
    > I tried to do a bit of debugging and what seems to be happening is
    > that
    >
    > /* From an ancestor pid namespace? */
    > if (!task_pid_nr_ns(current, task_active_pid_ns(t))) {
    >
    > fails inside task_pid_nr_ns because the check for "pid_alive" fails.
    >
    > This code seems to be called from do_notify_parent and there we
    > actually have "tsk != current" (I am assuming both are threads of the
    > current process?)

    I instrumented the code with a warning and received the following backtrace:
    > WARNING: CPU: 0 PID: 777 at kernel/pid.c:501 __task_pid_nr_ns.cold.6+0xc/0x15
    > Modules linked in:
    > CPU: 0 PID: 777 Comm: sfdclient Not tainted 5.7.0-rc1userns+ #2924
    > Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    > RIP: 0010:__task_pid_nr_ns.cold.6+0xc/0x15
    > Code: ff 66 90 48 83 ec 08 89 7c 24 04 48 8d 7e 08 48 8d 74 24 04 e8 9a b6 44 00 48 83 c4 08 c3 48 c7 c7 59 9f ac 82 e8 c2 c4 04 00 0b e9 3fd
    > RSP: 0018:ffffc9000042fbf8 EFLAGS: 00010046
    > RAX: 000000000000000c RBX: 0000000000000000 RCX: ffffc9000042faf4
    > RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff81193d29
    > RBP: ffffc9000042fc18 R08: 0000000000000000 R09: 0000000000000001
    > R10: 000000100f938416 R11: 0000000000000309 R12: ffff8880b941c140
    > R13: 0000000000000000 R14: 0000000000000000 R15: ffff8880b941c140
    > FS: 0000000000000000(0000) GS:ffff8880bca00000(0000) knlGS:0000000000000000
    > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    > CR2: 00007f2e8c0a32e0 CR3: 0000000002e10000 CR4: 00000000000006f0
    > Call Trace:
    > send_signal+0x1c8/0x310
    > do_notify_parent+0x50f/0x550
    > release_task.part.21+0x4fd/0x620
    > do_exit+0x6f6/0xaf0
    > do_group_exit+0x42/0xb0
    > get_signal+0x13b/0xbb0
    > do_signal+0x2b/0x670
    > ? __audit_syscall_exit+0x24d/0x2b0
    > ? rcu_read_lock_sched_held+0x4d/0x60
    > ? kfree+0x24c/0x2b0
    > do_syscall_64+0x176/0x640
    > ? trace_hardirqs_off_thunk+0x1a/0x1c
    > entry_SYSCALL_64_after_hwframe+0x49/0xb3

    The immediate problem is, as Christof noticed, that "pid_alive(current) == false".
    This happens because do_notify_parent is called from the last thread to exit
    in a process after that thread has been reaped.

    The bigger issue is that do_notify_parent can be called from any
    process that manages to wait on a thread of a multi-threaded process
    from wait_task_zombie. So any logic based upon current for
    do_notify_parent is just nonsense, as current can be pretty much
    anything.

    So change do_notify_parent to call __send_signal directly.

    Inspecting the code it appears this problem has existed since the pid
    namespace support started handling this case in 2.6.30. This fix only
    backports to 7a0cf094944e ("signal: Correct namespace fixups of si_pid and si_uid")
    where the problem logic was moved out of __send_signal and into send_signal.

    Cc: stable@vger.kernel.org
    Fixes: 6588c1e3ff01 ("signals: SI_USER: Masquerade si_pid when crossing pid ns boundary")
    Ref: 921cf9f63089 ("signals: protect cinit from unblocked SIG_DFL signals")
    Link: https://lore.kernel.org/lkml/20200419201336.GI22017@edge.cmeerw.net/
    Reported-by: Christof Meerwald
    Acked-by: Oleg Nesterov
    Acked-by: Christian Brauner
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     

17 Apr, 2020

1 commit

  • commit d1e7fd6462ca9fc76650fbe6ca800e35b24267da upstream.

    Replace the 32bit exec_id with a 64bit exec_id to make it impossible
    to wrap the exec_id counter. With care an attacker can cause exec_id
    wrap and send arbitrary signals to a newly exec'd parent. This
    bypasses the signal sending checks if the parent changes their
    credentials during exec.

    The severity of this problem can be seen in my limited testing of a
    32bit exec_id: it can take as little as 19s to exec 65536 times,
    which means that it can take as little as 14 days to wrap a 32bit
    exec_id. Adam Zabrocki has succeeded in wrapping the self_exec_id in 7
    days. Even my slower timing is within the uptime of a typical server,
    which means self_exec_id is simply a speed bump today, and if exec
    gets noticeably faster self_exec_id won't even be a speed bump.

    Extending self_exec_id to 64bits introduces a problem on 32bit
    architectures where reading self_exec_id is no longer atomic and can
    take two read instructions. This means that it is possible to hit
    a window where the read value of exec_id does not match the written
    value. So with very lucky timing this remains exploitable even after
    this change.

    I have updated the update of exec_id on exec to use WRITE_ONCE
    and the read of exec_id in do_notify_parent to use READ_ONCE
    to make it clear that there is no locking between these two
    locations.

    Link: https://lore.kernel.org/kernel-hardening/20200324215049.GA3710@pi3.com.pl
    Fixes: 2.3.23pre2
    Cc: stable@vger.kernel.org
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
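
The wrap-time estimate in the exec_id commit above can be checked with a few lines of arithmetic. This is a minimal sketch, not kernel code: the function names are hypothetical, and the only input is the rate quoted in the commit message (65536 execs in 19 s), assumed to stay constant.

```c
#include <assert.h>

/*
 * Seconds needed to wrap an exec_id counter that is `bits` wide, assuming
 * the rate measured in the commit message: 65536 execs every 19 seconds.
 * Hypothetical helper for illustration only.
 */
static double exec_id_wrap_seconds(unsigned bits)
{
    assert(bits <= 64);

    double execs = 1.0;
    for (unsigned i = 0; i < bits; i++)
        execs *= 2.0;               /* 2^bits execs wrap the counter */

    return execs / 65536.0 * 19.0;  /* 19 s per batch of 65536 execs */
}

/* Convert seconds to days. */
static double seconds_to_days(double seconds)
{
    return seconds / 86400.0;
}
```

For a 32bit counter this gives 1245184 seconds, or roughly 14.4 days, which matches the figure in the commit message. At the same rate a 64bit counter would need well over a hundred million years to wrap.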
     

21 Mar, 2020

1 commit

  • [ Upstream commit fda31c50292a5062332fa0343c084bd9f46604d9 ]

    When queueing a signal, we increment both the users count of pending
    signals (for RLIMIT_SIGPENDING tracking) and we increment the refcount
    of the user struct itself (because we keep a reference to the user in
    the signal structure in order to correctly account for it when freeing).

    That turns out to be fairly expensive, because both of them are atomic
    updates, and particularly under extreme signal handling pressure on big
    machines, you can get a lot of cache contention on the user struct.
    That can then cause horrid cacheline ping-pong when you do these
    multiple accesses.

    So change the reference counting to only pin the user for the _first_
    pending signal, and to unpin it when the last pending signal is
    dequeued. That means that when a user sees a lot of concurrent signal
    queuing - which is the only situation when this matters - the only
    atomic access needed is generally the 'sigpending' count update.

    This was noticed because of a particularly odd timing artifact on a
    dual-socket 96C/192T Cascade Lake platform: when you get into bad
    contention, on that machine it for some reason seems to be much worse
    when the contention happens in the upper 32-byte half of the cacheline.

    As a result, the kernel test robot will-it-scale 'signal1' benchmark had
    an odd performance regression simply due to random alignment of the
    'struct user_struct' (and pointed to a completely unrelated and
    apparently nonsensical commit for the regression).

    Avoiding the double increments (and decrements on the dequeueing side,
    of course) makes for much less contention and hugely improved
    performance on that will-it-scale microbenchmark.

    Quoting Feng Tang:

    "It makes a big difference, that the performance score is tripled! bump
    from original 17000 to 54000. Also the gap between 5.0-rc6 and
    5.0-rc6+Jiri's patch is reduced to around 2%"

    [ The "2% gap" is the odd cacheline placement difference on that
    platform: under the extreme contention case, the effect of which half
    of the cacheline was hot was 5%, so with the reduced contention the
    odd timing artifact is reduced too ]

    It does help in the non-contended case too, but is not nearly as
    noticeable.

    Reported-and-tested-by: Feng Tang
    Cc: Eric W. Biederman
    Cc: Huang, Ying
    Cc: Philip Li
    Cc: Andi Kleen
    Cc: Jiri Olsa
    Cc: Peter Zijlstra
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Linus Torvalds
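
The "pin only for the first pending signal" scheme described in the commit above can be modelled in userspace with C11 atomics. This is a sketch under stated assumptions, not the kernel implementation: `struct user_model`, `queue_signal()` and `dequeue_signal()` are hypothetical names, and the RLIMIT_SIGPENDING check is left out.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the per-user accounting structure. */
struct user_model {
    atomic_int refcount;    /* references keeping the structure alive */
    atomic_int sigpending;  /* number of signals currently queued */
};

/* Queue one signal: only the 0 -> 1 transition takes a reference. */
static void queue_signal(struct user_model *u)
{
    /* atomic_fetch_add returns the previous value. */
    if (atomic_fetch_add(&u->sigpending, 1) == 0)
        atomic_fetch_add(&u->refcount, 1);
}

/* Dequeue one signal: only the 1 -> 0 transition drops the reference. */
static void dequeue_signal(struct user_model *u)
{
    assert(atomic_load(&u->sigpending) > 0);

    if (atomic_fetch_sub(&u->sigpending, 1) == 1)
        atomic_fetch_sub(&u->refcount, 1);
}
```

With many signals queued at once, every call after the first touches only `sigpending`, which is the reduction in atomic traffic the commit message describes.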
     

11 Oct, 2019

1 commit


17 Sep, 2019

1 commit

  • Pull pidfd/waitid updates from Christian Brauner:
    "This contains two features and various tests.

    First, it adds support for waiting on processes through pidfds by adding
    the P_PIDFD type to the waitid() syscall. This completes the basic
    functionality of the pidfd api (cf. [1]). In the meantime we also have
    a new addition to the userspace projects that make use of the pidfd
    api. The qt project was nice enough to send a mail pointing out that
    they have a pr up to switch to the pidfd api (cf. [2]).

    Second, this tag contains an extension to the waitid() syscall to make
    it possible to wait on the current process group in a race free manner
    (even though the actual problem is very unlikely) by specifying 0
    together with the P_PGID type. This extension traces back to a
    discussion on the glibc development mailing list.

    There are also a range of tests for the features above. Additionally,
    the test-suite which detected the pidfd-polling race we fixed in [3]
    is included in this tag"

    [1] https://lwn.net/Articles/794707/
    [2] https://codereview.qt-project.org/c/qt/qtbase/+/108456
    [3] commit b191d6491be6 ("pidfd: fix a poll race when setting exit_state")

    * tag 'core-process-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    waitid: Add support for waiting for the current process group
    tests: add pidfd poll tests
    tests: move common definitions and functions into pidfd.h
    pidfd: add pidfd_wait tests
    pidfd: add P_PIDFD to waitid()

    Linus Torvalds
     

19 Aug, 2019

1 commit

  • My recent change to only use force_sig for synchronous events
    wound up breaking signal reception in cifs and drbd. I had overlooked
    the fact that by default kthreads start out with all signals set to
    SIG_IGN. So a change I thought was safe turned out to have made it
    impossible for those kernel threads to catch their signals.

    Reverting the work on force_sig is a bad idea because what the code
    was doing was very much a misuse of force_sig. As the way force_sig
    ultimately allowed the signal to happen was to change the signal
    handler to SIG_DFL. Which after the first signal will allow userspace
    to send signals to these kernel threads. At least for
    wake_ack_receiver in drbd that does not appear actively wrong.

    So correct this problem by adding allow_kernel_signal that will allow
    signals whose siginfo reports they were sent by the kernel through,
    but will not allow userspace generated signals, and update cifs and
    drbd to call allow_kernel_signal in an appropriate place so that their
    thread can receive this signal.

    Fixing things this way ensures that userspace won't be able to send
    signals and cause problems, that it is clear which signals the
    threads are expecting to receive, and it guarantees that nothing
    else in the system will be affected.

    This change was partly inspired by similar cifs and drbd patches that
    added allow_signal.

    Reported-by: ronnie sahlberg
    Reported-by: Christoph Böhmwalder
    Tested-by: Christoph Böhmwalder
    Cc: Steve French
    Cc: Philipp Reisner
    Cc: David Laight
    Fixes: 247bc9470b1e ("cifs: fix rmmod regression in cifs.ko caused by force_sig changes")
    Fixes: 72abe3bcf091 ("signal/cifs: Fix cifs_put_tcp_session to call send_sig instead of force_sig")
    Fixes: fee109901f39 ("signal/drbd: Use send_sig not force_sig")
    Fixes: 3cf5d076fb4d ("signal: Remove task parameter from force_sig")
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

03 Aug, 2019

1 commit

  • The kernel-doc parser doesn't handle expressions with %foo*. Instead,
    when an asterisk should be part of a constant, it uses an alternative
    notation: `foo*`.

    Link: http://lkml.kernel.org/r/7f18c2e0b5e39e6b7eb55ddeb043b8b260b49f2d.1563361575.git.mchehab+samsung@kernel.org
    Signed-off-by: Mauro Carvalho Chehab
    Cc: Deepa Dinamani
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mauro Carvalho Chehab
     

02 Aug, 2019

1 commit

  • This adds the P_PIDFD type to waitid().
    One of the last remaining bits for the pidfd api is to make it possible
    to wait on pidfds. With P_PIDFD added to waitid() the parts of userspace
    that want to use the pidfd api to exclusively manage processes can do so
    now.

    One of the things this will unblock in the future is the ability to make
    it possible to retrieve the exit status via waitid(P_PIDFD) for
    non-parent processes if handed a _suitable_ pidfd that has this feature
    set. This is similar to what you can do on FreeBSD with kqueue(). It
    might even end up being possible to wait on a process as a non-parent if
    an appropriate property is enabled on the pidfd.

    With P_PIDFD no scoping of the process identified by the pidfd is
    possible, i.e. it explicitly blocks things such as wait4(-1), wait4(0),
    waitid(P_ALL), waitid(P_PGID) etc. It only allows for semantics
    equivalent to wait4(pid), waitid(P_PID). Users that need scoping should
    rely on pid-based wait*() syscalls for now.

    Signed-off-by: Christian Brauner
    Reviewed-by: Kees Cook
    Reviewed-by: Oleg Nesterov
    Cc: Arnd Bergmann
    Cc: "Eric W. Biederman"
    Cc: Joel Fernandes (Google)
    Cc: Thomas Gleixner
    Cc: David Howells
    Cc: Jann Horn
    Cc: Andy Lutomirsky
    Cc: Andrew Morton
    Cc: Aleksa Sarai
    Cc: Linus Torvalds
    Cc: Al Viro
    Link: https://lore.kernel.org/r/20190727222229.6516-2-christian@brauner.io

    Christian Brauner
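
From userspace, the waitid(P_PIDFD) interface added by the commit above might be used as follows. This is a minimal sketch that assumes a kernel with both pidfd_open() and P_PIDFD support (5.4 or later). The fallback definitions of `SYS_pidfd_open` and `P_PIDFD` cover older libc headers, and `reap_child_via_pidfd()` is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434      /* syscall number, for older headers */
#endif
#ifndef P_PIDFD
#define P_PIDFD 3               /* idtype value, for older headers */
#endif

/*
 * Fork a child that exits with `code`, then reap it through a pidfd.
 * Returns the child's exit status, or -1 on any failure.
 */
static int reap_child_via_pidfd(int code)
{
    assert(code >= 0 && code <= 255);

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);

    /* glibc only gained a pidfd_open() wrapper later, so use syscall(). */
    int pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
    if (pidfd < 0) {
        waitpid(pid, NULL, 0);  /* do not leave a zombie behind */
        return -1;
    }

    siginfo_t info;
    memset(&info, 0, sizeof(info));
    int rc = waitid(P_PIDFD, (id_t)pidfd, &info, WEXITED);
    close(pidfd);

    if (rc < 0) {
        waitpid(pid, NULL, 0);  /* fall back to a pid-based wait */
        return -1;
    }
    if (info.si_code != CLD_EXITED)
        return -1;
    return info.si_status;
}
```

As the commit message notes, the pidfd identifies exactly one process, so this is equivalent to waitid(P_PID) and offers no way to wait on a group or on any child.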
     

29 Jul, 2019

1 commit

  • Previously a condition got missed where the pidfd waiters are awakened
    before the exit_state gets set. This can result in a missed notification
    [1] and the polling thread waiting forever.

    It is fixed now, however it would be nice to avoid this kind of issue
    going unnoticed in the future. So just add a warning to catch it in the
    future.

    /* References */
    [1]: https://lore.kernel.org/lkml/20190717172100.261204-1-joel@joelfernandes.org/

    Signed-off-by: Joel Fernandes (Google)
    Link: https://lore.kernel.org/r/20190724164816.201099-1-joel@joelfernandes.org
    Signed-off-by: Christian Brauner

    Joel Fernandes (Google)
     

17 Jul, 2019

1 commit

  • task->saved_sigmask and ->restore_sigmask are only used in the ret-from-
    syscall paths. This means that set_user_sigmask() can save ->blocked in
    ->saved_sigmask and do set_restore_sigmask() to indicate that ->blocked
    was modified.

    This way the callers do not need 2 sigset_t's passed to set/restore and
    restore_user_sigmask() renamed to restore_saved_sigmask_unless() turns
    into the trivial helper which just calls restore_saved_sigmask().

    Link: http://lkml.kernel.org/r/20190606113206.GA9464@redhat.com
    Signed-off-by: Oleg Nesterov
    Cc: Deepa Dinamani
    Cc: Arnd Bergmann
    Cc: Jens Axboe
    Cc: Davidlohr Bueso
    Cc: Eric Wong
    Cc: Jason Baron
    Cc: Thomas Gleixner
    Cc: Al Viro
    Cc: Eric W. Biederman
    Cc: David Laight
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

11 Jul, 2019

1 commit

  • Pull pidfd updates from Christian Brauner:
    "This adds two main features.

    - First, it adds polling support for pidfds. This allows process
    managers to know when a (non-parent) process dies in a race-free
    way.

    The notification mechanism used follows the same logic that is
    currently used when the parent of a task is notified of a child's
    death. With this patchset it is possible to put pidfds in an
    {e}poll loop and get reliable notifications for process (i.e.
    thread-group) exit.

    - The second feature complements the first one by making it possible
    to retrieve pollable pidfds for processes that were not created
    using CLONE_PIDFD.

    A lot of processes get created with traditional PID-based calls
    such as fork() or clone() (without CLONE_PIDFD). For these
    processes a caller can currently not create a pollable pidfd. This
    is a problem for Android's low memory killer (LMK) and service
    managers such as systemd.

    Both patchsets are accompanied by selftests.

    It's perhaps worth noting that the work done so far and the work done
    in this branch for pidfd_open() and polling support do already see
    some adoption:

    - Android is in the process of backporting this work to all their LTS
    kernels [1]

    - Service managers make use of pidfd_send_signal but will need to
    wait until we enable waiting on pidfds for full adoption.

    - And projects I maintain make use of both pidfd_send_signal and
    CLONE_PIDFD [2] and will use polling support and pidfd_open() too"

    [1] https://android-review.googlesource.com/q/topic:%22pidfd+polling+support+4.9+backport%22
    https://android-review.googlesource.com/q/topic:%22pidfd+polling+support+4.14+backport%22
    https://android-review.googlesource.com/q/topic:%22pidfd+polling+support+4.19+backport%22

    [2] https://github.com/lxc/lxc/blob/aab6e3eb73c343231cdde775db938994fc6f2803/src/lxc/start.c#L1753

    * tag 'pidfd-updates-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    tests: add pidfd_open() tests
    arch: wire-up pidfd_open()
    pid: add pidfd_open()
    pidfd: add polling selftests
    pidfd: add polling support

    Linus Torvalds
     

09 Jul, 2019

2 commits

  • …iederm/user-namespace

    Pull force_sig() argument change from Eric Biederman:
    "A source of error over the years has been that force_sig has taken a
    task parameter when it is only safe to use force_sig with the current
    task.

    The force_sig function is built for delivering synchronous signals
    such as SIGSEGV where the userspace application caused a synchronous
    fault (such as a page fault) and the kernel responded with a signal.

    Because the name force_sig does not make this clear, and because the
    force_sig takes a task parameter the function force_sig has been
    abused for sending other kinds of signals over the years. Slowly those
    have been fixed when the oopses have been tracked down.

    This set of changes fixes the remaining abusers of force_sig and
    carefully rips out the task parameter from force_sig and friends
    making this kind of error almost impossible in the future"

    * 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (27 commits)
    signal/x86: Move tsk inside of CONFIG_MEMORY_FAILURE in do_sigbus
    signal: Remove the signal number and task parameters from force_sig_info
    signal: Factor force_sig_info_to_task out of force_sig_info
    signal: Generate the siginfo in force_sig
    signal: Move the computation of force into send_signal and correct it.
    signal: Properly set TRACE_SIGNAL_LOSE_INFO in __send_signal
    signal: Remove the task parameter from force_sig_fault
    signal: Use force_sig_fault_to_task for the two calls that don't deliver to current
    signal: Explicitly call force_sig_fault on current
    signal/unicore32: Remove tsk parameter from __do_user_fault
    signal/arm: Remove tsk parameter from __do_user_fault
    signal/arm: Remove tsk parameter from ptrace_break
    signal/nds32: Remove tsk parameter from send_sigtrap
    signal/riscv: Remove tsk parameter from do_trap
    signal/sh: Remove tsk parameter from force_sig_info_fault
    signal/um: Remove task parameter from send_sigtrap
    signal/x86: Remove task parameter from send_sigtrap
    signal: Remove task parameter from force_sig_mceerr
    signal: Remove task parameter from force_sig
    signal: Remove task parameter from force_sigsegv
    ...

    Linus Torvalds
     
  • Pull audit updates from Paul Moore:
    "This pull request is a bit early, but with some vacation time coming
    up I wanted to send this out now just in case the remote Internet Gods
    decide not to smile on me once the merge window opens. The patchset
    for v5.3 is pretty minor this time, the highlights include:

    - When the audit daemon is sent a signal, ensure we deliver
    information about the sender even when syscall auditing is not
    enabled/supported.

    - Add the ability to filter audit records based on network address
    family.

    - Tighten the audit field filtering restrictions on string based
    fields.

    - Cleanup the audit field filtering verification code.

    - Remove a few BUG() calls from the audit code"

    * tag 'audit-pr-20190702' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
    audit: remove the BUG() calls in the audit rule comparison functions
    audit: enforce op for string fields
    audit: add saddr_fam filter field
    audit: re-structure audit field valid checks
    audit: deliver signal_info regarless of syscall

    Linus Torvalds
     

29 Jun, 2019

1 commit

  • This is the minimal fix for stable, I'll send cleanups later.

    Commit 854a6ed56839 ("signal: Add restore_user_sigmask()") introduced
    the visible change which breaks user-space: a signal temporarily unblocked
    by set_user_sigmask() can be delivered even if the caller returns
    success or timeout.

    Change restore_user_sigmask() to accept the additional "interrupted"
    argument which should be used instead of signal_pending() check, and
    update the callers.

    Eric said:

    : For clarity. I don't think this is required by posix, or fundamentally to
    : remove the races in select. It is what linux has always done and we have
    : applications who care so I agree this fix is needed.
    :
    : Further in any case where the semantic change that this patch rolls back
    : (aka where allowing a signal to be delivered and the select like call to
    : complete) would be advantage we can do as well if not better by using
    : signalfd.
    :
    : Michael is there any chance we can get this guarantee of the linux
    : implementation of pselect and friends clearly documented. The guarantee
    : that if the system call completes successfully we are guaranteed that no
    : signal that is unblocked by using sigmask will be delivered?

    Link: http://lkml.kernel.org/r/20190604134117.GA29963@redhat.com
    Fixes: 854a6ed56839a40f6b5d02a2962f48841482eec4 ("signal: Add restore_user_sigmask()")
    Signed-off-by: Oleg Nesterov
    Reported-by: Eric Wong
    Tested-by: Eric Wong
    Acked-by: "Eric W. Biederman"
    Acked-by: Arnd Bergmann
    Acked-by: Deepa Dinamani
    Cc: Michael Kerrisk
    Cc: Jens Axboe
    Cc: Davidlohr Bueso
    Cc: Jason Baron
    Cc: Thomas Gleixner
    Cc: Al Viro
    Cc: David Laight
    Cc: [5.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

28 Jun, 2019

1 commit

  • This patch adds polling support to pidfd.

    Android low memory killer (LMK) needs to know when a process dies once
    it is sent the kill signal. It does so by checking for the existence of
    /proc/pid which is both racy and slow. For example, if a PID is reused
    between when LMK sends a kill signal and when it checks for existence
    of the PID, the wrong process may be checked for existence.
    Using the polling support, LMK will be able to get notified when a process
    exits in a race-free and fast way, and can do other things
    (such as polling on other fds) while waiting for the process being killed
    to die.

    For notification to polling processes, we follow the same existing
    mechanism in the kernel used when the parent of the task group is to be
    notified of a child's death (do_notify_parent). This is precisely when the
    tasks waiting on a poll of pidfd are also awakened in this patch.

    We have decided to include the waitqueue in struct pid for the following
    reasons:
    1. The wait queue has to survive for the lifetime of the poll. Including
    it in task_struct would not be an option in this case because the task can
    be reaped and destroyed before the poll returns.

    2. Including the waitqueue in struct pid means that during
    de_thread(), the new thread group leader automatically gets the new
    waitqueue/pid even though its task_struct is different.

    Appropriate test cases are added in the second patch to provide coverage of
    all the cases the patch is handling.

    Cc: Andy Lutomirski
    Cc: Steven Rostedt
    Cc: Daniel Colascione
    Cc: Jann Horn
    Cc: Tim Murray
    Cc: Jonathan Kowalski
    Cc: Linus Torvalds
    Cc: Al Viro
    Cc: Kees Cook
    Cc: David Howells
    Cc: Oleg Nesterov
    Cc: kernel-team@android.com
    Reviewed-by: Oleg Nesterov
    Co-developed-by: Daniel Colascione
    Signed-off-by: Daniel Colascione
    Signed-off-by: Joel Fernandes (Google)
    Signed-off-by: Christian Brauner

    Joel Fernandes (Google)
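
The polling support described in the commit above can be exercised from userspace roughly like this. It is a sketch, assuming a kernel that provides pidfd_open() and pidfd polling (5.3 or later). `wait_for_exit_with_poll()` is a hypothetical helper, and the forked child stands in for the process a manager such as LMK would be watching.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <poll.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434      /* syscall number, for older headers */
#endif

/*
 * Fork a child that sleeps for `child_sleep_ms` and exits, then poll a
 * pidfd for its exit with a timeout of `timeout_ms`.
 * Returns 1 if the exit was observed, 0 on timeout, -1 on error.
 */
static int wait_for_exit_with_poll(int child_sleep_ms, int timeout_ms)
{
    assert(child_sleep_ms >= 0);

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        usleep((useconds_t)child_sleep_ms * 1000);
        _exit(0);
    }

    int pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
    if (pidfd < 0) {
        waitpid(pid, NULL, 0);
        return -1;
    }

    /* The pidfd becomes readable once the process has exited. */
    struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
    int ready = poll(&pfd, 1, timeout_ms);

    close(pidfd);
    waitpid(pid, NULL, 0);      /* reap the child; blocks if still running */

    if (ready < 0)
        return -1;
    return (ready == 1 && (pfd.revents & POLLIN)) ? 1 : 0;
}
```

Because the pidfd is an ordinary file descriptor, it can sit in the same poll or epoll set as other descriptors, which is what lets a manager keep doing other work while it waits.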
     

05 Jun, 2019

1 commit

  • Improve the comments for pidfd_send_signal().
    First, the comment still referred to a file descriptor for a process as a
    "task file descriptor" which stems from way back at the beginning of the
    discussion. Replace this with "pidfd" for consistency.
    Second, the wording for the explanation of the arguments to the syscall
    was a bit inconsistent, e.g. some used the past tense, some used the
    present tense. Make the wording more consistent.

    Signed-off-by: Christian Brauner

    Christian Brauner
     

02 Jun, 2019

1 commit

  • In the fixes commit, removing SIGKILL from each thread signal mask and
    executing "goto fatal" directly will skip the call to
    "trace_signal_deliver". At this point, the delivery tracking of the
    SIGKILL signal will be inaccurate.

    Therefore, we need to add trace_signal_deliver before "goto fatal" after
    executing sigdelset.

    Note: SEND_SIG_NOINFO matches the fact that SIGKILL doesn't have any info.

    Link: http://lkml.kernel.org/r/20190425025812.91424-1-weizhenliang@huawei.com
    Fixes: cf43a757fd4944 ("signal: Restore the stop PTRACE_EVENT_EXIT")
    Signed-off-by: Zhenliang Wei
    Reviewed-by: Christian Brauner
    Reviewed-by: Oleg Nesterov
    Cc: Eric W. Biederman
    Cc: Ivan Delalande
    Cc: Arnd Bergmann
    Cc: Thomas Gleixner
    Cc: Deepa Dinamani
    Cc: Greg Kroah-Hartman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhenliang Wei
     

29 May, 2019

7 commits

  • force_sig_info always delivers to the current task and the signal
    parameter always matches info.si_signo. So remove those parameters to
    make it a simpler less error prone interface, and to make it clear
    that none of the callers are doing anything clever.

    This guarantees that force_sig_info will not grow any new buggy
    callers that attempt to call force_sig on a non-current task, or that
    pass a signal number that does not match info.si_signo.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • All callers of force_sig_info pass info.si_signo in for the signal
    by definition as well as in practice.

    Further all callers of force_sig_info except force_sig_fault_to_task
    pass current as the target task to force_sig_info.

    Factor out a static force_sig_info_to_task that
    force_sig_fault_to_task can call.

    This prepares the way for force_sig_info to have its task and signal
    parameters removed.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • In preparation for removing the special case in force_sig_info for
    only having a signal number, generate an appropriate siginfo in
    force_sig, the last caller of force_sig_info that does not
    pass a filled out siginfo.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Forcing a signal or not allowing a pid namespace init to ignore
    SIGKILL or SIGSTOP is more cleanly computed in send_signal.

    There are two cases where we don't allow a pid namespace init
    to ignore SIGKILL or SIGSTOP: if the sending process is
    from an ancestor pid namespace and as such is effectively
    the god to the target process, and if it is the kernel
    that is sending the signal, not another application.

    It is known that a process is from an ancestor pid namespace if
    it can see its target but its target does not have a pid for
    the sender in its pid namespace.

    It is known that a signal is sent from the kernel if si_code is set to
    SI_KERNEL or info is SEND_SIG_PRIV (which ultimately generates
    a signal with si_code == SI_KERNEL).

    The only signals that matter are SIGKILL and SIGSTOP neither of
    which can really be caught, and both of which always have a siginfo
    layout that includes si_uid and si_pid. Therefore we never need
    to worry about forcing a signal when si_pid and si_uid are absent.

    So handle the two special cases of info and the case when si_pid and
    si_uid are present.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
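
The decision described in the commit above (when SIGKILL or SIGSTOP must reach a pid namespace init) can be written down as a small truth function. This is a simplified userspace model based only on the commit message, not the kernel source: the enum, the struct, `MODEL_SI_KERNEL` and `compute_force()` are all hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_SI_KERNEL 0x80    /* stand-in for the kernel's SI_KERNEL */

/* Which form of siginfo the sender supplied (hypothetical model). */
enum info_kind {
    INFO_NOINFO,    /* models SEND_SIG_NOINFO */
    INFO_PRIV,      /* models SEND_SIG_PRIV, i.e. sent by the kernel */
    INFO_WITH_IDS,  /* a real siginfo that carries si_pid and si_uid */
    INFO_OTHER      /* a real siginfo without si_pid and si_uid */
};

struct send_ctx {
    enum info_kind kind;
    int si_code;                  /* only meaningful for INFO_WITH_IDS */
    int sender_pid_in_target_ns;  /* 0: sender has no pid in target's ns */
};

/* Should the signal be forced past a pid namespace init? */
static bool compute_force(const struct send_ctx *c)
{
    /* The target cannot see the sender, so the sender is in an ancestor ns. */
    bool from_ancestor = (c->sender_pid_in_target_ns == 0);

    switch (c->kind) {
    case INFO_NOINFO:
        return from_ancestor;
    case INFO_PRIV:
        return true;
    case INFO_WITH_IDS:
        return c->si_code == MODEL_SI_KERNEL || from_ancestor;
    default:
        assert(c->kind == INFO_OTHER);
        return false;
    }
}
```

The first two cases are the "two special cases of info" the commit message mentions. The third is the siginfo layout with si_pid and si_uid present, and anything else is never forced.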
     
  • Any time siginfo is not stored in the signal queue information is
    lost. Therefore set TRACE_SIGNAL_LOSE_INFO every time the code does
    not allocate a signal queue entry, and a queue overflow abort is not
    triggered.

    Fixes: ba005e1f4172 ("tracepoint: Add signal loss events")
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • As synchronous exceptions really only make sense against the current
    task (otherwise how are you synchronous) remove the task parameter
    from force_sig_fault to make it explicit that this is what is going
    on.

    The two known exceptions that deliver a synchronous exception to a
    stopped ptraced task have already been changed to
    force_sig_fault_to_task.

    The callers have been changed with the following emacs regular expression
    (with obvious variations on the architectures that take more arguments)
    to avoid typos:

    force_sig_fault[(]\([^,]+\)[,]\([^,]+\)[,]\([^,]+\)[,]\W+current[)]
    ->
    force_sig_fault(\1,\2,\3)

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • In preparation for removing the task parameter from force_sig_fault
    introduce force_sig_fault_to_task and use it for the two cases where
    it matters.

    On mips force_fcr31_sig calls force_sig_fault and is called on either
    the current task, or a task that is suspended and is being switched to
    by the scheduler. This is safe because the task being switched to by
    the scheduler is guaranteed to be suspended. This ensures that
    task->sighand is stable while the signal is delivered to it.

    On parisc user_enable_single_step calls force_sig_fault and is in turn
    called by ptrace_request. The function ptrace_request always calls
    user_enable_single_step on a child that is stopped for tracing. The
    child being traced and not reaped ensures that child->sighand is not
    NULL, and that the child will not change child->sighand.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

27 May, 2019

3 commits


23 May, 2019

2 commits

  • The function send_signal was split from __send_signal so that it would
    be possible to bypass the namespace logic based upon current[1]. As it
    turns out the si_pid and the si_uid fixup are both inappropriate in
    the case of kill_pid_usb_asyncio so move that logic into send_signal.

    It is difficult to arrange but possible for a signal with an si_code
    of SI_TIMER or SI_SIGIO to be sent across namespace boundaries, in
    which case tests for when it is ok to change si_pid and si_uid based
    on SI_FROMUSER are incorrect. Replace the use of SI_FROMUSER with a
    new test has_si_pid_and_uid based on siginfo_layout.

    Now that the uid fixup is no longer present after expanding
    SEND_SIG_NOINFO, properly calculate the si_uid that the target
    task needs to read.

    [1] 7978b567d315 ("signals: add from_ancestor_ns parameter to send_signal()")
    Cc: stable@vger.kernel.org
    Fixes: 6588c1e3ff01 ("signals: SI_USER: Masquerade si_pid when crossing pid ns boundary")
    Fixes: 6b550f949594 ("user namespace: make signal.c respect user namespaces")
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • The usb support for asyncio encoded one of its values in the wrong
    field. It should have used si_value but instead used si_addr which is
    not present in the _rt union member of struct siginfo.

    The practical result of this is that on a 64bit big endian kernel
    when delivering a signal to a 32bit process the si_addr field
    is set to NULL, instead of the expected pointer value.

    This issue cannot be fixed in copy_siginfo_to_user32 as the usb
    usage of the _sigfault (aka si_addr) member of the siginfo
    union when SI_ASYNCIO is set is incompatible with the POSIX and
    glibc usage of the _rt member of the siginfo union.

    Therefore replace kill_pid_info_as_cred with kill_pid_usb_asyncio, a
    dedicated function for this one specific case. There are no other
    users of kill_pid_info_as_cred so this specialization should have no
    impact on the amount of code in the kernel. Instead of a siginfo_t,
    which is difficult and error prone, have kill_pid_usb_asyncio take 3
    arguments: a signal number, an errno value, and an address encoded as
    a sigval_t. The encoding of the address as a sigval_t allows the
    code that reads the userspace request for a signal to handle this
    compat issue along with all of the other compat issues.

    Add BUILD_BUG_ONs in kernel/signal.c to ensure that we can now place
    the pointer value in si_pid (instead of si_addr). That is, the
    code now verifies that si_pid and si_addr always occur at the same
    location. Further, the code verifies that for native structures a value
    placed in si_pid and spilling into si_uid will appear in userspace in
    si_addr (on a byte by byte copy of siginfo or a field by field copy of
    siginfo). The code also verifies that for a 64bit kernel and a 32bit
    userspace the 32bit pointer will fit in si_pid.
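
    The kind of layout check described above can be approximated in
    userspace. The sketch below is not the kernel's BUILD_BUG_ON code; it
    uses C11 _Static_assert against the libc siginfo_t and assumes a native
    (non-compat) Linux build where si_pid, si_uid and si_addr are
    accessible through offsetof. The helper store_pointer_at_si_pid is a
    hypothetical name.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* si_pid and si_addr must start at the same offset in siginfo_t. */
_Static_assert(offsetof(siginfo_t, si_pid) == offsetof(siginfo_t, si_addr),
               "si_pid and si_addr must share a location");

/* si_uid must directly follow si_pid so a wider value can spill into it. */
_Static_assert(offsetof(siginfo_t, si_uid) ==
               offsetof(siginfo_t, si_pid) + sizeof(pid_t),
               "si_uid must directly follow si_pid");

/* The encoded address must fit in si_pid plus si_uid. */
_Static_assert(sizeof(union sigval) <= sizeof(pid_t) + sizeof(uid_t),
               "sigval must fit in si_pid + si_uid");

/*
 * Hypothetical helper: store an address, encoded as a sigval, at the
 * location of si_pid using a byte by byte copy.
 */
static void store_pointer_at_si_pid(siginfo_t *info, void *ptr)
{
    union sigval value;

    memset(info, 0, sizeof(*info));
    memset(&value, 0, sizeof(value));
    value.sival_ptr = ptr;

    info->si_code = SI_ASYNCIO;
    memcpy((char *)info + offsetof(siginfo_t, si_pid), &value, sizeof(value));
}
```

    With these assertions holding, a reader of the structure sees the
    stored address in si_addr, which is what the usbsig.c test program
    below checks from a signal handler.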

    I have used the usbsig.c program below written by Alan Stern and
    slightly tweaked by me to run on a big endian machine to verify the
    issue exists (on sparc64) and to confirm the patch below fixes the issue.

    /* usbsig.c -- test USB async signal delivery */

    #define _GNU_SOURCE
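    #include <stdio.h>
    #include <fcntl.h>
    #include <signal.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <endian.h>
    #include <linux/usb/ch9.h>
    #include <linux/usbdevice_fs.h>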

    static struct usbdevfs_urb urb;
    static struct usbdevfs_disconnectsignal ds;
    static volatile sig_atomic_t done = 0;

    void urb_handler(int sig, siginfo_t *info , void *ucontext)
    {
        printf("Got signal %d, signo %d errno %d code %d addr: %p urb: %p\n",
               sig, info->si_signo, info->si_errno, info->si_code,
               info->si_addr, &urb);

        printf("%s\n", (info->si_addr == &urb) ? "Good" : "Bad");
    }

    void ds_handler(int sig, siginfo_t *info , void *ucontext)
    {
        printf("Got signal %d, signo %d errno %d code %d addr: %p ds: %p\n",
               sig, info->si_signo, info->si_errno, info->si_code,
               info->si_addr, &ds);

        printf("%s\n", (info->si_addr == &ds) ? "Good" : "Bad");
        done = 1;
    }

    int main(int argc, char **argv)
    {
        char *devfilename;
        int fd;
        int rc;
        struct sigaction act;
        struct usb_ctrlrequest *req;
        void *ptr;
        char buf[80];

        if (argc != 2) {
            fprintf(stderr, "Usage: usbsig device-file-name\n");
            return 1;
        }

        devfilename = argv[1];
        fd = open(devfilename, O_RDWR);
        if (fd == -1) {
            perror("Error opening device file");
            return 1;
        }

        act.sa_sigaction = urb_handler;
        sigemptyset(&act.sa_mask);
        act.sa_flags = SA_SIGINFO;

        rc = sigaction(SIGUSR1, &act, NULL);
        if (rc == -1) {
            perror("Error in sigaction");
            return 1;
        }

        act.sa_sigaction = ds_handler;
        sigemptyset(&act.sa_mask);
        act.sa_flags = SA_SIGINFO;

        rc = sigaction(SIGUSR2, &act, NULL);
        if (rc == -1) {
            perror("Error in sigaction");
            return 1;
        }

        memset(&urb, 0, sizeof(urb));
        urb.type = USBDEVFS_URB_TYPE_CONTROL;
        urb.endpoint = USB_DIR_IN | 0;
        urb.buffer = buf;
        urb.buffer_length = sizeof(buf);
        urb.signr = SIGUSR1;

        req = (struct usb_ctrlrequest *) buf;
        req->bRequestType = USB_DIR_IN | USB_TYPE_STANDARD | USB_RECIP_DEVICE;
        req->bRequest = USB_REQ_GET_DESCRIPTOR;
        req->wValue = htole16(USB_DT_DEVICE << 8);
        req->wIndex = htole16(0);
        req->wLength = htole16(sizeof(buf) - sizeof(*req));

        rc = ioctl(fd, USBDEVFS_SUBMITURB, &urb);
        if (rc == -1) {
            perror("Error in SUBMITURB ioctl");
            return 1;
        }

        rc = ioctl(fd, USBDEVFS_REAPURB, &ptr);
        if (rc == -1) {
            perror("Error in REAPURB ioctl");
            return 1;
        }

        memset(&ds, 0, sizeof(ds));
        ds.signr = SIGUSR2;
        ds.context = &ds;
        rc = ioctl(fd, USBDEVFS_DISCSIGNAL, &ds);
        if (rc == -1) {
            perror("Error in DISCSIGNAL ioctl");
            return 1;
        }

        printf("Waiting for usb disconnect\n");
        while (!done) {
            sleep(1);
        }

        close(fd);
        return 0;
    }

    Cc: Greg Kroah-Hartman
    Cc: linux-usb@vger.kernel.org
    Cc: Alan Stern
    Cc: Oliver Neukum
    Fixes: v2.3.39
    Cc: stable@vger.kernel.org
    Acked-by: Alan Stern
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

22 May, 2019

1 commit

  • When a process signals the audit daemon (shutdown, rotate, resume,
    reconfig) but syscall auditing is not enabled, we still want to know the
    identity of the process sending the signal to the audit daemon.

    Move audit_signal_info() out of syscall auditing to general auditing but
    create a new function audit_signal_info_syscall() to take care of the
    syscall dependent parts for when syscall auditing is enabled.

    Please see the github kernel audit issue
    https://github.com/linux-audit/audit-kernel/issues/111

    Signed-off-by: Richard Guy Briggs
    Signed-off-by: Paul Moore

    Richard Guy Briggs
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

17 May, 2019

1 commit

  • Alex Xu reported a regression in strace, caused by the introduction of
    the cgroup v2 freezer. The regression can be reproduced by stracing
    the following simple program:

    #include <unistd.h>

    int main() {
        write(1, "a", 1);
        return 0;
    }

    An attempt to run strace ./a.out leads to an infinite loop:
    [ pre-main omitted ]
    write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
    write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
    write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
    write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
    write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
    write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
    [ repeats forever ]

    The problem occurs because the traced task leaves ptrace_stop()
    (and the signal handling loop) with the frozen bit set. So let's
    call cgroup_leave_frozen(true) unconditionally after sleeping
    in ptrace_stop().

    With this patch applied, strace works as expected:
    [ pre-main omitted ]
    write(1, "a", 1) = 1
    exit_group(0) = ?
    +++ exited with 0 +++

    Reported-by: Alex Xu
    Fixes: 76f969e8948d ("cgroup: cgroup v2 freezer")
    Signed-off-by: Roman Gushchin
    Acked-by: Oleg Nesterov
    Cc: Tejun Heo
    Signed-off-by: Tejun Heo

    Roman Gushchin
     

15 May, 2019

1 commit

  • There is a plan to build the kernel with -Wimplicit-fallthrough and this
    place in the code produced a warning (W=1).

    This commit removes the following warning:

    kernel/signal.c:795:13: warning: this statement may fall through [-Wimplicit-fallthrough=]

    Link: http://lkml.kernel.org/r/20190114203505.17875-1-malat@debian.org
    Signed-off-by: Mathieu Malaterre
    Acked-by: Gustavo A. R. Silva
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mathieu Malaterre
     

10 May, 2019

1 commit

  • Pull cgroup updates from Tejun Heo:
    "This includes Roman's cgroup2 freezer implementation.

    It's a separate mechanism from cgroup1 freezer. Instead of blocking
    user tasks in arbitrary uninterruptible sleeps, the new implementation
    extends jobctl stop - frozen tasks are trapped in jobctl stop until
    thawed and can be killed and ptraced. Lots of thanks to Oleg for
    shepherding the effort.

    Other than that, there are a few trivial changes"

    * 'for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: never call do_group_exit() with task->frozen bit set
    kernel: cgroup: fix misuse of %x
    cgroup: get rid of cgroup_freezer_frozen_exit()
    cgroup: prevent spurious transition into non-frozen state
    cgroup: Remove unused cgrp variable
    cgroup: document cgroup v2 freezer interface
    cgroup: add tracing points for cgroup v2 freezer
    cgroup: make TRACE_CGROUP_PATH irq-safe
    kselftests: cgroup: add freezer controller self-tests
    kselftests: cgroup: don't fail on cg_kill_all() error in cg_destroy()
    cgroup: cgroup v2 freezer
    cgroup: protect cgroup->nr_(dying_)descendants by css_set_lock
    cgroup: implement __cgroup_task_count() helper
    cgroup: rename freezer.c into legacy_freezer.c
    cgroup: remove extra cgroup_migrate_finish() call

    Linus Torvalds
     

09 May, 2019

1 commit

  • I've got two independent reports that cgroup_task_frozen() check
    in cgroup_exit() has been triggered by lkp libhugetlbfs-test and
    LTP ptrace01 tests.

    For example:
    [ 44.576072] WARNING: CPU: 1 PID: 3028 at kernel/cgroup/cgroup.c:5932 cgroup_exit+0x148/0x160
    [ 44.577724] Modules linked in: crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel sr_mod cdrom
    bochs_drm sg ttm ata_generic pata_acpi ppdev drm_kms_helper snd_pcm syscopyarea aesni_intel snd_timer
    sysfillrect sysimgblt snd crypto_simd cryptd glue_helper soundcore fb_sys_fops joydev drm serio_raw pcspkr
    ata_piix libata i2c_piix4 floppy parport_pc parport ip_tables
    [ 44.583106] CPU: 1 PID: 3028 Comm: ptrace-write-hu Not tainted 5.1.0-rc3-00053-g9262503 #5
    [ 44.584600] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
    [ 44.586116] RIP: 0010:cgroup_exit+0x148/0x160
    [ 44.587135] Code: 0f 84 50 ff ff ff 48 8b 85 c8 0c 00 00 48 8b 78 70 e8 ec 2e 00 00 e9 3b ff ff ff f0 ff 43 60
    0f 88 72 21 89 00 e9 48 ff ff ff 0b e9 1b ff ff ff e8 3c 73 f4 ff 66 90 66 2e 0f 1f 84 00 00 00
    [ 44.590113] RSP: 0018:ffffb25702dcfd30 EFLAGS: 00010002
    [ 44.591167] RAX: ffff96a7fee32410 RBX: ffff96a7ff1d6000 RCX: dead000000000200
    [ 44.592446] RDX: ffff96a7ff1d6080 RSI: ffff96a7fec75290 RDI: ffff96a7fec75290
    [ 44.593715] RBP: ffff96a7fec745c0 R08: ffff96a7fec74658 R09: 0000000000000000
    [ 44.594985] R10: 0000000000000000 R11: 0000000000000001 R12: ffff96a7fec75101
    [ 44.596266] R13: ffff96a7fec745c0 R14: ffff96a7ff3bde30 R15: ffff96a7fec75130
    [ 44.597550] FS: 0000000000000000(0000) GS:ffff96a7dd700000(0000) knlGS:0000000000000000
    [ 44.598950] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
    [ 44.600098] CR2: 00000000f7a00000 CR3: 000000000d20e000 CR4: 00000000000406e0
    [ 44.601417] Call Trace:
    [ 44.602777] do_exit+0x337/0xc40
    [ 44.603677] do_group_exit+0x3a/0xa0
    [ 44.604610] get_signal+0x12e/0x8d0
    [ 44.605533] ? __switch_to_asm+0x40/0x70
    [ 44.606503] do_signal+0x36/0x650
    [ 44.607409] ? __switch_to_asm+0x40/0x70
    [ 44.608383] ? __schedule+0x267/0x860
    [ 44.609329] exit_to_usermode_loop+0x89/0xf0
    [ 44.610349] do_fast_syscall_32+0x251/0x2e3
    [ 44.611357] entry_SYSENTER_compat+0x7f/0x91
    [ 44.612376] ---[ end trace e4ca5cfc4b7f7964 ]---

    The problem is caused by the ptrace_signal() call in the for loop
    in get_signal(). There is a cgroup_enter_frozen() call inside
    ptrace_signal(), so after exit from ptrace_signal() the task->frozen
    bit might be set. In this case do_group_exit() can be called with the
    task->frozen bit set and trigger the warning. This is the only place where
    we can leave the loop with the task->frozen bit set and without
    setting JOBCTL_TRAP_FREEZE and TIF_SIGPENDING.

    To resolve this problem, let's move the cgroup_leave_frozen(true) call to
    just after the fatal label. If the task is going to die, the frozen
    bit must be cleared no matter how we get to this point.

    Reported-by: kernel test robot
    Reported-by: Qian Cai
    Cc: Oleg Nesterov
    Cc: Tejun Heo
    Signed-off-by: Roman Gushchin
    Signed-off-by: Tejun Heo

    Roman Gushchin
     

08 May, 2019

1 commit

  • Pull pidfd updates from Christian Brauner:
    "This patchset makes it possible to retrieve pidfds at process creation
    time by introducing the new flag CLONE_PIDFD to the clone() system
    call. Linus originally suggested to implement this as a new flag to
    clone() instead of making it a separate system call.

    After a thorough review from Oleg CLONE_PIDFD returns pidfds in the
    parent_tidptr argument. This means we can give back the associated pid
    and the pidfd at the same time. Access to process metadata information
    thus becomes rather trivial.

    As has been agreed, CLONE_PIDFD creates file descriptors based on
    anonymous inodes similar to the new mount api. They are made
    unconditional by this patchset as they are now needed by core kernel
    code (vfs, pidfd) even more than they already were before (timerfd,
    signalfd, io_uring, epoll etc.). The core patchset is rather small.
    The bulky looking changelist is caused by David's very simple changes
    to Kconfig to make anon inodes unconditional.

    A pidfd comes with additional information in fdinfo if the kernel
    supports procfs. The fdinfo file contains the pid of the process in
    the caller's pid namespace in the same format as the procfs status
    file, i.e. "Pid:\t%d".

    To remove worries about missing metadata access this patchset comes
    with a sample/test program that illustrates how a combination of
    CLONE_PIDFD and pidfd_send_signal() can be used to gain race-free
    access to process metadata through /proc/.

    Further work based on this patchset has been done by Joel. His work
    makes pidfds pollable. It finished too late for this merge window. I
    would prefer to have it sitting in linux-next for a while and send it
    for inclusion during the 5.3 merge window"

    * tag 'pidfd-v5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    samples: show race-free pidfd metadata access
    signal: support CLONE_PIDFD with pidfd_send_signal
    clone: add CLONE_PIDFD
    Make anon_inodes unconditional

    Linus Torvalds
     

07 May, 2019

1 commit

  • Let pidfd_send_signal() use pidfds retrieved via CLONE_PIDFD. With this
    patch pidfd_send_signal() becomes independent of procfs. This fulfills
    the request made when we merged the pidfd_send_signal() patchset. The
    pidfd_send_signal() syscall is now always available allowing for it to
    be used by users without procfs mounted or even users without procfs
    support compiled into the kernel.
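
    A minimal userspace sketch of the combination described above: obtain
    a pidfd at creation time with CLONE_PIDFD and signal the child with
    pidfd_send_signal(), without touching /proc. The names child_main and
    kill_child_via_pidfd are hypothetical, the fallback constant values are
    assumptions for older headers, and the raw syscall is used because a
    libc wrapper may not be available.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef CLONE_PIDFD
#define CLONE_PIDFD 0x00001000      /* assumed value for older headers */
#endif
#ifndef SYS_pidfd_send_signal
#define SYS_pidfd_send_signal 424   /* assumed syscall number for older headers */
#endif

#define CHILD_STACK_SIZE (64 * 1024)

/* The child just sleeps until it is killed. */
static int child_main(void *arg)
{
    (void)arg;
    for (;;)
        pause();
    return 0;
}

/*
 * Returns 1 if the child was killed through its pidfd, 0 if the pidfd
 * path was not usable here (old kernel, sandbox), -1 on unexpected errors.
 */
static int kill_child_via_pidfd(void)
{
    int pidfd = 0;   /* starts at zero; some kernels reject a non-zero value */
    int status = 0;
    int result;
    pid_t pid;
    char *stack = malloc(CHILD_STACK_SIZE);

    if (!stack)
        return -1;

    /* With CLONE_PIDFD the pidfd is returned in the parent_tid argument. */
    pid = clone(child_main, stack + CHILD_STACK_SIZE,
                CLONE_PIDFD | SIGCHLD, NULL, &pidfd);
    if (pid == -1) {
        free(stack);
        return 0;
    }

    if (syscall(SYS_pidfd_send_signal, pidfd, SIGKILL, NULL, 0) == 0) {
        result = 1;
    } else {
        kill(pid, SIGKILL);   /* fall back so the child does not linger */
        result = 0;
    }

    if (waitpid(pid, &status, 0) != pid ||
        !WIFSIGNALED(status) || WTERMSIG(status) != SIGKILL)
        result = -1;

    if (pidfd > 0)
        close(pidfd);
    free(stack);
    return result;
}
```

    Because the pidfd refers to the process itself rather than to a
    numeric pid, the signal cannot be delivered to an unrelated process
    that happens to reuse the pid.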

    Signed-off-by: Christian Brauner
    Co-developed-by: Jann Horn
    Signed-off-by: Jann Horn
    Acked-by: Oleg Nesterov
    Cc: Arnd Bergmann
    Cc: "Eric W. Biederman"
    Cc: Kees Cook
    Cc: Thomas Gleixner
    Cc: David Howells
    Cc: "Michael Kerrisk (man-pages)"
    Cc: Andy Lutomirsky
    Cc: Andrew Morton
    Cc: Aleksa Sarai
    Cc: Linus Torvalds
    Cc: Al Viro

    Christian Brauner
     

06 May, 2019

1 commit

  • If freezing of a cgroup races with waking of a task from
    the frozen state (like waiting in vfork() or in do_signal_stop()),
    a spurious transition of the cgroup state can happen.

    The task enters cgroup_leave_frozen(true), the cgroup->nr_frozen_tasks
    counter decrements, and the cgroup is switched to the unfrozen state.

    To prevent it, let's reserve cgroup_leave_frozen(true) for
    terminating processes and use cgroup_leave_frozen(false) otherwise.

    To avoid busy-looping in the signal handling loop waiting
    for JOBCTL_TRAP_FREEZE set from the cgroup freezing path,
    let's do it explicitly in cgroup_leave_frozen(), if the task
    is going to stay frozen.
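
    The rule above can be illustrated with a toy userspace model. It is
    not the kernel implementation: toy_cgroup, toy_task and
    toy_leave_frozen are hypothetical names, locking is omitted, and the
    trap_freeze and sigpending fields merely stand in for
    JOBCTL_TRAP_FREEZE and TIF_SIGPENDING.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified cgroup state. */
struct toy_cgroup {
    bool freeze_requested;   /* the cgroup is supposed to be frozen */
    int nr_frozen_tasks;     /* models cgroup->nr_frozen_tasks */
};

/* Hypothetical, simplified task state. */
struct toy_task {
    struct toy_cgroup *cgrp;
    bool frozen;             /* models task->frozen */
    bool trap_freeze;        /* models JOBCTL_TRAP_FREEZE */
    bool sigpending;         /* models TIF_SIGPENDING */
};

/* Model of leaving the frozen state after a wakeup. */
static void toy_leave_frozen(struct toy_task *t, bool always_leave)
{
    if (always_leave || !t->cgrp->freeze_requested) {
        /* Terminating task, or the cgroup is no longer being frozen. */
        t->cgrp->nr_frozen_tasks--;
        t->frozen = false;
    } else if (!t->trap_freeze) {
        /*
         * The task stays frozen: request the freeze trap explicitly so
         * the signal handling loop does not spin waiting for it.
         */
        t->trap_freeze = true;
        t->sigpending = true;
    }
}
```

    In the model, a wakeup with always_leave == false while the cgroup is
    still being frozen leaves nr_frozen_tasks untouched, so the cgroup does
    not make a spurious transition to the unfrozen state.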

    Suggested-by: Oleg Nesterov
    Signed-off-by: Roman Gushchin
    Signed-off-by: Tejun Heo

    Roman Gushchin
     

20 Apr, 2019

1 commit

  • Cgroup v1 implements the freezer controller, which provides an ability
    to stop the workload in a cgroup and temporarily free up some
    resources (cpu, io, network bandwidth and, potentially, memory)
    for some other tasks. Cgroup v2 lacks this functionality.

    This patch implements freezer for cgroup v2.

    Cgroup v2 freezer tries to put tasks into a state similar to jobctl
    stop. This means that tasks can be killed, ptraced (using
    PTRACE_SEIZE*), and interrupted. It is possible to attach to
    a frozen task, get some information (e.g. read registers) and detach.
    It's also possible to migrate a frozen task to another cgroup.

    This distinguishes the cgroup v2 freezer from the cgroup v1 freezer, which
    mostly tried to imitate the system-wide freezer. While uninterruptible
    sleep is fine when all tasks are going to be frozen (hibernation case),
    it's not an acceptable state for some subset of the system.

    The cgroup v2 freezer does not support freezing kthreads.
    If a non-root cgroup contains a kthread, the cgroup can still be frozen,
    but the kthread will remain running, the cgroup will be shown
    as non-frozen, and the notification will not be delivered.

    * PTRACE_ATTACH is not working because non-fatal signal delivery
    is blocked in frozen state.

    There are some interface differences between cgroup v1 and cgroup v2
    freezer too, which are required to conform to the cgroup v2 interface
    design principles:
    1) There is no separate controller, which has to be turned on:
    the functionality is always available and is represented by
    cgroup.freeze and cgroup.events cgroup control files.
    2) The desired state is defined by the cgroup.freeze control file.
    Any hierarchical configuration is allowed.
    3) The interface is asynchronous. The actual state is available
    using cgroup.events control file ("frozen" field). There are no
    dedicated transitional states.
    4) It's allowed to make any changes with the cgroup hierarchy
    (create new cgroups, remove old cgroups, move tasks between cgroups)
    no matter if some cgroups are frozen.
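
    From userspace, the asynchronous interface in points 2) and 3) could
    be driven roughly as follows. This is a sketch under assumptions: the
    helper names are hypothetical, the cgroup directory is supplied by the
    caller, and cgroup.events is assumed to contain one "key value" pair
    per line.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return the value of the "frozen" key in cgroup.events text, or -1. */
static int parse_frozen_field(const char *events_text)
{
    const char *line = events_text;

    while (line && *line) {
        int value;

        if (sscanf(line, "frozen %d", &value) == 1)
            return value;
        line = strchr(line, '\n');
        if (line)
            line++;   /* step past the newline to the next line */
    }
    return -1;
}

/* Write the desired state to <cgroup_dir>/cgroup.freeze; 0 on success. */
static int request_freeze(const char *cgroup_dir, int freeze)
{
    char path[4096];
    FILE *f;
    int ok;

    if (snprintf(path, sizeof(path), "%s/cgroup.freeze", cgroup_dir) >=
        (int)sizeof(path))
        return -1;
    f = fopen(path, "w");
    if (!f)
        return -1;
    ok = fprintf(f, "%d\n", freeze ? 1 : 0) > 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Read the actual state from <cgroup_dir>/cgroup.events; -1 on error. */
static int read_frozen_state(const char *cgroup_dir)
{
    char path[4096];
    char buf[256];
    size_t n;
    FILE *f;

    if (snprintf(path, sizeof(path), "%s/cgroup.events", cgroup_dir) >=
        (int)sizeof(path))
        return -1;
    f = fopen(path, "r");
    if (!f)
        return -1;
    n = fread(buf, 1, sizeof(buf) - 1, f);
    buf[n] = '\0';
    fclose(f);
    return parse_frozen_field(buf);
}
```

    Since the interface is asynchronous, a caller would write the desired
    state with request_freeze() and then check read_frozen_state() again
    later rather than expect the "frozen" field to change immediately.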

    Signed-off-by: Roman Gushchin
    Signed-off-by: Tejun Heo
    No-objection-from-me-by: Oleg Nesterov
    Cc: kernel-team@fb.com

    Roman Gushchin