07 Jan, 2021

1 commit

  • Free pages in parallel, using the oom_reaper, for a task that receives
    SIGKILL from the ULMK process. Freeing the pages this way returns them
    to the buddy system well in advance.

    Add the boot param reap_mem_when_killed_by=, which names the process
    whose kill signal causes the target's memory to be reaped by the oom
    reaper.

    As an example, when reap_mem_when_killed_by=lmkd, all processes that
    receive the kill signal from lmkd are added to the oom reaper.

    Leaving this param unset disables the feature.
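
    A rough userspace model of the check described above, assuming the
    parameter is compared against the sender's task name. The helper
    should_reap_on_kill() and its arguments are hypothetical and are not
    the kernel's actual interface:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical model: decide whether the victim of a SIGKILL should be
 * handed to the oom reaper. "param" stands in for the value of
 * reap_mem_when_killed_by= and "killer_comm" for the sender's task name.
 */
static bool should_reap_on_kill(const char *param, const char *killer_comm)
{
    /* An unset or empty parameter leaves the feature disabled. */
    if (param == NULL || param[0] == '\0')
        return false;
    if (killer_comm == NULL)
        return false;
    /* Reap only when the sender's name matches the configured name. */
    return strcmp(param, killer_comm) == 0;
}
```

    With param set to "lmkd", a kill sent by a task named "lmkd" would
    qualify in this model, while a kill from any other sender would not.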

    Change-Id: I21adb95de5e380a80d7eb0b87d9b5b553f52e28a
    Bug: 171763461
    Signed-off-by: Charan Teja Reddy
    Signed-off-by: Isaac J. Manjarres

    Charan Teja Reddy
     

06 Nov, 2020

1 commit


03 Nov, 2020

1 commit

  • This testcase

    #include <stdio.h>
    #include <unistd.h>
    #include <signal.h>
    #include <sys/ptrace.h>
    #include <sys/wait.h>
    #include <pthread.h>
    #include <assert.h>

    void *tf(void *arg)
    {
        return NULL;
    }

    int main(void)
    {
        int pid = fork();
        if (!pid) {
            kill(getpid(), SIGSTOP);

            pthread_t th;
            pthread_create(&th, NULL, tf, NULL);

            return 0;
        }

        waitpid(pid, NULL, WSTOPPED);

        ptrace(PTRACE_SEIZE, pid, 0, PTRACE_O_TRACECLONE);
        waitpid(pid, NULL, 0);

        ptrace(PTRACE_CONT, pid, 0,0);
        waitpid(pid, NULL, 0);

        int status;
        int thread = waitpid(-1, &status, 0);
        assert(thread > 0 && thread != pid);
        assert(status == 0x80137f);

        return 0;
    }

    fails and triggers WARN_ON_ONCE(!signr) in do_jobctl_trap().

    This is because task_join_group_stop() has 2 problems when current is traced:

    1. We can't rely on the "JOBCTL_STOP_PENDING" check, a stopped tracee
    can be woken up by debugger and it can clone another thread which
    should join the group-stop.

    We need to check group_stop_count || SIGNAL_STOP_STOPPED.

    2. If SIGNAL_STOP_STOPPED is already set, we should not increment
    sig->group_stop_count and add JOBCTL_STOP_CONSUME. The new thread
    should stop without another do_notify_parent_cldstop() report.

    To clarify, the problem is very old and we should blame
    ptrace_init_task(). But now that we have task_join_group_stop() it makes
    more sense to fix this helper to avoid the code duplication.
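
    The two rules above can be modelled in userspace as follows. This is
    a simplified sketch, not the kernel source: the structs and the
    function join_group_stop() are hypothetical stand-ins for the
    signal_struct state and the jobctl bits named in the text.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for the kernel state named in the text. */
struct group_state {
    unsigned int group_stop_count; /* threads still expected to stop */
    bool stop_stopped;             /* models SIGNAL_STOP_STOPPED */
};

struct new_thread {
    bool stop_pending; /* models JOBCTL_STOP_PENDING */
    bool stop_consume; /* models JOBCTL_STOP_CONSUME */
};

/* Returns true when the new thread joins the group stop. */
static bool join_group_stop(struct group_state *sig, struct new_thread *t)
{
    if (sig->group_stop_count) {
        /* Stop still in progress: the new thread must be counted. */
        sig->group_stop_count++;
        t->stop_consume = true;
    } else if (!sig->stop_stopped) {
        /* No group stop at all: nothing to join. */
        return false;
    }
    /* In both remaining cases the new thread gets a pending stop. */
    t->stop_pending = true;
    return true;
}
```

    When the group is already stopped, the model leaves the count alone
    and does not mark the thread as a consumer, which corresponds to
    stopping without another report to the parent.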

    Reported-by: syzbot+3485e3773f7da290eecc@syzkaller.appspotmail.com
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Cc: Jens Axboe
    Cc: Christian Brauner
    Cc: "Eric W . Biederman"
    Cc: Zhiqiang Liu
    Cc: Tejun Heo
    Cc:
    Link: https://lkml.kernel.org/r/20201019134237.GA18810@redhat.com
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

01 Sep, 2020

1 commit


24 Aug, 2020

1 commit

  • Replace the existing /* fall through */ comments and its variants with
    the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
    fall-through markings when it is the case.

    [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

    Signed-off-by: Gustavo A. R. Silva

    Gustavo A. R. Silva
     

17 Aug, 2020

1 commit


13 Aug, 2020

1 commit

  • If JOBCTL_TASK_WORK is already set on the targeted task, then we need
    not go through {lock,unlock}_task_sighand() to set it again and queue
    a signal wakeup. This is safe as we're checking it _after_ adding the
    new task_work with cmpxchg().

    The ordering is as follows:

    task_work_add()                         get_signal()
    --------------------------------------------------------------
    STORE(task->task_works, new_work);      STORE(task->jobctl);
    mb();                                   mb();
    LOAD(task->jobctl);                     LOAD(task->task_works);

    This speeds up TWA_SIGNAL handling quite a bit, which is important now
    that io_uring is relying on it for all task_work deliveries.
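
    A minimal sketch of the store, barrier, load pattern on the
    task_work_add() side, written with C11 atomics. The struct, the bit
    value and the function names are hypothetical; the real code pushes
    onto a list with cmpxchg() rather than doing a plain store.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_JOBCTL_TASK_WORK 0x1u /* hypothetical bit value */

struct model_task {
    _Atomic(void *) task_works; /* models task->task_works */
    atomic_uint jobctl;         /* models task->jobctl */
};

/*
 * Publish the work first, then look at the flag. Returns true when the
 * caller still has to take the slow path (set the bit and wake the task).
 */
static bool model_task_work_add(struct model_task *t, void *work)
{
    atomic_store_explicit(&t->task_works, work, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst); /* the mb() in the table */
    unsigned int jobctl =
        atomic_load_explicit(&t->jobctl, memory_order_relaxed);
    return !(jobctl & MODEL_JOBCTL_TASK_WORK);
}

/* Slow path: mark the task so that later adders can skip the wakeup. */
static void model_signal_task(struct model_task *t)
{
    atomic_fetch_or(&t->jobctl, MODEL_JOBCTL_TASK_WORK);
}
```

    Because the flag is read only after the work is published, a caller
    that sees the bit already set can rely on the other side observing
    the new work once it gets around to loading task_works.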

    Cc: Peter Zijlstra
    Cc: Jann Horn
    Acked-by: Oleg Nesterov
    Signed-off-by: Jens Axboe

    Jens Axboe
     

12 Aug, 2020

1 commit


27 Jul, 2020

1 commit


01 Jul, 2020

1 commit

  • So that the target task will exit the wait_event_interruptible-like
    loop and call task_work_run() asap.

    The patch turns "bool notify" into 0,TWA_RESUME,TWA_SIGNAL enum, the
    new TWA_SIGNAL flag implies signal_wake_up(). However, it needs to
    avoid the race with recalc_sigpending(), so the patch also adds the
    new JOBCTL_TASK_WORK bit included in JOBCTL_PENDING_MASK.

    TODO: once this patch is merged we need to change all current users
    of task_work_add(notify = true) to use TWA_RESUME.
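
    As a loose illustration of the bool-to-enum change, the sketch below
    maps each mode to the action it implies. All names are hypothetical
    models; only the three modes and the fact that the signal mode
    implies signal_wake_up() come from the description above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the three notification modes named above. */
enum model_notify { MODEL_TWA_NONE = 0, MODEL_TWA_RESUME, MODEL_TWA_SIGNAL };

struct model_actions {
    bool set_notify_resume; /* run the work on the next return to user mode */
    bool signal_wake_up;    /* kick the task out of an interruptible sleep */
};

static struct model_actions model_notify_actions(enum model_notify mode)
{
    struct model_actions a = { false, false };

    switch (mode) {
    case MODEL_TWA_SIGNAL:
        a.signal_wake_up = true;
        break;
    case MODEL_TWA_RESUME:
        a.set_notify_resume = true;
        break;
    case MODEL_TWA_NONE:
        break;
    }
    return a;
}
```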

    Cc: stable@vger.kernel.org # v5.7
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Jens Axboe

    Oleg Nesterov
     

02 Jun, 2020

1 commit

  • Pull uaccess/coredump updates from Al Viro:
    "set_fs() removal in coredump-related area - mostly Christoph's
    stuff..."

    * 'work.set_fs-exec' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    binfmt_elf_fdpic: remove the set_fs(KERNEL_DS) in elf_fdpic_core_dump
    binfmt_elf: remove the set_fs(KERNEL_DS) in elf_core_dump
    binfmt_elf: remove the set_fs in fill_siginfo_note
    signal: refactor copy_siginfo_to_user32
    powerpc/spufs: simplify spufs core dumping
    powerpc/spufs: stop using access_ok
    powerpc/spufs: fix copy_to_user while atomic

    Linus Torvalds
     

06 May, 2020

1 commit

  • Factor out a copy_siginfo_to_external32 helper from
    copy_siginfo_to_user32 that fills out the compat_siginfo, but does so
    on a kernel space data structure. With that we can let architectures
    override copy_siginfo_to_user32 with their own implementations using
    copy_siginfo_to_external32. That allows moving the x32 SIGCHLD purely
    to x86 architecture code.

    As a nice side effect copy_siginfo_to_external32 also comes in handy
    for avoiding a set_fs() call in the coredump code later on.

    Contains improvements from Eric W. Biederman and Arnd Bergmann.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

24 Apr, 2020

1 commit


21 Apr, 2020

1 commit

  • Christof Meerwald writes:
    > Hi,
    >
    > this is probably related to commit
    > 7a0cf094944e2540758b7f957eb6846d5126f535 (signal: Correct namespace
    > fixups of si_pid and si_uid).
    >
    > With a 5.6.5 kernel I am seeing SIGCHLD signals that don't include a
    > properly set si_pid field - this seems to happen for multi-threaded
    > child processes.
    >
    > A simple test program (based on the sample from the signalfd man page):
    >
    > #include <sys/signalfd.h>
    > #include <signal.h>
    > #include <unistd.h>
    > #include <spawn.h>
    > #include <stdlib.h>
    > #include <stdio.h>
    >
    > #define handle_error(msg) \
    >     do { perror(msg); exit(EXIT_FAILURE); } while (0)
    >
    > int main(int argc, char *argv[])
    > {
    >     sigset_t mask;
    >     int sfd;
    >     struct signalfd_siginfo fdsi;
    >     ssize_t s;
    >
    >     sigemptyset(&mask);
    >     sigaddset(&mask, SIGCHLD);
    >
    >     if (sigprocmask(SIG_BLOCK, &mask, NULL) == -1)
    >         handle_error("sigprocmask");
    >
    >     pid_t chldpid;
    >     char *chldargv[] = { "./sfdclient", NULL };
    >     posix_spawn(&chldpid, "./sfdclient", NULL, NULL, chldargv, NULL);
    >
    >     sfd = signalfd(-1, &mask, 0);
    >     if (sfd == -1)
    >         handle_error("signalfd");
    >
    >     for (;;) {
    >         s = read(sfd, &fdsi, sizeof(struct signalfd_siginfo));
    >         if (s != sizeof(struct signalfd_siginfo))
    >             handle_error("read");
    >
    >         if (fdsi.ssi_signo == SIGCHLD) {
    >             printf("Got SIGCHLD %d %d %d %d\n",
    >                    fdsi.ssi_status, fdsi.ssi_code,
    >                    fdsi.ssi_uid, fdsi.ssi_pid);
    >             return 0;
    >         } else {
    >             printf("Read unexpected signal\n");
    >         }
    >     }
    > }
    >
    >
    > and a multi-threaded client to test with:
    >
    > #include <unistd.h>
    > #include <pthread.h>
    >
    > void *f(void *arg)
    > {
    >     sleep(100);
    > }
    >
    > int main()
    > {
    >     pthread_t t[8];
    >
    >     for (int i = 0; i != 8; ++i)
    >     {
    >         pthread_create(&t[i], NULL, f, NULL);
    >     }
    > }
    >
    > I tried to do a bit of debugging and what seems to be happening is
    > that
    >
    > /* From an ancestor pid namespace? */
    > if (!task_pid_nr_ns(current, task_active_pid_ns(t))) {
    >
    > fails inside task_pid_nr_ns because the check for "pid_alive" fails.
    >
    > This code seems to be called from do_notify_parent and there we
    > actually have "tsk != current" (I am assuming both are threads of the
    > current process?)

    I instrumented the code with a warning and received the following backtrace:
    > WARNING: CPU: 0 PID: 777 at kernel/pid.c:501 __task_pid_nr_ns.cold.6+0xc/0x15
    > Modules linked in:
    > CPU: 0 PID: 777 Comm: sfdclient Not tainted 5.7.0-rc1userns+ #2924
    > Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    > RIP: 0010:__task_pid_nr_ns.cold.6+0xc/0x15
    > Code: ff 66 90 48 83 ec 08 89 7c 24 04 48 8d 7e 08 48 8d 74 24 04 e8 9a b6 44 00 48 83 c4 08 c3 48 c7 c7 59 9f ac 82 e8 c2 c4 04 00 0b e9 3fd
    > RSP: 0018:ffffc9000042fbf8 EFLAGS: 00010046
    > RAX: 000000000000000c RBX: 0000000000000000 RCX: ffffc9000042faf4
    > RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff81193d29
    > RBP: ffffc9000042fc18 R08: 0000000000000000 R09: 0000000000000001
    > R10: 000000100f938416 R11: 0000000000000309 R12: ffff8880b941c140
    > R13: 0000000000000000 R14: 0000000000000000 R15: ffff8880b941c140
    > FS: 0000000000000000(0000) GS:ffff8880bca00000(0000) knlGS:0000000000000000
    > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    > CR2: 00007f2e8c0a32e0 CR3: 0000000002e10000 CR4: 00000000000006f0
    > Call Trace:
    > send_signal+0x1c8/0x310
    > do_notify_parent+0x50f/0x550
    > release_task.part.21+0x4fd/0x620
    > do_exit+0x6f6/0xaf0
    > do_group_exit+0x42/0xb0
    > get_signal+0x13b/0xbb0
    > do_signal+0x2b/0x670
    > ? __audit_syscall_exit+0x24d/0x2b0
    > ? rcu_read_lock_sched_held+0x4d/0x60
    > ? kfree+0x24c/0x2b0
    > do_syscall_64+0x176/0x640
    > ? trace_hardirqs_off_thunk+0x1a/0x1c
    > entry_SYSCALL_64_after_hwframe+0x49/0xb3

    The immediate problem is as Christof noticed that "pid_alive(current) == false".
    This happens because do_notify_parent is called from the last thread to exit
    in a process after that thread has been reaped.

    The bigger issue is that do_notify_parent can be called from any
    process that manages to wait on a thread of a multi-threaded process
    from wait_task_zombie. So any logic based upon current for
    do_notify_parent is just nonsense, as current can be pretty much
    anything.

    So change do_notify_parent to call __send_signal directly.

    Inspecting the code it appears this problem has existed since the pid
    namespace support started handling this case in 2.6.30. This fix only
    backports to 7a0cf094944e ("signal: Correct namespace fixups of si_pid and si_uid")
    where the problem logic was moved out of __send_signal and into send_signal.

    Cc: stable@vger.kernel.org
    Fixes: 6588c1e3ff01 ("signals: SI_USER: Masquerade si_pid when crossing pid ns boundary")
    Ref: 921cf9f63089 ("signals: protect cinit from unblocked SIG_DFL signals")
    Link: https://lore.kernel.org/lkml/20200419201336.GI22017@edge.cmeerw.net/
    Reported-by: Christof Meerwald
    Acked-by: Oleg Nesterov
    Acked-by: Christian Brauner
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

13 Apr, 2020

2 commits


03 Apr, 2020

1 commit

  • Pull exec/proc updates from Eric Biederman:
    "This contains two significant pieces of work: the work to sort out
    proc_flush_task, and the work to solve a deadlock between strace and
    exec.

    Fixing proc_flush_task so that it no longer requires a persistent
    mount makes improvements to proc possible. The removal of the
    persistent mount solves an old regression that caused the hidepid
    mount option to only work on remount not on mount. The regression was
    found and reported by the Android folks. This further allows Alexey
    Gladkov's work making proc mount options specific to an individual
    mount of proc to move forward.

    The work on exec starts solving a long standing issue with exec that
    it takes mutexes of blocking userspace applications, which makes exec
    extremely deadlock prone. For the moment this adds a second mutex with
    a narrower scope that handles all of the easy cases. Which makes the
    tricky cases easy to spot. With a little luck the code to solve those
    deadlocks will be ready by next merge window"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (25 commits)
    signal: Extend exec_id to 64bits
    pidfd: Use new infrastructure to fix deadlocks in execve
    perf: Use new infrastructure to fix deadlocks in execve
    proc: io_accounting: Use new infrastructure to fix deadlocks in execve
    proc: Use new infrastructure to fix deadlocks in execve
    kernel/kcmp.c: Use new infrastructure to fix deadlocks in execve
    kernel: doc: remove outdated comment cred.c
    mm: docs: Fix a comment in process_vm_rw_core
    selftests/ptrace: add test cases for dead-locks
    exec: Fix a deadlock in strace
    exec: Add exec_update_mutex to replace cred_guard_mutex
    exec: Move exec_mmap right after de_thread in flush_old_exec
    exec: Move cleanup of posix timers on exec out of de_thread
    exec: Factor unshare_sighand out of de_thread and call it separately
    exec: Only compute current once in flush_old_exec
    pid: Improve the comment about waiting in zap_pid_ns_processes
    proc: Remove the now unnecessary internal mount of proc
    uml: Create a private mount of proc for mconsole
    uml: Don't consult current to find the proc_mnt in mconsole_proc
    proc: Use a list of inodes to flush from proc
    ...

    Linus Torvalds
     

02 Apr, 2020

1 commit

  • Replace the 32bit exec_id with a 64bit exec_id to make it impossible
    to wrap the exec_id counter. With care an attacker can cause exec_id
    wrap and send arbitrary signals to a newly exec'd parent. This
    bypasses the signal sending checks if the parent changes their
    credentials during exec.

    The severity of this problem can be seen in that, in my limited testing
    of a 32bit exec_id, it can take as little as 19s to exec 65536 times.
    Which means that it can take as little as 14 days to wrap a 32bit
    exec_id. Adam Zabrocki has succeeded wrapping the self_exec_id in 7
    days. Even my slower timing is in the uptime of a typical server.
    Which means self_exec_id is simply a speed bump today, and if exec
    gets noticeably faster self_exec_id won't even be a speed bump.

    Extending self_exec_id to 64bits introduces a problem on 32bit
    architectures where reading self_exec_id is no longer atomic and can
    take two read instructions. Which means that it is possible to hit
    a window where the read value of exec_id does not match the written
    value. So with very lucky timing after this change this still
    remains exploitable.

    I have updated the update of exec_id on exec to use WRITE_ONCE
    and the read of exec_id in do_notify_parent to use READ_ONCE
    to make it clear that there is no locking between these two
    locations.
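
    The 14-day figure quoted earlier can be checked with a little
    arithmetic: 2^32 / 65536 = 65536 batches of 19s each, which is
    1245184 seconds or roughly 14.4 days. The helper below is only an
    illustration of that calculation and its name is made up.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Seconds needed to wrap a counter of "bits" bits when "execs" increments
 * take "secs" seconds. Integer arithmetic; requires bits < 64 and execs > 0.
 */
static uint64_t seconds_to_wrap(unsigned int bits, uint64_t execs,
                                uint64_t secs)
{
    assert(bits < 64 && execs > 0);
    uint64_t total = (uint64_t)1 << bits; /* increments until the wrap */
    return total / execs * secs;
}
```

    Calling seconds_to_wrap(32, 65536, 19) and dividing by 86400 gives
    the number of whole days for the 32bit case.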

    Link: https://lore.kernel.org/kernel-hardening/20200324215049.GA3710@pi3.com.pl
    Fixes: 2.3.23pre2
    Cc: stable@vger.kernel.org
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

27 Feb, 2020

1 commit

  • When queueing a signal, we increment both the users count of pending
    signals (for RLIMIT_SIGPENDING tracking) and we increment the refcount
    of the user struct itself (because we keep a reference to the user in
    the signal structure in order to correctly account for it when freeing).

    That turns out to be fairly expensive, because both of them are atomic
    updates, and particularly under extreme signal handling pressure on big
    machines, you can get a lot of cache contention on the user struct.
    That can then cause horrid cacheline ping-pong when you do these
    multiple accesses.

    So change the reference counting to only pin the user for the _first_
    pending signal, and to unpin it when the last pending signal is
    dequeued. That means that when a user sees a lot of concurrent signal
    queuing - which is the only situation when this matters - the only
    atomic access needed is generally the 'sigpending' count update.
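
    A userspace model of that counting scheme, using C11 atomics. The
    struct and function names are hypothetical, and the RLIMIT_SIGPENDING
    check is left out; the point is only that the refcount is touched on
    the 0 -> 1 and 1 -> 0 transitions of the pending count.

```c
#include <assert.h>
#include <stdatomic.h>

/* Simplified stand-in for the user struct described above. */
struct model_user {
    atomic_int refcount;   /* lifetime reference count */
    atomic_int sigpending; /* pending signals charged to this user */
};

/* Queue one signal: only the 0 -> 1 transition pins the user. */
static void model_queue_signal(struct model_user *u)
{
    if (atomic_fetch_add(&u->sigpending, 1) == 0)
        atomic_fetch_add(&u->refcount, 1);
}

/* Dequeue one signal: only the 1 -> 0 transition unpins the user. */
static void model_dequeue_signal(struct model_user *u)
{
    int before = atomic_fetch_sub(&u->sigpending, 1);

    assert(before > 0); /* dequeue without a matching queue is a bug */
    if (before == 1)
        atomic_fetch_sub(&u->refcount, 1);
}
```

    With many signals in flight, every queue and dequeue in this model
    performs one atomic update on sigpending, and the refcount is left
    alone until the pending count drops back to zero.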

    This was noticed because of a particularly odd timing artifact on a
    dual-socket 96C/192T Cascade Lake platform: when you get into bad
    contention, it for some reason seems to be much worse on that machine
    when the contention happens in the upper 32-byte half of the cacheline.

    As a result, the kernel test robot will-it-scale 'signal1' benchmark had
    an odd performance regression simply due to random alignment of the
    'struct user_struct' (and pointed to a completely unrelated and
    apparently nonsensical commit for the regression).

    Avoiding the double increments (and decrements on the dequeueing side,
    of course) makes for much less contention and hugely improved
    performance on that will-it-scale microbenchmark.

    Quoting Feng Tang:

    "It makes a big difference, that the performance score is tripled! bump
    from original 17000 to 54000. Also the gap between 5.0-rc6 and
    5.0-rc6+Jiri's patch is reduced to around 2%"

    [ The "2% gap" is the odd cacheline placement difference on that
    platform: under the extreme contention case, the effect of which half
    of the cacheline was hot was 5%, so with the reduced contention the
    odd timing artifact is reduced too ]

    It does help in the non-contended case too, but is not nearly as
    noticeable.

    Reported-and-tested-by: Feng Tang
    Cc: Eric W. Biederman
    Cc: Huang, Ying
    Cc: Philip Li
    Cc: Andi Kleen
    Cc: Jiri Olsa
    Cc: Peter Zijlstra
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

26 Jan, 2020

1 commit

  • This patch fixes the following sparse errors by annotating the
    sighand_struct with __rcu

    kernel/fork.c:1511:9: error: incompatible types in comparison expression
    kernel/exit.c:100:19: error: incompatible types in comparison expression
    kernel/signal.c:1370:27: error: incompatible types in comparison expression

    This fix introduces the following sparse error in signal.c due to
    checking the sighand pointer without rcu primitives:

    kernel/signal.c:1386:21: error: incompatible types in comparison expression

    This new sparse error is also fixed in this patch.

    Signed-off-by: Madhuparna Bhowmik
    Acked-by: Paul E. McKenney
    Link: https://lore.kernel.org/r/20200124045908.26389-1-madhuparnabhowmik10@gmail.com
    Signed-off-by: Christian Brauner

    Madhuparna Bhowmik
     

11 Oct, 2019

1 commit


17 Sep, 2019

1 commit

  • Pull pidfd/waitid updates from Christian Brauner:
    "This contains two features and various tests.

    First, it adds support for waiting on processes through pidfds by adding
    the P_PIDFD type to the waitid() syscall. This completes the basic
    functionality of the pidfd api (cf. [1]). In the meantime we also have
    a new addition to the userspace projects that make use of the pidfd
    api. The qt project was nice enough to send a mail pointing out that
    they have a pr up to switch to the pidfd api (cf. [2]).

    Second, this tag contains an extension to the waitid() syscall to make
    it possible to wait on the current process group in a race free manner
    (even though the actual problem is very unlikely) by specifying 0
    together with the P_PGID type. This extension traces back to a
    discussion on the glibc development mailing list.

    There are also a range of tests for the features above. Additionally,
    the test-suite which detected the pidfd-polling race we fixed in [3]
    is included in this tag"

    [1] https://lwn.net/Articles/794707/
    [2] https://codereview.qt-project.org/c/qt/qtbase/+/108456
    [3] commit b191d6491be6 ("pidfd: fix a poll race when setting exit_state")

    * tag 'core-process-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    waitid: Add support for waiting for the current process group
    tests: add pidfd poll tests
    tests: move common definitions and functions into pidfd.h
    pidfd: add pidfd_wait tests
    pidfd: add P_PIDFD to waitid()

    Linus Torvalds
     

19 Aug, 2019

1 commit

  • My recent change to only use force_sig for synchronous events
    wound up breaking signal reception in cifs and drbd. I had overlooked
    the fact that by default kthreads start out with all signals set to
    SIG_IGN. So a change I thought was safe turned out to have made it
    impossible for those kernel threads to catch their signals.

    Reverting the work on force_sig is a bad idea because what the code
    was doing was very much a misuse of force_sig. As the way force_sig
    ultimately allowed the signal to happen was to change the signal
    handler to SIG_DFL. Which after the first signal will allow userspace
    to send signals to these kernel threads. At least for
    wake_ack_receiver in drbd that does not appear actively wrong.

    So correct this problem by adding allow_kernel_signal that will allow
    signals whose siginfo reports they were sent by the kernel through,
    but will not allow userspace generated signals, and update cifs and
    drbd to call allow_kernel_signal in an appropriate place so that their
    thread can receive this signal.

    Fixing things this way ensures that userspace won't be able to send
    signals and cause problems, that it is clear which signals the
    threads are expecting to receive, and it guarantees that nothing
    else in the system will be affected.

    This change was partly inspired by similar cifs and drbd patches that
    added allow_signal.

    Reported-by: ronnie sahlberg
    Reported-by: Christoph Böhmwalder
    Tested-by: Christoph Böhmwalder
    Cc: Steve French
    Cc: Philipp Reisner
    Cc: David Laight
    Fixes: 247bc9470b1e ("cifs: fix rmmod regression in cifs.ko caused by force_sig changes")
    Fixes: 72abe3bcf091 ("signal/cifs: Fix cifs_put_tcp_session to call send_sig instead of force_sig")
    Fixes: fee109901f39 ("signal/drbd: Use send_sig not force_sig")
    Fixes: 3cf5d076fb4d ("signal: Remove task parameter from force_sig")
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

03 Aug, 2019

1 commit

  • The kernel-doc parser doesn't handle expressions with %foo*. Instead,
    when an asterisk should be part of a constant, it uses an alternative
    notation: `foo*`.

    Link: http://lkml.kernel.org/r/7f18c2e0b5e39e6b7eb55ddeb043b8b260b49f2d.1563361575.git.mchehab+samsung@kernel.org
    Signed-off-by: Mauro Carvalho Chehab
    Cc: Deepa Dinamani
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mauro Carvalho Chehab
     

02 Aug, 2019

1 commit

  • This adds the P_PIDFD type to waitid().
    One of the last remaining bits for the pidfd api is to make it possible
    to wait on pidfds. With P_PIDFD added to waitid() the parts of userspace
    that want to use the pidfd api to exclusively manage processes can do so
    now.

    One of the things this will unblock in the future is the ability to make
    it possible to retrieve the exit status via waitid(P_PIDFD) for
    non-parent processes if handed a _suitable_ pidfd that has this feature
    set. This is similar to what you can do on FreeBSD with kqueue(). It
    might even end up being possible to wait on a process as a non-parent if
    an appropriate property is enabled on the pidfd.

    With P_PIDFD no scoping of the process identified by the pidfd is
    possible, i.e. it explicitly blocks things such as wait4(-1), wait4(0),
    waitid(P_ALL), waitid(P_PGID) etc. It only allows for semantics
    equivalent to wait4(pid), waitid(P_PID). Users that need scoping should
    rely on pid-based wait*() syscalls for now.
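
    A minimal userspace sketch of waiting on a child through a pidfd. It
    assumes a kernel and libc that expose pidfd_open() as a raw syscall
    number (which arrived separately from this commit) and falls back to
    returning -1 otherwise. The helper name is hypothetical, and the
    fallback definition of P_PIDFD is only for older headers.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for syscall() */
#endif
#include <assert.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef P_PIDFD
#define P_PIDFD 3 /* value from the Linux UAPI headers, for older libcs */
#endif

/*
 * Wait for child "pid" through a pidfd and return its exit status, or -1 if
 * pidfd support is missing or the child did not exit normally.
 */
static int wait_exit_status_via_pidfd(pid_t pid)
{
#ifdef SYS_pidfd_open
    int pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
    if (pidfd < 0)
        return -1;

    siginfo_t info = { 0 };
    int rc = waitid(P_PIDFD, (id_t)pidfd, &info, WEXITED);
    close(pidfd);
    if (rc < 0 || info.si_code != CLD_EXITED)
        return -1;
    return info.si_status;
#else
    (void)pid;
    return -1;
#endif
}
```

    As the text notes, this only gives semantics equivalent to
    waitid(P_PID): the pidfd names exactly one process, so there is no
    way to express "any child" or "this process group" through it.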

    Signed-off-by: Christian Brauner
    Reviewed-by: Kees Cook
    Reviewed-by: Oleg Nesterov
    Cc: Arnd Bergmann
    Cc: "Eric W. Biederman"
    Cc: Joel Fernandes (Google)
    Cc: Thomas Gleixner
    Cc: David Howells
    Cc: Jann Horn
    Cc: Andy Lutomirsky
    Cc: Andrew Morton
    Cc: Aleksa Sarai
    Cc: Linus Torvalds
    Cc: Al Viro
    Link: https://lore.kernel.org/r/20190727222229.6516-2-christian@brauner.io

    Christian Brauner
     

29 Jul, 2019

1 commit

  • Previously a condition got missed where the pidfd waiters are awakened
    before the exit_state gets set. This can result in a missed notification
    [1] and the polling thread waiting forever.

    It is fixed now, however it would be nice to avoid this kind of issue
    going unnoticed in the future. So just add a warning to catch it in the
    future.

    /* References */
    [1]: https://lore.kernel.org/lkml/20190717172100.261204-1-joel@joelfernandes.org/

    Signed-off-by: Joel Fernandes (Google)
    Link: https://lore.kernel.org/r/20190724164816.201099-1-joel@joelfernandes.org
    Signed-off-by: Christian Brauner

    Joel Fernandes (Google)
     

17 Jul, 2019

1 commit

  • task->saved_sigmask and ->restore_sigmask are only used in the ret-from-
    syscall paths. This means that set_user_sigmask() can save ->blocked in
    ->saved_sigmask and do set_restore_sigmask() to indicate that ->blocked
    was modified.

    This way the callers do not need 2 sigset_t's passed to set/restore and
    restore_user_sigmask() renamed to restore_saved_sigmask_unless() turns
    into the trivial helper which just calls restore_saved_sigmask().

    Link: http://lkml.kernel.org/r/20190606113206.GA9464@redhat.com
    Signed-off-by: Oleg Nesterov
    Cc: Deepa Dinamani
    Cc: Arnd Bergmann
    Cc: Jens Axboe
    Cc: Davidlohr Bueso
    Cc: Eric Wong
    Cc: Jason Baron
    Cc: Thomas Gleixner
    Cc: Al Viro
    Cc: Eric W. Biederman
    Cc: David Laight
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

11 Jul, 2019

1 commit

  • Pull pidfd updates from Christian Brauner:
    "This adds two main features.

    - First, it adds polling support for pidfds. This allows process
    managers to know when a (non-parent) process dies in a race-free
    way.

    The notification mechanism used follows the same logic that is
    currently used when the parent of a task is notified of a child's
    death. With this patchset it is possible to put pidfds in an
    {e}poll loop and get reliable notifications for process (i.e.
    thread-group) exit.

    - The second feature complements the first one by making it possible
    to retrieve pollable pidfds for processes that were not created
    using CLONE_PIDFD.

    A lot of processes get created with traditional PID-based calls
    such as fork() or clone() (without CLONE_PIDFD). For these
    processes a caller can currently not create a pollable pidfd. This
    is a problem for Android's low memory killer (LMK) and service
    managers such as systemd.

    Both patchsets are accompanied by selftests.

    It's perhaps worth noting that the work done so far and the work done
    in this branch for pidfd_open() and polling support do already see
    some adoption:

    - Android is in the process of backporting this work to all their LTS
    kernels [1]

    - Service managers make use of pidfd_send_signal but will need to
    wait until we enable waiting on pidfds for full adoption.

    - And projects I maintain make use of both pidfd_send_signal and
    CLONE_PIDFD [2] and will use polling support and pidfd_open() too"

    [1] https://android-review.googlesource.com/q/topic:%22pidfd+polling+support+4.9+backport%22
    https://android-review.googlesource.com/q/topic:%22pidfd+polling+support+4.14+backport%22
    https://android-review.googlesource.com/q/topic:%22pidfd+polling+support+4.19+backport%22

    [2] https://github.com/lxc/lxc/blob/aab6e3eb73c343231cdde775db938994fc6f2803/src/lxc/start.c#L1753

    * tag 'pidfd-updates-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    tests: add pidfd_open() tests
    arch: wire-up pidfd_open()
    pid: add pidfd_open()
    pidfd: add polling selftests
    pidfd: add polling support
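
    A minimal sketch of how the two features combine from userspace:
    obtain a pidfd for an existing PID with pidfd_open() and put it in a
    poll() call. It assumes the libc headers define SYS_pidfd_open; the
    helper name and its return convention are hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for syscall() */
#endif
#include <assert.h>
#include <poll.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Block until process "pid" exits, for at most timeout_ms milliseconds.
 * Returns 1 if the exit was observed, 0 on timeout, and -1 if
 * pidfd_open() is unavailable or a call fails.
 */
static int wait_for_exit_via_poll(pid_t pid, int timeout_ms)
{
#ifdef SYS_pidfd_open
    int pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
    if (pidfd < 0)
        return -1;

    /* A pidfd becomes readable once the process has exited. */
    struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
    int n = poll(&pfd, 1, timeout_ms);
    close(pidfd);
    if (n < 0)
        return -1;
    return (n == 1 && (pfd.revents & POLLIN)) ? 1 : 0;
#else
    (void)pid;
    (void)timeout_ms;
    return -1;
#endif
}
```

    Unlike waitid(), polling does not reap the process, so a non-parent
    such as a process manager can use it purely as an exit notification.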

    Linus Torvalds
     

09 Jul, 2019

2 commits

  • …iederm/user-namespace

    Pull force_sig() argument change from Eric Biederman:
    "A source of error over the years has been that force_sig has taken a
    task parameter when it is only safe to use force_sig with the current
    task.

    The force_sig function is built for delivering synchronous signals
    such as SIGSEGV where the userspace application caused a synchronous
    fault (such as a page fault) and the kernel responded with a signal.

    Because the name force_sig does not make this clear, and because the
    force_sig takes a task parameter the function force_sig has been
    abused for sending other kinds of signals over the years. Slowly those
    have been fixed when the oopses have been tracked down.

    This set of changes fixes the remaining abusers of force_sig and
    carefully rips out the task parameter from force_sig and friends
    making this kind of error almost impossible in the future"

    * 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (27 commits)
    signal/x86: Move tsk inside of CONFIG_MEMORY_FAILURE in do_sigbus
    signal: Remove the signal number and task parameters from force_sig_info
    signal: Factor force_sig_info_to_task out of force_sig_info
    signal: Generate the siginfo in force_sig
    signal: Move the computation of force into send_signal and correct it.
    signal: Properly set TRACE_SIGNAL_LOSE_INFO in __send_signal
    signal: Remove the task parameter from force_sig_fault
    signal: Use force_sig_fault_to_task for the two calls that don't deliver to current
    signal: Explicitly call force_sig_fault on current
    signal/unicore32: Remove tsk parameter from __do_user_fault
    signal/arm: Remove tsk parameter from __do_user_fault
    signal/arm: Remove tsk parameter from ptrace_break
    signal/nds32: Remove tsk parameter from send_sigtrap
    signal/riscv: Remove tsk parameter from do_trap
    signal/sh: Remove tsk parameter from force_sig_info_fault
    signal/um: Remove task parameter from send_sigtrap
    signal/x86: Remove task parameter from send_sigtrap
    signal: Remove task parameter from force_sig_mceerr
    signal: Remove task parameter from force_sig
    signal: Remove task parameter from force_sigsegv
    ...

    Linus Torvalds
     
  • Pull audit updates from Paul Moore:
    "This pull request is a bit early, but with some vacation time coming
    up I wanted to send this out now just in case the remote Internet Gods
    decide not to smile on me once the merge window opens. The patchset
    for v5.3 is pretty minor this time, the highlights include:

    - When the audit daemon is sent a signal, ensure we deliver
    information about the sender even when syscall auditing is not
    enabled/supported.

    - Add the ability to filter audit records based on network address
    family.

    - Tighten the audit field filtering restrictions on string based
    fields.

    - Cleanup the audit field filtering verification code.

    - Remove a few BUG() calls from the audit code"

    * tag 'audit-pr-20190702' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
    audit: remove the BUG() calls in the audit rule comparison functions
    audit: enforce op for string fields
    audit: add saddr_fam filter field
    audit: re-structure audit field valid checks
    audit: deliver signal_info regarless of syscall

    Linus Torvalds
     

29 Jun, 2019

1 commit

  • This is the minimal fix for stable, I'll send cleanups later.

    Commit 854a6ed56839 ("signal: Add restore_user_sigmask()") introduced
    a visible change which breaks user-space: a signal temporarily unblocked
    by set_user_sigmask() can be delivered even if the caller returns
    success or timeout.

    Change restore_user_sigmask() to accept the additional "interrupted"
    argument which should be used instead of signal_pending() check, and
    update the callers.

    Eric said:

    : For clarity. I don't think this is required by posix, or fundamentally to
    : remove the races in select. It is what linux has always done and we have
    : applications who care so I agree this fix is needed.
    :
    : Further in any case where the semantic change that this patch rolls back
    : (aka where allowing a signal to be delivered and the select like call to
    : complete) would be an advantage we can do as well if not better by using
    : signalfd.
    :
    : Michael is there any chance we can get this guarantee of the linux
    : implementation of pselect and friends clearly documented. The guarantee
    : that if the system call completes successfully we are guaranteed that no
    : signal that is unblocked by using sigmask will be delivered?
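
The guarantee discussed above can be observed from user space. The following is a minimal sketch, not part of the patch: the helper name wait_with_temporary_unblock() and the choice of SIGUSR1 are assumptions made for illustration. It blocks SIGUSR1, optionally leaves one instance pending, and calls pselect() with a mask that unblocks the signal only for the duration of the call.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/select.h>
#include <time.h>

/* Set by the handler so callers can tell whether the signal was delivered. */
static volatile sig_atomic_t handler_ran;

static void on_usr1(int signo)
{
    (void)signo;
    handler_ran = 1;
}

/*
 * Hypothetical helper: block SIGUSR1, optionally make one instance pending,
 * then call pselect() with a mask that unblocks SIGUSR1 during the call.
 * Returns pselect()'s return value with errno as pselect() left it.
 */
static int wait_with_temporary_unblock(int make_pending, long timeout_ms)
{
    struct sigaction sa = { 0 };
    sigset_t block, orig, during_call;
    struct timespec ts;
    int rc, saved_errno;

    assert(timeout_ms >= 0);

    sa.sa_handler = on_usr1;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) < 0)
        return -1;

    /* Keep SIGUSR1 blocked outside of the pselect() call. */
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &block, &orig) < 0)
        return -1;

    /* Mask used inside pselect(): the original mask minus SIGUSR1. */
    during_call = orig;
    sigdelset(&during_call, SIGUSR1);

    handler_ran = 0;
    if (make_pending)
        raise(SIGUSR1);         /* stays pending because it is blocked */

    ts.tv_sec = timeout_ms / 1000;
    ts.tv_nsec = (timeout_ms % 1000) * 1000000L;

    /* No fds are watched: the call ends by timeout or by a signal. */
    rc = pselect(0, NULL, NULL, NULL, &ts, &during_call);
    saved_errno = errno;

    sigprocmask(SIG_SETMASK, &orig, NULL);
    errno = saved_errno;
    return rc;
}
```

With no signal pending the call should time out and return 0 without the handler having run; with a pending SIGUSR1 it should return -1 with errno set to EINTR after the handler has run. The behaviour this fix restores is that the handler never runs in the first case.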

    Link: http://lkml.kernel.org/r/20190604134117.GA29963@redhat.com
    Fixes: 854a6ed56839a40f6b5d02a2962f48841482eec4 ("signal: Add restore_user_sigmask()")
    Signed-off-by: Oleg Nesterov
    Reported-by: Eric Wong
    Tested-by: Eric Wong
    Acked-by: "Eric W. Biederman"
    Acked-by: Arnd Bergmann
    Acked-by: Deepa Dinamani
    Cc: Michael Kerrisk
    Cc: Jens Axboe
    Cc: Davidlohr Bueso
    Cc: Jason Baron
    Cc: Thomas Gleixner
    Cc: Al Viro
    Cc: David Laight
    Cc: [5.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

28 Jun, 2019

1 commit

  • This patch adds polling support to pidfd.

    Android low memory killer (LMK) needs to know when a process dies once
    it is sent the kill signal. It does so by checking for the existence of
    /proc/pid which is both racy and slow. For example, if a PID is reused
    between when LMK sends a kill signal and when it checks for existence of
    the PID, the wrong process is possibly checked for existence.
    Using the polling support, LMK will be able to get notified when a process
    exits in a race-free and fast way, and can do other things (such as
    polling on other fds) while waiting for the process being killed to die.

    For notification to polling processes, we follow the same existing
    mechanism in the kernel used when the parent of the task group is to be
    notified of a child's death (do_notify_parent). This is precisely when the
    tasks waiting on a poll of pidfd are also awakened in this patch.

    We have decided to include the waitqueue in struct pid for the following
    reasons:
    1. The wait queue has to survive for the lifetime of the poll. Including
    it in task_struct would not be an option in this case because the task can
    be reaped and destroyed before the poll returns.

    2. Including the waitqueue in struct pid means that during
    de_thread(), the new thread group leader automatically gets the new
    waitqueue/pid even though its task_struct is different.

    Appropriate test cases are added in the second patch to provide coverage of
    all the cases the patch is handling.
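
From user space, the polling support described above might be used roughly as follows. This is a sketch under assumptions, not code from the patch: it assumes a kernel that also provides the pidfd_open and pidfd_send_signal system calls, calls them through syscall() because older C libraries have no wrappers, and the helper names pidfd_open_raw() and kill_and_wait() are made up for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical thin wrapper around the raw pidfd_open system call. */
static int pidfd_open_raw(pid_t pid)
{
#ifdef SYS_pidfd_open
    return (int)syscall(SYS_pidfd_open, pid, 0U);
#else
    errno = ENOSYS;
    return -1;
#endif
}

/*
 * Hypothetical helper: send SIGKILL through a pidfd and wait until the
 * pidfd becomes readable, which signals that the process has exited.
 * Returns 0 on exit notification, -1 on error or timeout (errno set).
 */
static int kill_and_wait(pid_t pid, int timeout_ms)
{
    struct pollfd pfd;
    int pidfd, rc, saved_errno;

    assert(pid > 0);

    pidfd = pidfd_open_raw(pid);
    if (pidfd < 0)
        return -1;

#ifdef SYS_pidfd_send_signal
    /* Signalling through the pidfd cannot hit a recycled PID. */
    rc = (int)syscall(SYS_pidfd_send_signal, pidfd, SIGKILL, NULL, 0U);
#else
    rc = kill(pid, SIGKILL);
#endif
    if (rc < 0) {
        saved_errno = errno;
        close(pidfd);
        errno = saved_errno;
        return -1;
    }

    pfd.fd = pidfd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    /* Other fds could be added to the same poll set here. */
    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    saved_errno = (rc == 0) ? ETIMEDOUT : errno;
    close(pidfd);
    if (rc <= 0) {
        errno = saved_errno;
        return -1;
    }
    return 0;
}
```

The notification only says that the process has exited. A parent still has to reap its child with waitpid() afterwards, and the sketch leaves that to the caller.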

    Cc: Andy Lutomirski
    Cc: Steven Rostedt
    Cc: Daniel Colascione
    Cc: Jann Horn
    Cc: Tim Murray
    Cc: Jonathan Kowalski
    Cc: Linus Torvalds
    Cc: Al Viro
    Cc: Kees Cook
    Cc: David Howells
    Cc: Oleg Nesterov
    Cc: kernel-team@android.com
    Reviewed-by: Oleg Nesterov
    Co-developed-by: Daniel Colascione
    Signed-off-by: Daniel Colascione
    Signed-off-by: Joel Fernandes (Google)
    Signed-off-by: Christian Brauner

    Joel Fernandes (Google)
     

05 Jun, 2019

1 commit

  • Improve the comments for pidfd_send_signal().
    First, the comment still referred to a file descriptor for a process as a
    "task file descriptor" which stems from way back at the beginning of the
    discussion. Replace this with "pidfd" for consistency.
    Second, the wording for the explanation of the arguments to the syscall
    was a bit inconsistent, e.g. some used the past tense while others used
    the present tense. Make the wording more consistent.

    Signed-off-by: Christian Brauner

    Christian Brauner
     

02 Jun, 2019

1 commit

  • In the fixes commit, removing SIGKILL from each thread signal mask and
    executing "goto fatal" directly will skip the call to
    "trace_signal_deliver". At this point, the delivery tracking of the
    SIGKILL signal will be inaccurate.

    Therefore, we need to add trace_signal_deliver before "goto fatal" after
    executing sigdelset.

    Note: SEND_SIG_NOINFO matches the fact that SIGKILL doesn't have any info.

    Link: http://lkml.kernel.org/r/20190425025812.91424-1-weizhenliang@huawei.com
    Fixes: cf43a757fd4944 ("signal: Restore the stop PTRACE_EVENT_EXIT")
    Signed-off-by: Zhenliang Wei
    Reviewed-by: Christian Brauner
    Reviewed-by: Oleg Nesterov
    Cc: Eric W. Biederman
    Cc: Ivan Delalande
    Cc: Arnd Bergmann
    Cc: Thomas Gleixner
    Cc: Deepa Dinamani
    Cc: Greg Kroah-Hartman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhenliang Wei
     

29 May, 2019

6 commits

  • force_sig_info always delivers to the current task and the signal
    parameter always matches info.si_signo. So remove those parameters to
    make it a simpler, less error-prone interface, and to make it clear
    that none of the callers are doing anything clever.

    This guarantees that force_sig_info will not grow any new buggy
    callers that attempt to call force_sig on a non-current task, or that
    pass a signal number that does not match info.si_signo.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • All callers of force_sig_info pass info.si_signo in for the signal
    by definition as well as in practice.

    Further all callers of force_sig_info except force_sig_fault_to_task
    pass current as the target task to force_sig_info.

    Factor out a static force_sig_info_to_task that
    force_sig_fault_to_task can call.

    This prepares the way for force_sig_info to have its task and signal
    parameters removed.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • In preparation for removing the special case in force_sig_info for
    only having a signal number, generate an appropriate siginfo in
    force_sig, the last caller of force_sig_info that does not
    pass a filled out siginfo.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Forcing a signal or not allowing a pid namespace init to ignore
    SIGKILL or SIGSTOP is more cleanly computed in send_signal.

    There are two cases where we don't allow a pid namespace init
    to ignore SIGKILL or SIGSTOP: if the sending process is
    from an ancestor pid namespace and as such is effectively
    the god to the target process, and if it is the kernel
    that is sending the signal, not another application.

    It is known that a process is from an ancestor pid namespace if
    it can see its target but its target does not have a pid for
    the sender in its pid namespace.

    It is known that a signal is sent from the kernel if si_code is set to
    SI_KERNEL or info is SEND_SIG_PRIV (which ultimately generates
    a signal with si_code == SI_KERNEL).

    The only signals that matter are SIGKILL and SIGSTOP neither of
    which can really be caught, and both of which always have a siginfo
    layout that includes si_uid and si_pid. Therefore we never need
    to worry about forcing a signal when si_pid and si_uid are absent.

    So handle the two special cases of info and the case when si_pid and
    si_uid are present.
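
The decision described above can be modelled in a few lines of user-space C. This is a simplified sketch and not the kernel code: struct signal_request, enum info_kind and must_force() are hypothetical names, and the handling of the SEND_SIG_NOINFO case (a user-sent signal without siginfo, forced only when it comes from an ancestor namespace) is an assumption about the second "special case of info".

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>     /* SI_KERNEL, SI_USER */
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the three shapes of the "info" argument. */
enum info_kind {
    INFO_NOINFO,    /* models SEND_SIG_NOINFO */
    INFO_PRIV,      /* models SEND_SIG_PRIV */
    INFO_FILLED     /* a real, filled out siginfo */
};

struct signal_request {
    enum info_kind kind;
    int si_code;                    /* only meaningful for INFO_FILLED */
    bool has_pid_and_uid;           /* layout carries si_pid and si_uid */
    int sender_pid_in_target_ns;    /* 0: sender has no pid there */
};

/* Should a pid namespace init be prevented from ignoring this signal? */
static bool must_force(const struct signal_request *req)
{
    bool from_ancestor;

    assert(req != NULL);

    /* The target cannot name the sender: it lives in an ancestor ns. */
    from_ancestor = (req->sender_pid_in_target_ns == 0);

    switch (req->kind) {
    case INFO_NOINFO:
        return from_ancestor;
    case INFO_PRIV:
        return true;                /* kernel generated signal */
    case INFO_FILLED:
        if (!req->has_pid_and_uid)
            return false;           /* never SIGKILL or SIGSTOP */
        return req->si_code == SI_KERNEL || from_ancestor;
    }
    return false;
}
```

In this model, a request of kind INFO_FILLED with si_code SI_USER and a non-zero sender pid is not forced, while the same request with a sender pid of 0 is.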

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Any time siginfo is not stored in the signal queue, information is
    lost. Therefore set TRACE_SIGNAL_LOSE_INFO every time the code does
    not allocate a signal queue entry and a queue overflow abort is not
    triggered.

    Fixes: ba005e1f4172 ("tracepoint: Add signal loss events")
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • As synchronous exceptions really only make sense against the current
    task (otherwise how are you synchronous?), remove the task parameter
    from force_sig_fault to make it explicit that is what is going
    on.

    The two known exceptions that deliver a synchronous exception to a
    stopped ptraced task have already been changed to
    force_sig_fault_to_task.

    The callers have been changed with the following emacs regular expression
    (with obvious variations on the architectures that take more arguments)
    to avoid typos:

    force_sig_fault[(]\([^,]+\)[,]\([^,]+\)[,]\([^,]+\)[,]\W+current[)]
    ->
    force_sig_fault(\1,\2,\3)

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman