25 Jan, 2021

1 commit


09 Jan, 2021

1 commit

  • [ Upstream commit f7cfd871ae0c5008d94b6f66834e7845caa93c15 ]

    Recently syzbot reported[0] that there is a deadlock amongst the users
    of exec_update_mutex. The problematic lock ordering found by lockdep
    was:

    perf_event_open (exec_update_mutex -> ovl_i_mutex)
    chown (ovl_i_mutex -> sb_writes)
    sendfile (sb_writes -> p->lock)
    by reading from a proc file and writing to overlayfs
    proc_pid_syscall (p->lock -> exec_update_mutex)

    While looking at possible solutions it occurred to me that all of the
    users and possible users involved only wanted the state of the given
    process to remain the same. They are all readers. The only writer is
    exec.

    There is no reason for readers to block on each other. So fix
    this deadlock by transforming exec_update_mutex into a rw_semaphore
    named exec_update_lock that only exec takes for writing.
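
    The admission rules that make this fix work can be sketched in plain C.
    This is a toy, single-threaded model with hypothetical names
    (toy_rwlock, toy_try_read, toy_try_write); it is not the kernel's
    rw_semaphore and it only illustrates that readers never exclude each
    other while a writer excludes everyone.

```c
#include <stdbool.h>

/* Toy model of reader/writer admission; all names here are hypothetical. */
struct toy_rwlock {
    int readers;   /* number of readers currently holding the lock */
    bool writer;   /* true while the single writer (exec, in the commit's terms) holds it */
};

/* A reader is admitted whenever no writer holds the lock. */
bool toy_try_read(struct toy_rwlock *l)
{
    if (l->writer)
        return false;
    l->readers++;
    return true;
}

/* A writer is admitted only when nobody else holds the lock. */
bool toy_try_write(struct toy_rwlock *l)
{
    if (l->writer || l->readers > 0)
        return false;
    l->writer = true;
    return true;
}

void toy_read_unlock(struct toy_rwlock *l)
{
    l->readers--;
}

void toy_write_unlock(struct toy_rwlock *l)
{
    l->writer = false;
}
```

    In this model two calls to toy_try_read() in a row both succeed, which
    is the property that removes the reader-on-reader edge from the lock
    ordering listed above.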

    Cc: Jann Horn
    Cc: Vasiliy Kulikov
    Cc: Al Viro
    Cc: Bernd Edlinger
    Cc: Oleg Nesterov
    Cc: Christopher Yeoh
    Cc: Cyrill Gorcunov
    Cc: Sargun Dhillon
    Cc: Christian Brauner
    Cc: Arnd Bergmann
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Fixes: eea9673250db ("exec: Add exec_update_mutex to replace cred_guard_mutex")
    [0] https://lkml.kernel.org/r/00000000000063640c05ade8e3de@google.com
    Reported-by: syzbot+db9cdf3dd1f64252c6ef@syzkaller.appspotmail.com
    Link: https://lkml.kernel.org/r/87ft4mbqen.fsf@x220.int.ebiederm.org
    Signed-off-by: Eric W. Biederman
    Signed-off-by: Sasha Levin

    Eric W. Biederman
     

26 Oct, 2020

1 commit


25 Oct, 2020

1 commit


19 Oct, 2020

1 commit

  • The process_madvise syscall needs the pidfd_get_pid function to translate
    a pidfd to a pid, so this patch moves the function to kernel/pid.c.

    Suggested-by: Alexander Duyck
    Signed-off-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Reviewed-by: Suren Baghdasaryan
    Reviewed-by: Alexander Duyck
    Reviewed-by: Vlastimil Babka
    Acked-by: Christian Brauner
    Acked-by: David Rientjes
    Cc: Jens Axboe
    Cc: Jann Horn
    Cc: Brian Geffon
    Cc: Daniel Colascione
    Cc: Joel Fernandes
    Cc: Johannes Weiner
    Cc: John Dias
    Cc: Kirill Tkhai
    Cc: Michal Hocko
    Cc: Oleksandr Natalenko
    Cc: Sandeep Patil
    Cc: SeongJae Park
    Cc: Shakeel Butt
    Cc: Sonny Rao
    Cc: Tim Murray
    Cc: Christian Brauner
    Cc: Florian Weimer
    Cc:
    Link: http://lkml.kernel.org/r/20200302193630.68771-5-minchan@kernel.org
    Link: http://lkml.kernel.org/r/20200622192900.22757-3-minchan@kernel.org
    Link: https://lkml.kernel.org/r/20200901000633.1920247-3-minchan@kernel.org
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

04 Sep, 2020

1 commit

  • Introduce PIDFD_NONBLOCK to support non-blocking pidfd file descriptors.

    Ever since the introduction of pidfds and more advanced async io various
    programming languages such as Rust have grown support for async event
    libraries. These libraries are created to help build epoll-based event loops
    around file descriptors. A common pattern is to automatically set
    O_NONBLOCK on all file descriptors they manage.

    For such libraries the EAGAIN error code is treated specially. When a function
    is called that returns EAGAIN the function isn't called again until the event
    loop indicates the file descriptor is ready. Supporting EAGAIN when
    waiting on pidfds makes such libraries just work with little effort. In the
    following patch we will extend waitid() internally to support non-blocking
    pidfds.

    This introduces a new flag PIDFD_NONBLOCK that is equivalent to O_NONBLOCK.
    This follows the same patterns we have for other (anon inode) file descriptors
    such as EFD_NONBLOCK, IN_NONBLOCK, SFD_NONBLOCK, TFD_NONBLOCK and the same for
    close-on-exec flags.
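
    The EAGAIN convention that such event loops rely on can be sketched in
    userspace C. As an assumption made to keep the example runnable on any
    kernel, it uses a non-blocking pipe as a stand-in for a pidfd; the
    helper names (make_nonblocking_pipe, try_read_byte) are hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Creates a pipe and sets O_NONBLOCK on both ends; returns 0 on success. */
int make_nonblocking_pipe(int fds[2])
{
    if (pipe(fds) < 0)
        return -1;
    for (int i = 0; i < 2; i++) {
        int flags = fcntl(fds[i], F_GETFL);
        if (flags < 0 || fcntl(fds[i], F_SETFL, flags | O_NONBLOCK) < 0)
            return -1;
    }
    return 0;
}

/*
 * Returns 1 if a byte was read, 0 if the descriptor is not ready yet
 * (EAGAIN or EWOULDBLOCK), and -1 on end-of-file or any other error.
 */
int try_read_byte(int fd, char *out)
{
    ssize_t n = read(fd, out, 1);

    if (n == 1)
        return 1;
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return 0;   /* the event loop should wait for readiness and retry */
    return -1;
}
```

    An event loop would treat the 0 result as "register for readiness and
    call again later" rather than as a failure, which is the behaviour the
    message above wants for waiting on pidfds.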

    Suggested-by: Josh Triplett
    Signed-off-by: Christian Brauner
    Reviewed-by: Josh Triplett
    Reviewed-by: Oleg Nesterov
    Cc: Kees Cook
    Cc: Sargun Dhillon
    Cc: Oleg Nesterov
    Link: https://lore.kernel.org/lkml/20200811181236.GA18763@localhost/
    Link: https://github.com/joshtriplett/async-pidfd
    Link: https://lore.kernel.org/r/20200902102130.147672-2-christian.brauner@ubuntu.com

    Christian Brauner
     

12 Aug, 2020

1 commit

  • - Add EXPORT_SYMBOL_GPL for find_task_by_vpid() so that drivers
    that use it can be built as loadable modules.

    - This API is required by a loadable driver module from Samsung that
    reads process-related information based on pid and thread id, such
    as when a certain process or thread was started, how long it has
    run, and the average load it contributed.

    Signed-off-by: Abhilasha Rao
    Bug: 158067689
    Change-Id: I0db9cc50c93eedff0f3e9dea0ac09a5d17d118f0

    Abhilasha Rao
     

05 Aug, 2020

1 commit

  • …rnel/git/brauner/linux

    Pull checkpoint-restore updates from Christian Brauner:
    "This enables unprivileged checkpoint/restore of processes.

    Given that this work has been going on for quite some time the first
    sentence in this summary is hopefully more exciting than the actual
    final code changes required. Unprivileged checkpoint/restore has seen
    a frequent increase in interest over the last two years and has thus
    been one of the main topics for the combined containers &
    checkpoint/restore microconference since at least 2018 (cf. [1]).

    Here are just the three most frequent use-cases that were brought forward:

    - The JVM developers are integrating checkpoint/restore into a Java
    VM to significantly decrease the startup time.

    - In high-performance computing environment a resource manager will
    typically be distributing jobs where users are always running as
    non-root. Long-running and "large" processes with significant
    startup times are supposed to be checkpointed and restored with
    CRIU.

    - Container migration as a non-root user.

    In all of these scenarios it is either desirable or required to run
    without CAP_SYS_ADMIN. The userspace implementation of
    checkpoint/restore CRIU already has the pull request for supporting
    unprivileged checkpoint/restore up (cf. [2]).

    To enable unprivileged checkpoint/restore a new dedicated capability
    CAP_CHECKPOINT_RESTORE is introduced. This solution has last been
    discussed in 2019 in a talk by Google at Linux Plumbers (cf. [1]
    "Update on Task Migration at Google Using CRIU") with Adrian and
    Nicolas providing the implementation now over the last months. In
    essence, this allows the CRIU binary to be installed with the
    CAP_CHECKPOINT_RESTORE vfs capability set thereby enabling
    unprivileged users to restore processes.

    To make this possible the following permissions are altered:

    - Selecting a specific PID via clone3() set_tid relaxed from userns
    CAP_SYS_ADMIN to CAP_CHECKPOINT_RESTORE.

    - Selecting a specific PID via /proc/sys/kernel/ns_last_pid relaxed
    from userns CAP_SYS_ADMIN to CAP_CHECKPOINT_RESTORE.

    - Accessing /proc/pid/map_files relaxed from init userns
    CAP_SYS_ADMIN to init userns CAP_CHECKPOINT_RESTORE.

    - Changing /proc/self/exe from userns CAP_SYS_ADMIN to userns
    CAP_CHECKPOINT_RESTORE.

    Of these four changes the /proc/self/exe change deserves a few words
    because the reasoning behind even restricting /proc/self/exe changes
    in the first place is just full of historical quirks and tracking this
    down was a questionable version of fun that I'd like to spare others.

    In short, it is trivial to change /proc/self/exe as an unprivileged
    user, i.e. without userns CAP_SYS_ADMIN right now. Either via ptrace()
    or by simply intercepting the elf loader in userspace during exec.
    Nicolas was nice enough to even provide a POC for the latter (cf. [3])
    to illustrate this fact.

    The original patchset which introduced PR_SET_MM_MAP had no
    permissions around changing the exe link. They too argued that it is
    trivial to spoof the exe link already which is true. The argument
    brought up against this was that the Tomoyo LSM uses the exe link in
    tomoyo_manager() to detect whether the calling process is a policy
    manager. This caused changing the exe links to be guarded by userns
    CAP_SYS_ADMIN.

    All in all this rather seems like a "better guard it with something
    rather than nothing" argument which imho doesn't qualify as a great
    security policy. Again, because spoofing the exe link is possible for
    the calling process so even if this were security relevant it was
    broken back then and would be broken today. So technically, dropping
    all permissions around changing the exe link would probably be
    possible and would send a clearer message to any userspace that relies
    on /proc/self/exe for security reasons that they should stop doing
    this but for now we're only relaxing the exe link permissions from
    userns CAP_SYS_ADMIN to userns CAP_CHECKPOINT_RESTORE.

    There's a final uapi change in here. Changing the exe link used to
    accidentally return EINVAL when the caller lacked the necessary
    permissions instead of the more correct EPERM. This pr contains a
    commit fixing this. I assume that userspace won't notice or care and
    if they do I will revert this commit. But since we are changing the
    permissions anyway it seems like a good opportunity to try this fix.

    With these changes merged unprivileged checkpoint/restore will be
    possible and has already been tested by various users"

    [1] LPC 2018
    1. "Task Migration at Google Using CRIU"
    https://www.youtube.com/watch?v=yI_1cuhoDgA&t=12095
    2. "Securely Migrating Untrusted Workloads with CRIU"
    https://www.youtube.com/watch?v=yI_1cuhoDgA&t=14400
    LPC 2019
    1. "CRIU and the PID dance"
    https://www.youtube.com/watch?v=LN2CUgp8deo&list=PLVsQ_xZBEyN30ZA3Pc9MZMFzdjwyz26dO&index=9&t=2m48s
    2. "Update on Task Migration at Google Using CRIU"
    https://www.youtube.com/watch?v=LN2CUgp8deo&list=PLVsQ_xZBEyN30ZA3Pc9MZMFzdjwyz26dO&index=9&t=1h2m8s

    [2] https://github.com/checkpoint-restore/criu/pull/1155

    [3] https://github.com/nviennot/run_as_exe

    * tag 'cap-checkpoint-restore-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    selftests: add clone3() CAP_CHECKPOINT_RESTORE test
    prctl: exe link permission error changed from -EINVAL to -EPERM
    prctl: Allow local CAP_CHECKPOINT_RESTORE to change /proc/self/exe
    proc: allow access in init userns for map_files with CAP_CHECKPOINT_RESTORE
    pid_namespace: use checkpoint_restore_ns_capable() for ns_last_pid
    pid: use checkpoint_restore_ns_capable() for set_tid
    capabilities: Introduce CAP_CHECKPOINT_RESTORE

    Linus Torvalds
     

20 Jul, 2020

1 commit

  • Use the newly introduced capability CAP_CHECKPOINT_RESTORE to allow
    using clone3() with set_tid set.
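
    The relaxed check can be modelled as a test on an effective capability
    bitmask. This is a simplified sketch, not the kernel code: it assumes
    that CAP_SYS_ADMIN continues to be accepted alongside the new
    capability, it ignores user namespaces entirely, and may_use_set_tid()
    is a hypothetical name. The bit numbers are those of CAP_SYS_ADMIN (21)
    and CAP_CHECKPOINT_RESTORE (40).

```c
#include <stdbool.h>
#include <stdint.h>

#define CAP_SYS_ADMIN_BIT          21  /* CAP_SYS_ADMIN */
#define CAP_CHECKPOINT_RESTORE_BIT 40  /* CAP_CHECKPOINT_RESTORE */

/*
 * Hypothetical helper: given an effective capability mask (the value shown
 * in the CapEff field of /proc/<pid>/status), report whether the relaxed
 * rule would permit clone3() with set_tid. Namespaces are not modelled.
 */
bool may_use_set_tid(uint64_t cap_eff)
{
    uint64_t allowed = (UINT64_C(1) << CAP_SYS_ADMIN_BIT) |
                       (UINT64_C(1) << CAP_CHECKPOINT_RESTORE_BIT);

    return (cap_eff & allowed) != 0;
}
```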

    Signed-off-by: Adrian Reber
    Signed-off-by: Nicolas Viennot
    Reviewed-by: Serge Hallyn
    Acked-by: Christian Brauner
    Link: https://lore.kernel.org/r/20200719100418.2112740-3-areber@redhat.com
    Signed-off-by: Christian Brauner

    Adrian Reber
     

14 Jul, 2020

2 commits

  • Replace the open-coded version of receive_fd() with a call to the
    new helper.

    Thanks to Vamshi K Sthambamkadi for
    catching a missed fput() in an earlier version of this patch.

    Cc: Christoph Hellwig
    Cc: Jakub Kicinski
    Cc: netdev@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Reviewed-by: Sargun Dhillon
    Acked-by: Christian Brauner
    Signed-off-by: Kees Cook

    Kees Cook
     
  • The sock counting (sock_update_netprioidx() and sock_update_classid())
    was missing from pidfd's implementation of received fd installation. Add
    a call to the new __receive_sock() helper.

    Cc: Christian Brauner
    Cc: Christoph Hellwig
    Cc: Sargun Dhillon
    Cc: Jakub Kicinski
    Cc: netdev@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Cc: stable@vger.kernel.org
    Fixes: 8649c322f75c ("pid: Implement pidfd_getfd syscall")
    Signed-off-by: Kees Cook

    Kees Cook
     

30 Apr, 2020

1 commit

  • Starting from 2c4704756cab ("pids: Move the pgrp and session pid pointers
    from task_struct to signal_struct") __task_pid_nr_ns() doesn't dereference
    task->group_leader, we can remove the pid_alive() check.

    pid_nr_ns() has to check pid != NULL anyway, pid_alive() just adds the
    unnecessary confusion.

    Signed-off-by: Oleg Nesterov
    Reviewed-by: "Eric W. Biederman"
    Acked-by: Christian Brauner
    Signed-off-by: Eric W. Biederman

    Oleg Nesterov
     

29 Apr, 2020

1 commit

  • When the thread group leader changes during exec and the old leader's
    thread is reaped, proc_flush_pid will flush the dentries for the entire
    process because the leader still has its original pid.

    Fix this by exchanging the pids in an rcu safe manner,
    and wrapping the code to do that up in a helper exchange_tids.

    When I removed switch_exec_pids and introduced this behavior
    in d73d65293e3e ("[PATCH] pidhash: kill switch_exec_pids") there
    really was nothing that cared as flushing happened with
    the cached dentry and de_thread flushed both of them on exec.

    This lack of fully exchanging pids became a problem a few months later
    when I introduced 48e6484d4902 ("[PATCH] proc: Rewrite the proc dentry
    flush on exit optimization"), which overlooked that the de_thread case
    was no longer swapping pids while I was looking up proc dentries
    by task->pid.

    The current behavior isn't properly a bug as everything in proc will
    continue to work correctly just a little bit less efficiently. Fix
    this just so there are no little surprise corner cases waiting to bite
    people.

    -- Oleg points out this could be an issue in next_tgid in proc where
    has_group_leader_pid is called, and reordering some of the assignments
    should fix that.

    -- Oleg points out this will break the 10 year old hack in __exit_signal.c
    > /*
    >  * This can only happen if the caller is de_thread().
    >  * FIXME: this is the temporary hack, we should teach
    >  * posix-cpu-timers to handle this case correctly.
    >  */
    > if (unlikely(has_group_leader_pid(tsk)))
    >         posix_cpu_timers_exit_group(tsk);

    The code in next_tgid has been changed to use PIDTYPE_TGID,
    and the posix cpu timers code has been fixed so it does not
    need the 10 year old hack, so this should be safe to merge
    now.

    Link: https://lore.kernel.org/lkml/87h7x3ajll.fsf_-_@x220.int.ebiederm.org/
    Acked-by: Linus Torvalds
    Acked-by: Oleg Nesterov
    Fixes: 48e6484d4902 ("[PATCH] proc: Rewrite the proc dentry flush on exit optimization").
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

11 Apr, 2020

1 commit


10 Apr, 2020

1 commit

  • syzbot wrote:
    > ========================================================
    > WARNING: possible irq lock inversion dependency detected
    > 5.6.0-syzkaller #0 Not tainted
    > --------------------------------------------------------
    > swapper/1/0 just changed the state of lock:
    > ffffffff898090d8 (tasklist_lock){.+.?}-{2:2}, at: send_sigurg+0x9f/0x320 fs/fcntl.c:840
    > but this lock took another, SOFTIRQ-unsafe lock in the past:
    > (&pid->wait_pidfd){+.+.}-{2:2}
    >
    >
    > and interrupts could create inverse lock ordering between them.
    >
    >
    > other info that might help us debug this:
    > Possible interrupt unsafe locking scenario:
    >
    >        CPU0                    CPU1
    >        ----                    ----
    >   lock(&pid->wait_pidfd);
    >                                local_irq_disable();
    >                                lock(tasklist_lock);
    >                                lock(&pid->wait_pidfd);
    >   <Interrupt>
    >     lock(tasklist_lock);
    >
    > *** DEADLOCK ***
    >
    > 4 locks held by swapper/1/0:

    The problem is that because wait_pidfd.lock is taken under the
    tasklist lock, it must always be taken with irqs disabled: tasklist_lock
    can be taken from interrupt context, and if wait_pidfd.lock was already
    taken this would create a lock order inversion.

    Oleg suggested just disabling irqs where I have added extra calls to
    wait_pidfd.lock. That should be safe and I think the code will eventually
    do that. It was rightly pointed out by Christian that sharing the
    wait_pidfd.lock was a premature optimization.

    It is also true that my pre-merge window testing was insufficient. So
    remove the premature optimization and give struct pid a dedicated lock of
    its own for struct pid things. I have verified that lockdep sees all 3
    paths where we take the new pid->lock and lockdep does not complain.

    It is my current day dream that one day pid->lock can be used to guard the
    task lists as well and then the tasklist_lock won't need to be held to
    deliver signals. That will require taking pid->lock with irqs disabled.

    Acked-by: Christian Brauner
    Link: https://lore.kernel.org/lkml/00000000000011d66805a25cd73f@google.com/
    Cc: Oleg Nesterov
    Cc: Christian Brauner
    Reported-by: syzbot+343f75cdeea091340956@syzkaller.appspotmail.com
    Reported-by: syzbot+832aabf700bc3ec920b9@syzkaller.appspotmail.com
    Reported-by: syzbot+f675f964019f884dbd0f@syzkaller.appspotmail.com
    Reported-by: syzbot+a9fb1457d720a55d6dc5@syzkaller.appspotmail.com
    Fixes: 7bc3e6e55acf ("proc: Use a list of inodes to flush from proc")
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

03 Apr, 2020

1 commit

  • Pull exec/proc updates from Eric Biederman:
    "This contains two significant pieces of work: the work to sort out
    proc_flush_task, and the work to solve a deadlock between strace and
    exec.

    Fixing proc_flush_task so that it no longer requires a persistent
    mount makes improvements to proc possible. The removal of the
    persistent mount solves an old regression that caused the hidepid
    mount option to only work on remount not on mount. The regression was
    found and reported by the Android folks. This further allows Alexey
    Gladkov's work making proc mount options specific to an individual
    mount of proc to move forward.

    The work on exec starts solving a long standing issue with exec that
    it takes mutexes of blocking userspace applications, which makes exec
    extremely deadlock prone. For the moment this adds a second mutex with
    a narrower scope that handles all of the easy cases. Which makes the
    tricky cases easy to spot. With a little luck the code to solve those
    deadlocks will be ready by next merge window"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (25 commits)
    signal: Extend exec_id to 64bits
    pidfd: Use new infrastructure to fix deadlocks in execve
    perf: Use new infrastructure to fix deadlocks in execve
    proc: io_accounting: Use new infrastructure to fix deadlocks in execve
    proc: Use new infrastructure to fix deadlocks in execve
    kernel/kcmp.c: Use new infrastructure to fix deadlocks in execve
    kernel: doc: remove outdated comment cred.c
    mm: docs: Fix a comment in process_vm_rw_core
    selftests/ptrace: add test cases for dead-locks
    exec: Fix a deadlock in strace
    exec: Add exec_update_mutex to replace cred_guard_mutex
    exec: Move exec_mmap right after de_thread in flush_old_exec
    exec: Move cleanup of posix timers on exec out of de_thread
    exec: Factor unshare_sighand out of de_thread and call it separately
    exec: Only compute current once in flush_old_exec
    pid: Improve the comment about waiting in zap_pid_ns_processes
    proc: Remove the now unnecessary internal mount of proc
    uml: Create a private mount of proc for mconsole
    uml: Don't consult current to find the proc_mnt in mconsole_proc
    proc: Use a list of inodes to flush from proc
    ...

    Linus Torvalds
     

25 Mar, 2020

1 commit

  • This changes __pidfd_fget to use the new exec_update_mutex
    instead of cred_guard_mutex.

    This should be safe, as the credentials do not change
    before exec_update_mutex is locked. Therefore whatever
    file access is possible with holding the cred_guard_mutex
    here is also possible with the exec_update_mutex.

    Signed-off-by: Bernd Edlinger
    Signed-off-by: Eric W. Biederman

    Bernd Edlinger
     

10 Mar, 2020

1 commit

  • The alloc_pid() codepath used to be simpler. With the introduction of the
    ability to choose specific pids in 49cb2fc42ce4 ("fork: extend clone3() to
    support setting a PID") it got more complex. It hasn't been super obvious
    that ENOMEM is returned when the pid namespace init process/child subreaper
    of the pid namespace has died, as can be seen from multiple attempts to
    improve this; see e.g. [1] and most recently [2].
    We regressed returning ENOMEM in [3] and [2] restored it. Let's add a
    comment on top explaining that this is historic and documented behavior and
    cannot easily be changed.

    [1]: 35f71bc0a09a ("fork: report pid reservation failure properly")
    [2]: b26ebfe12f34 ("pid: Fix error return value in some cases")
    [3]: 49cb2fc42ce4 ("fork: extend clone3() to support setting a PID")
    Signed-off-by: Christian Brauner

    Christian Brauner
     

08 Mar, 2020

1 commit

  • Recent changes to alloc_pid() allow the pid number to be specified on
    the command line. If set_tid_size is set, then the code scanning the
    levels will hard-set retval to -EPERM, overriding its previous -ENOMEM
    value.

    After the code scanning the levels, there are error returns that do not
    set retval, assuming it is still set to -ENOMEM.

    So set retval back to -ENOMEM after scanning the levels.
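
    The failure mode can be reproduced in a small standalone model. This is
    a simplified sketch and not the kernel's alloc_pid(): the function name
    alloc_model() and its parameters are hypothetical, and the permission
    check is assumed to pass so that only the stale return value is shown.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical model of the control flow described above.
 *   levels        number of pid namespace levels scanned
 *   use_set_tid   whether a specific pid was requested
 *   late_failure  whether an error path after the scan is taken
 *   reset_retval  whether the fix (resetting retval) is applied
 */
int alloc_model(int levels, bool use_set_tid, bool late_failure,
                bool reset_retval)
{
    int retval = -ENOMEM;

    for (int i = 0; i < levels; i++) {
        if (use_set_tid) {
            /* The scan hard-sets retval before its permission check. */
            retval = -EPERM;
            /* In this model the permission check always passes. */
        }
    }

    if (reset_retval)
        retval = -ENOMEM;   /* the fix: restore the expected default */

    if (late_failure)
        return retval;      /* error path that assumes retval == -ENOMEM */

    return 0;
}
```

    Without the reset, a late failure on the set_tid path leaks -EPERM to
    the caller; with it, the documented -ENOMEM is returned again.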

    Fixes: 49cb2fc42ce4 ("fork: extend clone3() to support setting a PID")
    Signed-off-by: Corey Minyard
    Acked-by: Christian Brauner
    Cc: Andrei Vagin
    Cc: Dmitry Safonov
    Cc: Oleg Nesterov
    Cc: Adrian Reber
    Cc: # 5.5
    Link: https://lore.kernel.org/r/20200306172314.12232-1-minyard@acm.org
    [christian.brauner@ubuntu.com: fixup commit message]
    Signed-off-by: Christian Brauner

    Corey Minyard
     

29 Feb, 2020

1 commit

  • There remains no more code in the kernel using pids_ns->proc_mnt,
    therefore remove it from the kernel.

    The big benefit of this change is that one of the most error prone and
    tricky parts of the pid namespace implementation, maintaining kernel
    mounts of proc is removed.

    In addition removing the unnecessary complexity of the kernel mount
    fixes a regression that caused the proc mount options to be ignored.
    Now that the initial mount of proc comes from userspace, those mount
    options are again honored. This fixes Android's usage of the proc
    hidepid option.

    Reported-by: Alistair Strachan
    Fixes: e94591d0d90c ("proc: Convert proc_mount to use mount_ns.")
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

25 Feb, 2020

1 commit

  • Rework the flushing of proc to use a list of directory inodes that
    need to be flushed.

    The list is kept on struct pid not on struct task_struct, as there is
    a fixed connection between proc inodes and pids but at least for the
    case of de_thread the pid of a task_struct changes.

    This removes the dependency on proc_mnt which allows for different
    mounts of proc having different mount options even in the same pid
    namespace and this allows for the removal of proc_mnt which will
    trivially allow the first mount of proc to honor its mount options.

    This flushing remains an optimization. The functions
    pid_delete_dentry and pid_revalidate ensure that ordinary dcache
    management will not attempt to use dentries past the point their
    respective task has died. When unused the shrinker will
    eventually be able to remove these dentries.

    There is a case in de_thread where proc_flush_pid can be
    called early for a given pid. Which winds up being
    safe (if suboptimal) as this is just an optimization.

    Only pid directories are put on the list as the other
    per pid files are children of those directories and
    d_invalidate on the directory will get them as well.

    So that the pid can be used during flushing its reference count is
    taken in release_task and dropped in proc_flush_pid. Further the call
    of proc_flush_pid is moved after the tasklist_lock is released in
    release_task so that it is certain that the pid has already been
    unhashed when flushing is taking place. This removes a small race
    where a dentry could be recreated.

    As struct pid is supposed to be small and I need a per pid lock
    I reuse the only lock that currently exists in struct pid, the
    wait_pidfd.lock.

    The net result is that this adds all of this functionality
    with just a little extra list management overhead and
    a single extra pointer in struct pid.

    v2: Initialize pid->inodes. I somehow failed to get that
    initialization into the initial version of the patch. A boot
    failure was reported by "kernel test robot", and
    failure to initialize that pid->inodes matches all of the reported
    symptoms.

    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

14 Jan, 2020

1 commit

  • This syscall allows for the retrieval of file descriptors from other
    processes, based on their pidfd. This is already possible using ptrace
    and the injection of parasitic code that leverages SCM_RIGHTS to move
    file descriptors between a tracee and a tracer. Unfortunately, ptrace
    comes with the high cost of requiring the process to be stopped, and
    it breaks debuggers. This syscall does not require stopping the process
    under manipulation.

    One reason to use this is to allow sandboxers to take actions on file
    descriptors on the behalf of another process. For example, this can be
    combined with seccomp-bpf's user notification to do on-demand fd
    extraction and take privileged actions. One such privileged action
    is binding a socket to a privileged port.

    /* prototype */
    /* flags is currently reserved and should be set to 0 */
    int sys_pidfd_getfd(int pidfd, int fd, unsigned int flags);
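
    From userspace the call can be reached through syscall(2) when the C
    library has no wrapper. The sketch below is an assumption-laden example:
    pidfd_getfd_wrapper() is a hypothetical name, and the fallback number
    438 is the value used on x86_64 and the asm-generic syscall table, so
    other architectures may differ.

```c
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_pidfd_getfd
#define SYS_pidfd_getfd 438  /* assumed: x86_64 / asm-generic number */
#endif

/*
 * Hypothetical thin wrapper around the raw syscall. Returns a new file
 * descriptor in the caller on success, or -1 with errno set on failure.
 * flags is reserved and should be 0, as stated in the prototype above.
 */
int pidfd_getfd_wrapper(int pidfd, int targetfd, unsigned int flags)
{
    return (int)syscall(SYS_pidfd_getfd, pidfd, targetfd, flags);
}
```

    Passing an invalid pidfd such as -1 fails with -1, and a kernel without
    the syscall reports ENOSYS, so callers should always check errno.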

    /* testing */
    Ran self-test suite on x86_64

    Signed-off-by: Sargun Dhillon
    Acked-by: Christian Brauner
    Reviewed-by: Arnd Bergmann
    Link: https://lore.kernel.org/r/20200107175927.4558-3-sargun@sargun.me
    Signed-off-by: Christian Brauner

    Sargun Dhillon
     

16 Nov, 2019

1 commit

  • The main motivation to add set_tid to clone3() is CRIU.

    To restore a process with the same PID/TID CRIU currently uses
    /proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to
    ns_last_pid and then (quickly) does a clone(). This works most of the
    time, but it is racy. It is also slow as it requires multiple syscalls.

    Extending clone3() to support *set_tid makes it possible to restore a
    process using CRIU without accessing /proc/sys/kernel/ns_last_pid and
    race free (as long as the desired PID/TID is available).

    This clone3() extension places the same restrictions (CAP_SYS_ADMIN)
    on clone3() with *set_tid as they are currently in place for ns_last_pid.

    The original version of this change was using a single value for
    set_tid. At the 2019 LPC, after presenting set_tid, it was, however,
    decided to change set_tid to an array to enable setting the PID of a
    process in multiple PID namespaces at the same time. If a process is
    created in a PID namespace it is possible to influence the PID inside
    and outside of the PID namespace. Details also in the corresponding
    selftest.

    To create a process with the following PIDs:

    PID NS level     Requested PID
    0 (host)         31496
    1                   42
    2                    1

    For that example the two newly introduced parameters to struct
    clone_args (set_tid and set_tid_size) would need to be:

    set_tid[0] = 1;
    set_tid[1] = 42;
    set_tid[2] = 31496;
    set_tid_size = 3;

    If only the PIDs of the two innermost nested PID namespaces should be
    defined it would look like this:

    set_tid[0] = 1;
    set_tid[1] = 42;
    set_tid_size = 2;

    The PID of the newly created process would then be the next available
    free PID in the PID namespace level 0 (host) and 42 in the PID namespace
    at level 1 and the PID of the process in the innermost PID namespace
    would be 1.

    The set_tid array is used to specify the PID of a process starting
    from the innermost nested PID namespaces up to set_tid_size PID namespaces.

    set_tid_size cannot be larger than the current PID namespace level.
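
    The innermost-first ordering is easy to get backwards, so here is a
    small sketch that builds the array from a table listed by namespace
    level. build_set_tid() is a hypothetical helper for illustration; it
    does not call clone3() and only fills the array that would be passed in
    struct clone_args.

```c
#include <stddef.h>
#include <sys/types.h>

/*
 * Hypothetical helper. pids_by_level[i] is the requested PID at PID
 * namespace level i, with level 0 being the host. The innermost `count`
 * levels are written to set_tid, innermost first. Returns the value to
 * use for set_tid_size, or 0 if count exceeds the number of levels.
 */
size_t build_set_tid(const pid_t *pids_by_level, size_t levels,
                     size_t count, pid_t *set_tid)
{
    if (count > levels)
        return 0;
    for (size_t i = 0; i < count; i++)
        set_tid[i] = pids_by_level[levels - 1 - i];
    return count;
}
```

    For the table above, a request for all three levels yields 1, 42, 31496
    and a request for two levels yields 1, 42, matching both examples in
    the message.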

    Signed-off-by: Adrian Reber
    Reviewed-by: Christian Brauner
    Reviewed-by: Oleg Nesterov
    Reviewed-by: Dmitry Safonov
    Acked-by: Andrei Vagin
    Link: https://lore.kernel.org/r/20191115123621.142252-1-areber@redhat.com
    Signed-off-by: Christian Brauner

    Adrian Reber
     

17 Oct, 2019

2 commits

  • Use the new pid_has_task() helper in pidfd_open(). This simplifies the
    code and avoids taking rcu_read_{lock,unlock}() and leads to overall
    nicer code.

    Signed-off-by: Christian Brauner
    Reviewed-by: Oleg Nesterov
    Link: https://lore.kernel.org/r/20191017101832.5985-5-christian.brauner@ubuntu.com

    Christian Brauner
     
  • Replace hlist_empty() with the new pid_has_task() helper which is more
    idiomatic, easier to grep for, and unifies how callers perform this
    check.

    Signed-off-by: Christian Brauner
    Reviewed-by: Oleg Nesterov
    Link: https://lore.kernel.org/r/20191017101832.5985-3-christian.brauner@ubuntu.com

    Christian Brauner
     

17 Jul, 2019

1 commit

  • struct pid's count is an atomic_t field used as a refcount. Use
    refcount_t for it which is basically atomic_t but does additional
    checking to prevent use-after-free bugs.

    For memory ordering, the only change is with the following:

    -       if ((atomic_read(&pid->count) == 1) ||
    -            atomic_dec_and_test(&pid->count)) {
    +       if (refcount_dec_and_test(&pid->count)) {
                    kmem_cache_free(ns->pid_cachep, pid);

    Here the change is from: Fully ordered --> RELEASE + ACQUIRE (as per
    refcount-vs-atomic.rst) This ACQUIRE should take care of making sure the
    free happens after the refcount_dec_and_test().

    The above hunk also removes atomic_read() since it is not needed for the
    code to work and it is unclear how beneficial it is. The removal lets
    refcount_dec_and_test() check for cases where get_pid() happened before
    the object was freed.
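
    The RELEASE + ACQUIRE pairing mentioned above can be sketched with C11
    atomics. This is a userspace approximation with hypothetical names
    (toy_ref and friends); it is not the kernel's refcount_t and it leaves
    out the saturation and use-after-free checks that refcount_t adds.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a reference count. */
struct toy_ref {
    atomic_int count;
};

void toy_ref_init(struct toy_ref *r, int n)
{
    atomic_init(&r->count, n);
}

void toy_ref_get(struct toy_ref *r)
{
    atomic_fetch_add_explicit(&r->count, 1, memory_order_relaxed);
}

/* Drops one reference; returns true if the caller dropped the last one. */
bool toy_ref_put_and_test(struct toy_ref *r)
{
    /* RELEASE: earlier writes to the object are ordered before the drop. */
    if (atomic_fetch_sub_explicit(&r->count, 1, memory_order_release) == 1) {
        /* ACQUIRE: the free that follows cannot move before the drop. */
        atomic_thread_fence(memory_order_acquire);
        return true;
    }
    return false;
}
```

    Only the caller that sees true may free the object, which mirrors the
    single refcount_dec_and_test() condition in the hunk.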

    Link: http://lkml.kernel.org/r/20190701183826.191936-1-joel@joelfernandes.org
    Signed-off-by: Joel Fernandes (Google)
    Reviewed-by: Andrea Parri
    Reviewed-by: Kees Cook
    Cc: Mathieu Desnoyers
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Will Deacon
    Cc: Paul E. McKenney
    Cc: Elena Reshetova
    Cc: Jann Horn
    Cc: Eric W. Biederman
    Cc: KJ Tsanaktsidis
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joel Fernandes (Google)
     

11 Jul, 2019

1 commit

  • Pull pidfd updates from Christian Brauner:
    "This adds two main features.

    - First, it adds polling support for pidfds. This allows process
    managers to know when a (non-parent) process dies in a race-free
    way.

    The notification mechanism used follows the same logic that is
    currently used when the parent of a task is notified of a child's
    death. With this patchset it is possible to put pidfds in an
    {e}poll loop and get reliable notifications for process (i.e.
    thread-group) exit.

    - The second feature complements the first one by making it possible
    to retrieve pollable pidfds for processes that were not created
    using CLONE_PIDFD.

    A lot of processes get created with traditional PID-based calls
    such as fork() or clone() (without CLONE_PIDFD). For these
    processes a caller can currently not create a pollable pidfd. This
    is a problem for Android's low memory killer (LMK) and service
    managers such as systemd.

    Both patchsets are accompanied by selftests.

    It's perhaps worth noting that the work done so far and the work done
    in this branch for pidfd_open() and polling support do already see
    some adoption:

    - Android is in the process of backporting this work to all their LTS
    kernels [1]

    - Service managers make use of pidfd_send_signal but will need to
    wait until we enable waiting on pidfds for full adoption.

    - And projects I maintain make use of both pidfd_send_signal and
    CLONE_PIDFD [2] and will use polling support and pidfd_open() too"

    [1] https://android-review.googlesource.com/q/topic:%22pidfd+polling+support+4.9+backport%22
    https://android-review.googlesource.com/q/topic:%22pidfd+polling+support+4.14+backport%22
    https://android-review.googlesource.com/q/topic:%22pidfd+polling+support+4.19+backport%22

    [2] https://github.com/lxc/lxc/blob/aab6e3eb73c343231cdde775db938994fc6f2803/src/lxc/start.c#L1753

    * tag 'pidfd-updates-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    tests: add pidfd_open() tests
    arch: wire-up pidfd_open()
    pid: add pidfd_open()
    pidfd: add polling selftests
    pidfd: add polling support

    Linus Torvalds
     

28 Jun, 2019

2 commits

  • This adds the pidfd_open() syscall. It allows a caller to retrieve pollable
    pidfds for a process which did not get created via CLONE_PIDFD, i.e. for a
    process that is created via traditional fork()/clone() calls that is only
    referenced by a PID:

    int pidfd = pidfd_open(1234, 0);
    ret = pidfd_send_signal(pidfd, SIGSTOP, NULL, 0);

    With the introduction of pidfds through CLONE_PIDFD it is possible to
    create pidfds at process creation time.
    However, a lot of processes get created with traditional PID-based calls
    such as fork() or clone() (without CLONE_PIDFD). For these processes a
    caller can currently not create a pollable pidfd. This is a problem for
    Android's low memory killer (LMK) and service managers such as systemd.
    Both are examples of tools that want to make use of pidfds to get reliable
    notification of process exit for non-parents (pidfd polling) and race-free
    signal sending (pidfd_send_signal()). They intend to switch to this API for
    process supervision/management as soon as possible. Having no way to get
    pollable pidfds from PID-only processes is one of the biggest blockers for
    them in adopting this api. With pidfd_open() making it possible to retrieve
    pidfds for PID-based processes we enable them to adopt this api.
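
    As a rough illustration of the flow described above, the sketch below
    opens a pidfd for an existing PID and signals the process through it.
    It assumes a Linux 5.3+ kernel and invokes the raw syscalls, since a libc
    wrapper may not be available. The wrapper names and signal_via_pidfd()
    are hypothetical, and the fallback syscall numbers are an assumption that
    holds for x86-64 and the architectures sharing the unified numbering.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Assumption: fallback numbers for x86-64 and the unified syscall table. */
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif
#ifndef SYS_pidfd_send_signal
#define SYS_pidfd_send_signal 424
#endif

/* Thin wrappers around the raw syscalls (hypothetical names). */
static int sys_pidfd_open(pid_t pid, unsigned int flags)
{
    return (int)syscall(SYS_pidfd_open, pid, flags);
}

static int sys_pidfd_send_signal(int pidfd, int sig, siginfo_t *info,
                                 unsigned int flags)
{
    return (int)syscall(SYS_pidfd_send_signal, pidfd, sig, info, flags);
}

/*
 * Signal a process that is only known by PID, going through a pidfd.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int signal_via_pidfd(pid_t pid, int sig)
{
    int pidfd, ret, saved_errno;

    assert(sig >= 0);
    pidfd = sys_pidfd_open(pid, 0);
    if (pidfd < 0)
        return -1;

    ret = sys_pidfd_send_signal(pidfd, sig, NULL, 0);
    saved_errno = errno;
    close(pidfd);
    errno = saved_errno;
    return ret;
}
```

    Note that pidfd_open() itself still takes a PID, so the caller has to
    know that the PID has not been recycled at the moment of the call (for
    example because it is the parent and has not yet reaped the child). Once
    the pidfd exists, later operations on it are not subject to PID reuse.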

    In line with Arnd's recent changes to consolidate syscall numbers across
    architectures, I have added the pidfd_open() syscall to all architectures
    at the same time.

    Signed-off-by: Christian Brauner
    Reviewed-by: David Howells
    Reviewed-by: Oleg Nesterov
    Acked-by: Arnd Bergmann
    Cc: "Eric W. Biederman"
    Cc: Kees Cook
    Cc: Joel Fernandes (Google)
    Cc: Thomas Gleixner
    Cc: Jann Horn
    Cc: Andy Lutomirski
    Cc: Andrew Morton
    Cc: Aleksa Sarai
    Cc: Linus Torvalds
    Cc: Al Viro
    Cc: linux-api@vger.kernel.org

    Christian Brauner
     
  • This patch adds polling support to pidfd.

    Android low memory killer (LMK) needs to know when a process dies once
    it is sent the kill signal. It does so by checking for the existence of
    /proc/pid which is both racy and slow. For example, if a PID is reused
    between when LMK sends a kill signal and checks for existence of the
    PID, the wrong process is possibly checked for existence.
    Using the polling support, LMK will be able to get notified when a process
    exits in a race-free and fast way, and can do other things
    (such as polling on other fds) while waiting for the process being killed
    to die.

    For notification to polling processes, we follow the same existing
    mechanism in the kernel used when the parent of the task group is to be
    notified of a child's death (do_notify_parent). This is precisely when the
    tasks waiting on a poll of pidfd are also awakened in this patch.
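
    From user space, the notification described above can be consumed with
    an ordinary poll() call. The sketch below is an illustration only: it
    assumes a Linux 5.3+ kernel, relies on the pidfd becoming readable
    (POLLIN) once the process has exited, and uses hypothetical helper names
    (pidfd_wait_exit(), demo_wait_for_child()). The fallback syscall number
    is an assumption valid for x86-64 and the unified syscall table.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Assumption: fallback number for x86-64 and the unified syscall table. */
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif

/*
 * Wait until the process behind pidfd exits or timeout_ms elapses.
 * Returns 1 if the process exited, 0 on timeout, -1 on error.
 */
static int pidfd_wait_exit(int pidfd, int timeout_ms)
{
    struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
    int n;

    assert(pidfd >= 0);
    do {
        n = poll(&pfd, 1, timeout_ms);
    } while (n < 0 && errno == EINTR);

    if (n < 0)
        return -1;
    if (n == 0)
        return 0;
    return (pfd.revents & POLLIN) ? 1 : -1;
}

/*
 * Demo: fork a child that exits at once and wait for it via its pidfd.
 * Returns the result of pidfd_wait_exit(), or -1 if setup failed.
 */
static int demo_wait_for_child(int timeout_ms)
{
    pid_t child = fork();
    int pidfd, ret;

    if (child < 0)
        return -1;
    if (child == 0)
        _exit(0);

    /* The child stays a zombie until reaped, so its PID cannot be reused. */
    pidfd = (int)syscall(SYS_pidfd_open, child, 0);
    if (pidfd < 0) {
        waitpid(child, NULL, 0);
        return -1;
    }

    ret = pidfd_wait_exit(pidfd, timeout_ms);
    close(pidfd);
    waitpid(child, NULL, 0);
    return ret;
}
```

    Because the pidfd is just another file descriptor, it can sit in the
    same poll set as other fds, which is what lets a supervisor do other
    work while it waits.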

    We have decided to include the waitqueue in struct pid for the following
    reasons:
    1. The wait queue has to survive for the lifetime of the poll. Including
    it in task_struct would not be an option in this case because the task can
    be reaped and destroyed before the poll returns.

    2. Including the waitqueue in struct pid means that during
    de_thread(), the new thread group leader automatically gets the new
    waitqueue/pid even though its task_struct is different.

    Appropriate test cases are added in the second patch to provide coverage of
    all the cases the patch is handling.

    Cc: Andy Lutomirski
    Cc: Steven Rostedt
    Cc: Daniel Colascione
    Cc: Jann Horn
    Cc: Tim Murray
    Cc: Jonathan Kowalski
    Cc: Linus Torvalds
    Cc: Al Viro
    Cc: Kees Cook
    Cc: David Howells
    Cc: Oleg Nesterov
    Cc: kernel-team@android.com
    Reviewed-by: Oleg Nesterov
    Co-developed-by: Daniel Colascione
    Signed-off-by: Daniel Colascione
    Signed-off-by: Joel Fernandes (Google)
    Signed-off-by: Christian Brauner

    Joel Fernandes (Google)
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

15 May, 2019

1 commit

  • Hash functions are not needed since idr is used now. Let's remove the
    hash header file for cleanup.

    Link: http://lkml.kernel.org/r/20190430053319.95913-1-scuttimmy@gmail.com
    Signed-off-by: Timmy Li
    Cc: "Eric W. Biederman"
    Cc: Michal Hocko
    Cc: Matthew Wilcox
    Cc: Oleg Nesterov
    Cc: Mike Rapoport
    Cc: KJ Tsanaktsidis
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Timmy Li
     

29 Dec, 2018

1 commit

  • The failure path removes the allocated PIDs from the wrong namespace.
    This could lead to us inadvertently reusing PIDs in the leaf namespace
    and leaking PIDs in parent namespaces.
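
    The class of bug described above can be shown with a small user-space
    model. This is a sketch, not the kernel's alloc_pid(): struct ns,
    ns_alloc(), ns_free() and alloc_all_levels() are hypothetical names, and
    a plain bool array stands in for the IDR. One id is allocated per
    namespace level, from the leaf up to the root, and on failure each id is
    released from the namespace that issued it.

```c
#include <assert.h>
#include <stdbool.h>

#define IDS_PER_NS 8

/* Hypothetical stand-in for a pid namespace and its id allocator. */
struct ns {
    bool used[IDS_PER_NS];
};

/* Return the lowest free id (starting at 1), or -1 if the namespace is full. */
static int ns_alloc(struct ns *ns)
{
    for (int id = 1; id < IDS_PER_NS; id++) {
        if (!ns->used[id]) {
            ns->used[id] = true;
            return id;
        }
    }
    return -1;
}

static void ns_free(struct ns *ns, int id)
{
    ns->used[id] = false;
}

/*
 * Allocate one id in every namespace, from the leaf (levels - 1) up to the
 * root (0); out[level] receives the id for that level.
 * Returns 0 on success, -1 on failure with nothing left allocated.
 */
static int alloc_all_levels(struct ns *chain, int levels, int *out)
{
    int i;

    assert(levels > 0);
    for (i = levels - 1; i >= 0; i--) {
        out[i] = ns_alloc(&chain[i]);
        if (out[i] < 0)
            goto unwind;
    }
    return 0;

unwind:
    /* Free each id from the namespace that issued it, not from the leaf. */
    while (++i < levels)
        ns_free(&chain[i], out[i]);
    return -1;
}
```

    Freeing every id from the leaf instead would leave the parent levels'
    ids allocated (a leak) and could clear an unrelated id in the leaf.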

    Fixes: 95846ecf9dac ("pid: replace pid bitmap implementation with IDR API")
    Cc:
    Signed-off-by: Matthew Wilcox
    Acked-by: "Eric W. Biederman"
    Reviewed-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

31 Oct, 2018

1 commit

  • Move remaining definitions and declarations from include/linux/bootmem.h
    into include/linux/memblock.h and remove the redundant header.

    The includes were replaced with the semantic patch below, followed by
    semi-automated removal of duplicated '#include <linux/memblock.h>'

    @@
    @@
    - #include <linux/bootmem.h>
    + #include <linux/memblock.h>

    [sfr@canb.auug.org.au: dma-direct: fix up for the removal of linux/bootmem.h]
    Link: http://lkml.kernel.org/r/20181002185342.133d1680@canb.auug.org.au
    [sfr@canb.auug.org.au: powerpc: fix up for removal of linux/bootmem.h]
    Link: http://lkml.kernel.org/r/20181005161406.73ef8727@canb.auug.org.au
    [sfr@canb.auug.org.au: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal]
    Link: http://lkml.kernel.org/r/20181008190341.5e396491@canb.auug.org.au
    Link: http://lkml.kernel.org/r/1536927045-23536-30-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Signed-off-by: Stephen Rothwell
    Acked-by: Michal Hocko
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Kroah-Hartman
    Cc: Guan Xuetao
    Cc: Ingo Molnar
    Cc: "James E.J. Bottomley"
    Cc: Jonas Bonn
    Cc: Jonathan Corbet
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Martin Schwidefsky
    Cc: Matt Turner
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Richard Kuo
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Serge Semin
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vineet Gupta
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

21 Sep, 2018

1 commit

  • Make the clone and fork syscalls return EAGAIN when the limit on the
    number of pids (/proc/sys/kernel/pid_max) is exceeded.

    Currently, when the pid_max limit is exceeded, the kernel will return
    ENOSPC from the fork and clone syscalls. This is contrary to the
    documented behaviour, which explicitly calls out the pid_max case as one
    where EAGAIN should be returned. It also leads to really confusing error
    messages in userspace programs which will complain about a lack of disk
    space when they fail to create processes/threads for this reason.

    This error is being returned because alloc_pid() uses the idr api to find
    a new pid; when there are none available, idr_alloc_cyclic() returns
    -ENOSPC, and this is being propagated back to userspace.

    This behaviour has been broken before, and was explicitly fixed in
    commit 35f71bc0a09a ("fork: report pid reservation failure properly"),
    so I think -EAGAIN is definitely the right thing to return in this case.
    The current behaviour change dates from commit 95846ecf9dac ("pid:
    replace pid bitmap implementation with IDR API") and was, I believe,
    unintentional.

    This patch has no impact on the case where allocating a pid fails because
    the child reaper for the namespace is dead; that case will still return
    -ENOMEM.
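
    The error translation described above amounts to a small mapping at the
    point where the allocator's result is consumed. The sketch below is a
    user-space illustration, not the kernel patch itself: mock_id_alloc()
    is a hypothetical stand-in for idr_alloc_cyclic(), and alloc_pid_nr() is
    a hypothetical caller.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical stand-in for idr_alloc_cyclic(): hands out `next` while it
 * is below `max`, otherwise reports -ENOSPC like the IDR API does.
 */
static int mock_id_alloc(int next, int max)
{
    return next < max ? next : -ENOSPC;
}

/*
 * Allocate a pid number. Exhaustion of the id space is reported as
 * -EAGAIN, the documented fork()/clone() error; any other error from the
 * allocator is passed through unchanged.
 */
static int alloc_pid_nr(int next, int pid_max)
{
    int nr;

    assert(pid_max > 0);
    nr = mock_id_alloc(next, pid_max);
    if (nr == -ENOSPC)
        nr = -EAGAIN;
    return nr;
}
```

    With this mapping a user-space caller sees "Resource temporarily
    unavailable" rather than "No space left on device" when pid_max is hit.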

    Link: http://lkml.kernel.org/r/20180903111016.46461-1-ktsanaktsidis@zendesk.com
    Fixes: 95846ecf9dac ("pid: replace pid bitmap implementation with IDR API")
    Signed-off-by: KJ Tsanaktsidis
    Reviewed-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Gargi Sharma
    Cc: Rik van Riel
    Cc: Oleg Nesterov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    KJ Tsanaktsidis
     

21 Jul, 2018

3 commits

  • Everywhere except in the pid array we distinguish between a task's pid and
    a task's tgid (thread group id). Even in the enumeration we want that
    distinction sometimes so we have added __PIDTYPE_TGID. With leader_pid
    we almost have an implementation of PIDTYPE_TGID in struct signal_struct.

    Add PIDTYPE_TGID as a first class member of the pid_type enumeration and
    into the pids array. Then remove the __PIDTYPE_TGID special case and the
    leader_pid in signal_struct.

    The net size increase is just an extra pointer added to struct pid and
    an extra pair of pointers of an hlist_node added to task_struct.

    The effect on code maintenance is the removal of a number of special
    cases today and the potential to remove many more special cases as
    PIDTYPE_TGID gets used to its fullest. The long term potential
    is allowing zombie thread group leaders to exit, which will remove
    a lot more special cases in the code.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • To access these fields the code always has to go to group leader so
    going to signal struct is no loss and is actually a fundamental simplification.

    This saves a little bit of memory by only allocating the pid pointer array
    once instead of once for every thread, and even better this removes a
    few potential races caused by the fact that group_leader can be changed
    by de_thread, while signal_struct can not.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • The cost is the same and this removes the need
    to worry about complications that come from de_thread
    and group_leader changing.

    __task_pid_nr_ns has been updated to take advantage of this change.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

12 Apr, 2018

1 commit

  • This results in no change in structure size on 64-bit machines as it
    fits in the padding between the gfp_t and the void *. 32-bit machines
    will grow the structure from 8 to 12 bytes. Almost all radix trees are
    protected with (at least) a spinlock, so as they are converted from
    radix trees to xarrays, the data structures will shrink again.

    Initialising the spinlock requires a name for the benefit of lockdep, so
    RADIX_TREE_INIT() now needs to know the name of the radix tree it's
    initialising, and so do IDR_INIT() and IDA_INIT().

    Also add the xa_lock() and xa_unlock() family of wrappers to make it
    easier to use the lock. If we could rely on -fplan9-extensions in the
    compiler, we could avoid all of this syntactic sugar, but that wasn't
    added until gcc 4.6.

    Link: http://lkml.kernel.org/r/20180313132639.17387-8-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Reviewed-by: Jeff Layton
    Cc: Darrick J. Wong
    Cc: Dave Chinner
    Cc: Ryusuke Konishi
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

07 Feb, 2018

1 commit

  • There are several functions that do find_task_by_vpid() followed by
    get_task_struct(). We can use a helper function instead.

    Link: http://lkml.kernel.org/r/1509602027-11337-1-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Acked-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

30 Jan, 2018

1 commit

  • Pull init_task initializer cleanups from David Howells:
    "It doesn't seem useful to have the init_task in a header file rather
    than in a normal source file. We could consolidate init_task handling
    instead and expand out various macros.

    Here's a series of patches that consolidate init_task handling:

    (1) Make THREAD_SIZE available to vmlinux.lds for cris, hexagon and
    openrisc.

    (2) Alter the INIT_TASK_DATA linker script macro to set
    init_thread_union and init_stack rather than defining these in C.

    Insert init_task and init_thread_info into the init_stack area in
    the linker script as appropriate to the configuration, with
    different section markers so that they end up correctly ordered.

    We can then merge ia64's init_task.c into the main one.

    We then have a bunch of single-use INIT_*() macros that seem only
    to be macros because they used to be used per-arch. We can then
    expand these in place of the user and get rid of a few lines and
    a lot of backslashes.

    (3) Expand INIT_TASK() in place.

    (4) Expand in place various small INIT_*() macros that are defined
    conditionally. Expand them and surround them by #if[n]def/#endif
    in the .c file as it takes fewer lines.

    (5) Expand INIT_SIGNALS() and INIT_SIGHAND() in place.

    (6) Expand INIT_STRUCT_PID in place.

    These macros can then be discarded"

    * tag 'init_task-20180117' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
    Expand INIT_STRUCT_PID and remove
    Expand the INIT_SIGNALS and INIT_SIGHAND macros and remove
    Expand various INIT_* macros and remove
    Expand INIT_TASK() in init/init_task.c and remove
    Construct init thread stack in the linker script rather than by union
    openrisc: Make THREAD_SIZE available to vmlinux.lds
    hexagon: Make THREAD_SIZE available to vmlinux.lds
    cris: Make THREAD_SIZE available to vmlinux.lds

    Linus Torvalds