25 Jan, 2021
1 commit
-
Change-Id: I173bc0325c93daaf82bb894f77ba35fa827864de
09 Jan, 2021
1 commit
-
[ Upstream commit f7cfd871ae0c5008d94b6f66834e7845caa93c15 ]
Recently syzbot reported[0] that there is a deadlock amongst the users
of exec_update_mutex. The problematic lock ordering found by lockdep
was:perf_event_open (exec_update_mutex -> ovl_i_mutex)
chown (ovl_i_mutex -> sb_writes)
sendfile (sb_writes -> p->lock)
by reading from a proc file and writing to overlayfs
proc_pid_syscall (p->lock -> exec_update_mutex)While looking at possible solutions it occured to me that all of the
users and possible users involved only wanted to state of the given
process to remain the same. They are all readers. The only writer is
exec.There is no reason for readers to block on each other. So fix
this deadlock by transforming exec_update_mutex into a rw_semaphore
named exec_update_lock that only exec takes for writing.Cc: Jann Horn
Cc: Vasiliy Kulikov
Cc: Al Viro
Cc: Bernd Edlinger
Cc: Oleg Nesterov
Cc: Christopher Yeoh
Cc: Cyrill Gorcunov
Cc: Sargun Dhillon
Cc: Christian Brauner
Cc: Arnd Bergmann
Cc: Peter Zijlstra
Cc: Ingo Molnar
Cc: Arnaldo Carvalho de Melo
Fixes: eea9673250db ("exec: Add exec_update_mutex to replace cred_guard_mutex")
[0] https://lkml.kernel.org/r/00000000000063640c05ade8e3de@google.com
Reported-by: syzbot+db9cdf3dd1f64252c6ef@syzkaller.appspotmail.com
Link: https://lkml.kernel.org/r/87ft4mbqen.fsf@x220.int.ebiederm.org
Signed-off-by: Eric W. Biederman
Signed-off-by: Sasha Levin
26 Oct, 2020
1 commit
-
…/linux/kernel/git/dlemoal/zonefs") into android-mainline
Steps on the way to 5.10-rc1
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I520719ae5e0d992c3756e393cb299d77d650622e
25 Oct, 2020
1 commit
-
…inux-ipmi") into android-mainline
Steps on the way to 5.10-rc1
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Idbd0577a495237bf5628333110e2c98a77b39c77
19 Oct, 2020
1 commit
-
process_madvise syscall needs pidfd_get_pid function to translate pidfd to
pid so this patch move the function to kernel/pid.c.Suggested-by: Alexander Duyck
Signed-off-by: Minchan Kim
Signed-off-by: Andrew Morton
Reviewed-by: Suren Baghdasaryan
Reviewed-by: Alexander Duyck
Reviewed-by: Vlastimil Babka
Acked-by: Christian Brauner
Acked-by: David Rientjes
Cc: Jens Axboe
Cc: Jann Horn
Cc: Brian Geffon
Cc: Daniel Colascione
Cc: Joel Fernandes
Cc: Johannes Weiner
Cc: John Dias
Cc: Kirill Tkhai
Cc: Michal Hocko
Cc: Oleksandr Natalenko
Cc: Sandeep Patil
Cc: SeongJae Park
Cc: SeongJae Park
Cc: Shakeel Butt
Cc: Sonny Rao
Cc: Tim Murray
Cc: Christian Brauner
Cc: Florian Weimer
Cc:
Link: http://lkml.kernel.org/r/20200302193630.68771-5-minchan@kernel.org
Link: http://lkml.kernel.org/r/20200622192900.22757-3-minchan@kernel.org
Link: https://lkml.kernel.org/r/20200901000633.1920247-3-minchan@kernel.org
Signed-off-by: Linus Torvalds
04 Sep, 2020
1 commit
-
Introduce PIDFD_NONBLOCK to support non-blocking pidfd file descriptors.
Ever since the introduction of pidfds and more advanced async io various
programming languages such as Rust have grown support for async event
libraries. These libraries are created to help build epoll-based event loops
around file descriptors. A common pattern is to automatically make all file
descriptors they manage to O_NONBLOCK.For such libraries the EAGAIN error code is treated specially. When a function
is called that returns EAGAIN the function isn't called again until the event
loop indicates the the file descriptor is ready. Supporting EAGAIN when
waiting on pidfds makes such libraries just work with little effort. In the
following patch we will extend waitid() internally to support non-blocking
pidfds.This introduces a new flag PIDFD_NONBLOCK that is equivalent to O_NONBLOCK.
This follows the same patterns we have for other (anon inode) file descriptors
such as EFD_NONBLOCK, IN_NONBLOCK, SFD_NONBLOCK, TFD_NONBLOCK and the same for
close-on-exec flags.Suggested-by: Josh Triplett
Signed-off-by: Christian Brauner
Reviewed-by: Josh Triplett
Reviewed-by: Oleg Nesterov
Cc: Kees Cook
Cc: Sargun Dhillon
Cc: Oleg Nesterov
Link: https://lore.kernel.org/lkml/20200811181236.GA18763@localhost/
Link: https://github.com/joshtriplett/async-pidfd
Link: https://lore.kernel.org/r/20200902102130.147672-2-christian.brauner@ubuntu.com
12 Aug, 2020
1 commit
-
- Add EXPORT_SYMBOL_GPL for find_task_by_vpid() so that drivers
can be loadable as a module.- This API is required by loadable driver module from samsung to
read process related information based on pid and thread id.
To get information on when a certain process or thread was started,
duration of run, Average load contributed by it.Signed-off-by: Abhilasha Rao
Bug: 158067689
Change-Id: I0db9cc50c93eedff0f3e9dea0ac09a5d17d118f0
05 Aug, 2020
1 commit
-
…rnel/git/brauner/linux
Pull checkpoint-restore updates from Christian Brauner:
"This enables unprivileged checkpoint/restore of processes.Given that this work has been going on for quite some time the first
sentence in this summary is hopefully more exciting than the actual
final code changes required. Unprivileged checkpoint/restore has seen
a frequent increase in interest over the last two years and has thus
been one of the main topics for the combined containers &
checkpoint/restore microconference since at least 2018 (cf. [1]).Here are just the three most frequent use-cases that were brought forward:
- The JVM developers are integrating checkpoint/restore into a Java
VM to significantly decrease the startup time.- In high-performance computing environment a resource manager will
typically be distributing jobs where users are always running as
non-root. Long-running and "large" processes with significant
startup times are supposed to be checkpointed and restored with
CRIU.- Container migration as a non-root user.
In all of these scenarios it is either desirable or required to run
without CAP_SYS_ADMIN. The userspace implementation of
checkpoint/restore CRIU already has the pull request for supporting
unprivileged checkpoint/restore up (cf. [2]).To enable unprivileged checkpoint/restore a new dedicated capability
CAP_CHECKPOINT_RESTORE is introduced. This solution has last been
discussed in 2019 in a talk by Google at Linux Plumbers (cf. [1]
"Update on Task Migration at Google Using CRIU") with Adrian and
Nicolas providing the implementation now over the last months. In
essence, this allows the CRIU binary to be installed with the
CAP_CHECKPOINT_RESTORE vfs capability set thereby enabling
unprivileged users to restore processes.To make this possible the following permissions are altered:
- Selecting a specific PID via clone3() set_tid relaxed from userns
CAP_SYS_ADMIN to CAP_CHECKPOINT_RESTORE.- Selecting a specific PID via /proc/sys/kernel/ns_last_pid relaxed
from userns CAP_SYS_ADMIN to CAP_CHECKPOINT_RESTORE.- Accessing /proc/pid/map_files relaxed from init userns
CAP_SYS_ADMIN to init userns CAP_CHECKPOINT_RESTORE.- Changing /proc/self/exe from userns CAP_SYS_ADMIN to userns
CAP_CHECKPOINT_RESTORE.Of these four changes the /proc/self/exe change deserves a few words
because the reasoning behind even restricting /proc/self/exe changes
in the first place is just full of historical quirks and tracking this
down was a questionable version of fun that I'd like to spare others.In short, it is trivial to change /proc/self/exe as an unprivileged
user, i.e. without userns CAP_SYS_ADMIN right now. Either via ptrace()
or by simply intercepting the elf loader in userspace during exec.
Nicolas was nice enough to even provide a POC for the latter (cf. [3])
to illustrate this fact.The original patchset which introduced PR_SET_MM_MAP had no
permissions around changing the exe link. They too argued that it is
trivial to spoof the exe link already which is true. The argument
brought up against this was that the Tomoyo LSM uses the exe link in
tomoyo_manager() to detect whether the calling process is a policy
manager. This caused changing the exe links to be guarded by userns
CAP_SYS_ADMIN.All in all this rather seems like a "better guard it with something
rather than nothing" argument which imho doesn't qualify as a great
security policy. Again, because spoofing the exe link is possible for
the calling process so even if this were security relevant it was
broken back then and would be broken today. So technically, dropping
all permissions around changing the exe link would probably be
possible and would send a clearer message to any userspace that relies
on /proc/self/exe for security reasons that they should stop doing
this but for now we're only relaxing the exe link permissions from
userns CAP_SYS_ADMIN to userns CAP_CHECKPOINT_RESTORE.There's a final uapi change in here. Changing the exe link used to
accidently return EINVAL when the caller lacked the necessary
permissions instead of the more correct EPERM. This pr contains a
commit fixing this. I assume that userspace won't notice or care and
if they do I will revert this commit. But since we are changing the
permissions anyway it seems like a good opportunity to try this fix.With these changes merged unprivileged checkpoint/restore will be
possible and has already been tested by various users"[1] LPC 2018
1. "Task Migration at Google Using CRIU"
https://www.youtube.com/watch?v=yI_1cuhoDgA&t=12095
2. "Securely Migrating Untrusted Workloads with CRIU"
https://www.youtube.com/watch?v=yI_1cuhoDgA&t=14400
LPC 2019
1. "CRIU and the PID dance"
https://www.youtube.com/watch?v=LN2CUgp8deo&list=PLVsQ_xZBEyN30ZA3Pc9MZMFzdjwyz26dO&index=9&t=2m48s
2. "Update on Task Migration at Google Using CRIU"
https://www.youtube.com/watch?v=LN2CUgp8deo&list=PLVsQ_xZBEyN30ZA3Pc9MZMFzdjwyz26dO&index=9&t=1h2m8s[2] https://github.com/checkpoint-restore/criu/pull/1155
[3] https://github.com/nviennot/run_as_exe
* tag 'cap-checkpoint-restore-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
selftests: add clone3() CAP_CHECKPOINT_RESTORE test
prctl: exe link permission error changed from -EINVAL to -EPERM
prctl: Allow local CAP_CHECKPOINT_RESTORE to change /proc/self/exe
proc: allow access in init userns for map_files with CAP_CHECKPOINT_RESTORE
pid_namespace: use checkpoint_restore_ns_capable() for ns_last_pid
pid: use checkpoint_restore_ns_capable() for set_tid
capabilities: Introduce CAP_CHECKPOINT_RESTORE
20 Jul, 2020
1 commit
-
Use the newly introduced capability CAP_CHECKPOINT_RESTORE to allow
using clone3() with set_tid set.Signed-off-by: Adrian Reber
Signed-off-by: Nicolas Viennot
Reviewed-by: Serge Hallyn
Acked-by: Christian Brauner
Link: https://lore.kernel.org/r/20200719100418.2112740-3-areber@redhat.com
Signed-off-by: Christian Brauner
14 Jul, 2020
2 commits
-
Replace the open-coded version of receive_fd() with a call to the
new helper.Thanks to Vamshi K Sthambamkadi for
catching a missed fput() in an earlier version of this patch.Cc: Christoph Hellwig
Cc: Jakub Kicinski
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Sargun Dhillon
Acked-by: Christian Brauner
Signed-off-by: Kees Cook -
The sock counting (sock_update_netprioidx() and sock_update_classid())
was missing from pidfd's implementation of received fd installation. Add
a call to the new __receive_sock() helper.Cc: Christian Brauner
Cc: Christoph Hellwig
Cc: Sargun Dhillon
Cc: Jakub Kicinski
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
Fixes: 8649c322f75c ("pid: Implement pidfd_getfd syscall")
Signed-off-by: Kees Cook
30 Apr, 2020
1 commit
-
Starting from 2c4704756cab ("pids: Move the pgrp and session pid pointers
from task_struct to signal_struct") __task_pid_nr_ns() doesn't dereference
task->group_leader, we can remove the pid_alive() check.pid_nr_ns() has to check pid != NULL anyway, pid_alive() just adds the
unnecessary confusion.Signed-off-by: Oleg Nesterov
Reviewed-by: "Eric W. Biederman"
Acked-by: Christian Brauner
Signed-off-by: Eric W. Biederman
29 Apr, 2020
1 commit
-
When the thread group leader changes during exec and the old leaders
thread is reaped proc_flush_pid will flush the dentries for the entire
process because the leader still has it's original pid.Fix this by exchanging the pids in an rcu safe manner,
and wrapping the code to do that up in a helper exchange_tids.When I removed switch_exec_pids and introduced this behavior
in d73d65293e3e ("[PATCH] pidhash: kill switch_exec_pids") there
really was nothing that cared as flushing happened with
the cached dentry and de_thread flushed both of them on exec.This lack of fully exchanging pids became a problem a few months later
when I introduced 48e6484d4902 ("[PATCH] proc: Rewrite the proc dentry
flush on exit optimization"). Which overlooked the de_thread case
was no longer swapping pids, and I was looking up proc dentries
by task->pid.The current behavior isn't properly a bug as everything in proc will
continue to work correctly just a little bit less efficiently. Fix
this just so there are no little surprise corner cases waiting to bite
people.-- Oleg points out this could be an issue in next_tgid in proc where
has_group_leader_pid is called, and reording some of the assignments
should fix that.-- Oleg points out this will break the 10 year old hack in __exit_signal.c
> /*
> * This can only happen if the caller is de_thread().
> * FIXME: this is the temporary hack, we should teach
> * posix-cpu-timers to handle this case correctly.
> */
> if (unlikely(has_group_leader_pid(tsk)))
> posix_cpu_timers_exit_group(tsk);The code in next_tgid has been changed to use PIDTYPE_TGID,
and the posix cpu timers code has been fixed so it does not
need the 10 year old hack, so this should be safe to merge
now.Link: https://lore.kernel.org/lkml/87h7x3ajll.fsf_-_@x220.int.ebiederm.org/
Acked-by: Linus Torvalds
Acked-by: Oleg Nesterov
Fixes: 48e6484d4902 ("[PATCH] proc: Rewrite the proc dentry flush on exit optimization").
Signed-off-by: Eric W. Biederman
11 Apr, 2020
1 commit
-
Pull proc fix from Eric Biederman:
"A brown paper bag slipped through my proc changes, and syzcaller
caught it when the code ended up in your tree.I have opted to fix it the simplest cleanest way I know how, so there
is no reasonable chance for the bug to repeat"* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
proc: Use a dedicated lock in struct pid
10 Apr, 2020
1 commit
-
syzbot wrote:
> ========================================================
> WARNING: possible irq lock inversion dependency detected
> 5.6.0-syzkaller #0 Not tainted
> --------------------------------------------------------
> swapper/1/0 just changed the state of lock:
> ffffffff898090d8 (tasklist_lock){.+.?}-{2:2}, at: send_sigurg+0x9f/0x320 fs/fcntl.c:840
> but this lock took another, SOFTIRQ-unsafe lock in the past:
> (&pid->wait_pidfd){+.+.}-{2:2}
>
>
> and interrupts could create inverse lock ordering between them.
>
>
> other info that might help us debug this:
> Possible interrupt unsafe locking scenario:
>
> CPU0 CPU1
> ---- ----
> lock(&pid->wait_pidfd);
> local_irq_disable();
> lock(tasklist_lock);
> lock(&pid->wait_pidfd);
>
> lock(tasklist_lock);
>
> *** DEADLOCK ***
>
> 4 locks held by swapper/1/0:The problem is that because wait_pidfd.lock is taken under the tasklist
lock. It must always be taken with irqs disabled as tasklist_lock can be
taken from interrupt context and if wait_pidfd.lock was already taken this
would create a lock order inversion.Oleg suggested just disabling irqs where I have added extra calls to
wait_pidfd.lock. That should be safe and I think the code will eventually
do that. It was rightly pointed out by Christian that sharing the
wait_pidfd.lock was a premature optimization.It is also true that my pre-merge window testing was insufficient. So
remove the premature optimization and give struct pid a dedicated lock of
it's own for struct pid things. I have verified that lockdep sees all 3
paths where we take the new pid->lock and lockdep does not complain.It is my current day dream that one day pid->lock can be used to guard the
task lists as well and then the tasklist_lock won't need to be held to
deliver signals. That will require taking pid->lock with irqs disabled.Acked-by: Christian Brauner
Link: https://lore.kernel.org/lkml/00000000000011d66805a25cd73f@google.com/
Cc: Oleg Nesterov
Cc: Christian Brauner
Reported-by: syzbot+343f75cdeea091340956@syzkaller.appspotmail.com
Reported-by: syzbot+832aabf700bc3ec920b9@syzkaller.appspotmail.com
Reported-by: syzbot+f675f964019f884dbd0f@syzkaller.appspotmail.com
Reported-by: syzbot+a9fb1457d720a55d6dc5@syzkaller.appspotmail.com
Fixes: 7bc3e6e55acf ("proc: Use a list of inodes to flush from proc")
Signed-off-by: "Eric W. Biederman"
03 Apr, 2020
1 commit
-
Pull exec/proc updates from Eric Biederman:
"This contains two significant pieces of work: the work to sort out
proc_flush_task, and the work to solve a deadlock between strace and
exec.Fixing proc_flush_task so that it no longer requires a persistent
mount makes improvements to proc possible. The removal of the
persistent mount solves an old regression that that caused the hidepid
mount option to only work on remount not on mount. The regression was
found and reported by the Android folks. This further allows Alexey
Gladkov's work making proc mount options specific to an individual
mount of proc to move forward.The work on exec starts solving a long standing issue with exec that
it takes mutexes of blocking userspace applications, which makes exec
extremely deadlock prone. For the moment this adds a second mutex with
a narrower scope that handles all of the easy cases. Which makes the
tricky cases easy to spot. With a little luck the code to solve those
deadlocks will be ready by next merge window"* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (25 commits)
signal: Extend exec_id to 64bits
pidfd: Use new infrastructure to fix deadlocks in execve
perf: Use new infrastructure to fix deadlocks in execve
proc: io_accounting: Use new infrastructure to fix deadlocks in execve
proc: Use new infrastructure to fix deadlocks in execve
kernel/kcmp.c: Use new infrastructure to fix deadlocks in execve
kernel: doc: remove outdated comment cred.c
mm: docs: Fix a comment in process_vm_rw_core
selftests/ptrace: add test cases for dead-locks
exec: Fix a deadlock in strace
exec: Add exec_update_mutex to replace cred_guard_mutex
exec: Move exec_mmap right after de_thread in flush_old_exec
exec: Move cleanup of posix timers on exec out of de_thread
exec: Factor unshare_sighand out of de_thread and call it separately
exec: Only compute current once in flush_old_exec
pid: Improve the comment about waiting in zap_pid_ns_processes
proc: Remove the now unnecessary internal mount of proc
uml: Create a private mount of proc for mconsole
uml: Don't consult current to find the proc_mnt in mconsole_proc
proc: Use a list of inodes to flush from proc
...
25 Mar, 2020
1 commit
-
This changes __pidfd_fget to use the new exec_update_mutex
instead of cred_guard_mutex.This should be safe, as the credentials do not change
before exec_update_mutex is locked. Therefore whatever
file access is possible with holding the cred_guard_mutex
here is also possbile with the exec_update_mutex.Signed-off-by: Bernd Edlinger
Signed-off-by: Eric W. Biederman
10 Mar, 2020
1 commit
-
The alloc_pid() codepath used to be simpler. With the introducation of the
ability to choose specific pids in 49cb2fc42ce4 ("fork: extend clone3() to
support setting a PID") it got more complex. It hasn't been super obvious
that ENOMEM is returned when the pid namespace init process/child subreaper
of the pid namespace has died. As can be seen from multiple attempts to
improve this see e.g. [1] and most recently [2].
We regressed returning ENOMEM in [3] and [2] restored it. Let's add a
comment on top explaining that this is historic and documented behavior and
cannot easily be changed.[1]: 35f71bc0a09a ("fork: report pid reservation failure properly")
[2]: b26ebfe12f34 ("pid: Fix error return value in some cases")
[3]: 49cb2fc42ce4 ("fork: extend clone3() to support setting a PID")
Signed-off-by: Christian Brauner
08 Mar, 2020
1 commit
-
Recent changes to alloc_pid() allow the pid number to be specified on
the command line. If set_tid_size is set, then the code scanning the
levels will hard-set retval to -EPERM, overriding it's previous -ENOMEM
value.After the code scanning the levels, there are error returns that do not
set retval, assuming it is still set to -ENOMEM.So set retval back to -ENOMEM after scanning the levels.
Fixes: 49cb2fc42ce4 ("fork: extend clone3() to support setting a PID")
Signed-off-by: Corey Minyard
Acked-by: Christian Brauner
Cc: Andrei Vagin
Cc: Dmitry Safonov
Cc: Oleg Nesterov
Cc: Adrian Reber
Cc: # 5.5
Link: https://lore.kernel.org/r/20200306172314.12232-1-minyard@acm.org
[christian.brauner@ubuntu.com: fixup commit message]
Signed-off-by: Christian Brauner
29 Feb, 2020
1 commit
-
There remains no more code in the kernel using pids_ns->proc_mnt,
therefore remove it from the kernel.The big benefit of this change is that one of the most error prone and
tricky parts of the pid namespace implementation, maintaining kernel
mounts of proc is removed.In addition removing the unnecessary complexity of the kernel mount
fixes a regression that caused the proc mount options to be ignored.
Now that the initial mount of proc comes from userspace, those mount
options are again honored. This fixes Android's usage of the proc
hidepid option.Reported-by: Alistair Strachan
Fixes: e94591d0d90c ("proc: Convert proc_mount to use mount_ns.")
Signed-off-by: "Eric W. Biederman"
25 Feb, 2020
1 commit
-
Rework the flushing of proc to use a list of directory inodes that
need to be flushed.The list is kept on struct pid not on struct task_struct, as there is
a fixed connection between proc inodes and pids but at least for the
case of de_thread the pid of a task_struct changes.This removes the dependency on proc_mnt which allows for different
mounts of proc having different mount options even in the same pid
namespace and this allows for the removal of proc_mnt which will
trivially the first mount of proc to honor it's mount options.This flushing remains an optimization. The functions
pid_delete_dentry and pid_revalidate ensure that ordinary dcache
management will not attempt to use dentries past the point their
respective task has died. When unused the shrinker will
eventually be able to remove these dentries.There is a case in de_thread where proc_flush_pid can be
called early for a given pid. Which winds up being
safe (if suboptimal) as this is just an optiimization.Only pid directories are put on the list as the other
per pid files are children of those directories and
d_invalidate on the directory will get them as well.So that the pid can be used during flushing it's reference count is
taken in release_task and dropped in proc_flush_pid. Further the call
of proc_flush_pid is moved after the tasklist_lock is released in
release_task so that it is certain that the pid has already been
unhashed when flushing it taking place. This removes a small race
where a dentry could recreated.As struct pid is supposed to be small and I need a per pid lock
I reuse the only lock that currently exists in struct pid the
the wait_pidfd.lock.The net result is that this adds all of this functionality
with just a little extra list management overhead and
a single extra pointer in struct pid.v2: Initialize pid->inodes. I somehow failed to get that
initialization into the initial version of the patch. A boot
failure was reported by "kernel test robot ", and
failure to initialize that pid->inodes matches all of the reported
symptoms.Signed-off-by: Eric W. Biederman
14 Jan, 2020
1 commit
-
This syscall allows for the retrieval of file descriptors from other
processes, based on their pidfd. This is possible using ptrace, and
injection of parasitic code to inject code which leverages SCM_RIGHTS
to move file descriptors between a tracee and a tracer. Unfortunately,
ptrace comes with a high cost of requiring the process to be stopped,
and breaks debuggers. This does not require stopping the process under
manipulation.One reason to use this is to allow sandboxers to take actions on file
descriptors on the behalf of another process. For example, this can be
combined with seccomp-bpf's user notification to do on-demand fd
extraction and take privileged actions. One such privileged action
is binding a socket to a privileged port./* prototype */
/* flags is currently reserved and should be set to 0 */
int sys_pidfd_getfd(int pidfd, int fd, unsigned int flags);/* testing */
Ran self-test suite on x86_64Signed-off-by: Sargun Dhillon
Acked-by: Christian Brauner
Reviewed-by: Arnd Bergmann
Link: https://lore.kernel.org/r/20200107175927.4558-3-sargun@sargun.me
Signed-off-by: Christian Brauner
16 Nov, 2019
1 commit
-
The main motivation to add set_tid to clone3() is CRIU.
To restore a process with the same PID/TID CRIU currently uses
/proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to
ns_last_pid and then (quickly) does a clone(). This works most of the
time, but it is racy. It is also slow as it requires multiple syscalls.Extending clone3() to support *set_tid makes it possible restore a
process using CRIU without accessing /proc/sys/kernel/ns_last_pid and
race free (as long as the desired PID/TID is available).This clone3() extension places the same restrictions (CAP_SYS_ADMIN)
on clone3() with *set_tid as they are currently in place for ns_last_pid.The original version of this change was using a single value for
set_tid. At the 2019 LPC, after presenting set_tid, it was, however,
decided to change set_tid to an array to enable setting the PID of a
process in multiple PID namespaces at the same time. If a process is
created in a PID namespace it is possible to influence the PID inside
and outside of the PID namespace. Details also in the corresponding
selftest.To create a process with the following PIDs:
PID NS level Requested PID
0 (host) 31496
1 42
2 1For that example the two newly introduced parameters to struct
clone_args (set_tid and set_tid_size) would need to be:set_tid[0] = 1;
set_tid[1] = 42;
set_tid[2] = 31496;
set_tid_size = 3;If only the PIDs of the two innermost nested PID namespaces should be
defined it would look like this:set_tid[0] = 1;
set_tid[1] = 42;
set_tid_size = 2;The PID of the newly created process would then be the next available
free PID in the PID namespace level 0 (host) and 42 in the PID namespace
at level 1 and the PID of the process in the innermost PID namespace
would be 1.The set_tid array is used to specify the PID of a process starting
from the innermost nested PID namespaces up to set_tid_size PID namespaces.set_tid_size cannot be larger then the current PID namespace level.
Signed-off-by: Adrian Reber
Reviewed-by: Christian Brauner
Reviewed-by: Oleg Nesterov
Reviewed-by: Dmitry Safonov
Acked-by: Andrei Vagin
Link: https://lore.kernel.org/r/20191115123621.142252-1-areber@redhat.com
Signed-off-by: Christian Brauner
17 Oct, 2019
2 commits
-
Use the new pid_has_task() helper in pidfd_open(). This simplifies the
code and avoids taking rcu_read_{lock,unlock}() and leads to overall
nicer code.Signed-off-by: Christian Brauner
Reviewed-by: Oleg Nesterov
Link: https://lore.kernel.org/r/20191017101832.5985-5-christian.brauner@ubuntu.com -
Replace hlist_empty() with the new pid_has_task() helper which is more
idiomatic, easier to grep for, and unifies how callers perform this
check.Signed-off-by: Christian Brauner
Reviewed-by: Oleg Nesterov
Link: https://lore.kernel.org/r/20191017101832.5985-3-christian.brauner@ubuntu.com
17 Jul, 2019
1 commit
-
struct pid's count is an atomic_t field used as a refcount. Use
refcount_t for it which is basically atomic_t but does additional
checking to prevent use-after-free bugs.For memory ordering, the only change is with the following:
- if ((atomic_read(&pid->count) == 1) ||
- atomic_dec_and_test(&pid->count)) {
+ if (refcount_dec_and_test(&pid->count)) {
kmem_cache_free(ns->pid_cachep, pid);Here the change is from: Fully ordered --> RELEASE + ACQUIRE (as per
refcount-vs-atomic.rst) This ACQUIRE should take care of making sure the
free happens after the refcount_dec_and_test().The above hunk also removes atomic_read() since it is not needed for the
code to work and it is unclear how beneficial it is. The removal lets
refcount_dec_and_test() check for cases where get_pid() happened before
the object was freed.Link: http://lkml.kernel.org/r/20190701183826.191936-1-joel@joelfernandes.org
Signed-off-by: Joel Fernandes (Google)
Reviewed-by: Andrea Parri
Reviewed-by: Kees Cook
Cc: Mathieu Desnoyers
Cc: Matthew Wilcox
Cc: Peter Zijlstra
Cc: Will Deacon
Cc: Paul E. McKenney
Cc: Elena Reshetova
Cc: Jann Horn
Cc: Eric W. Biederman
Cc: KJ Tsanaktsidis
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
11 Jul, 2019
1 commit
-
Pull pidfd updates from Christian Brauner:
"This adds two main features.- First, it adds polling support for pidfds. This allows process
managers to know when a (non-parent) process dies in a race-free
way.The notification mechanism used follows the same logic that is
currently used when the parent of a task is notified of a child's
death. With this patchset it is possible to put pidfds in an
{e}poll loop and get reliable notifications for process (i.e.
thread-group) exit.- The second feature compliments the first one by making it possible
to retrieve pollable pidfds for processes that were not created
using CLONE_PIDFD.A lot of processes get created with traditional PID-based calls
such as fork() or clone() (without CLONE_PIDFD). For these
processes a caller can currently not create a pollable pidfd. This
is a problem for Android's low memory killer (LMK) and service
managers such as systemd.Both patchsets are accompanied by selftests.
It's perhaps worth noting that the work done so far and the work done
in this branch for pidfd_open() and polling support do already see
some adoption:- Android is in the process of backporting this work to all their LTS
kernels [1]- Service managers make use of pidfd_send_signal but will need to
wait until we enable waiting on pidfds for full adoption.- And projects I maintain make use of both pidfd_send_signal and
CLONE_PIDFD [2] and will use polling support and pidfd_open() too"[1] https://android-review.googlesource.com/q/topic:%22pidfd+polling+support+4.9+backport%22
https://android-review.googlesource.com/q/topic:%22pidfd+polling+support+4.14+backport%22
https://android-review.googlesource.com/q/topic:%22pidfd+polling+support+4.19+backport%22[2] https://github.com/lxc/lxc/blob/aab6e3eb73c343231cdde775db938994fc6f2803/src/lxc/start.c#L1753
* tag 'pidfd-updates-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
tests: add pidfd_open() tests
arch: wire-up pidfd_open()
pid: add pidfd_open()
pidfd: add polling selftests
pidfd: add polling support
28 Jun, 2019
2 commits
-
This adds the pidfd_open() syscall. It allows a caller to retrieve pollable
pidfds for a process which did not get created via CLONE_PIDFD, i.e. for a
process that is created via traditional fork()/clone() calls that is only
referenced by a PID:int pidfd = pidfd_open(1234, 0);
ret = pidfd_send_signal(pidfd, SIGSTOP, NULL, 0);With the introduction of pidfds through CLONE_PIDFD it is possible to
created pidfds at process creation time.
However, a lot of processes get created with traditional PID-based calls
such as fork() or clone() (without CLONE_PIDFD). For these processes a
caller can currently not create a pollable pidfd. This is a problem for
Android's low memory killer (LMK) and service managers such as systemd.
Both are examples of tools that want to make use of pidfds to get reliable
notification of process exit for non-parents (pidfd polling) and race-free
signal sending (pidfd_send_signal()). They intend to switch to this API for
process supervision/management as soon as possible. Having no way to get
pollable pidfds from PID-only processes is one of the biggest blockers for
them in adopting this api. With pidfd_open() making it possible to retrieve
pidfds for PID-based processes we enable them to adopt this api.In line with Arnd's recent changes to consolidate syscall numbers across
architectures, I have added the pidfd_open() syscall to all architectures
at the same time.Signed-off-by: Christian Brauner
Reviewed-by: David Howells
Reviewed-by: Oleg Nesterov
Acked-by: Arnd Bergmann
Cc: "Eric W. Biederman"
Cc: Kees Cook
Cc: Joel Fernandes (Google)
Cc: Thomas Gleixner
Cc: Jann Horn
Cc: Andy Lutomirsky
Cc: Andrew Morton
Cc: Aleksa Sarai
Cc: Linus Torvalds
Cc: Al Viro
Cc: linux-api@vger.kernel.org -
This patch adds polling support to pidfd.
Android low memory killer (LMK) needs to know when a process dies once
it is sent the kill signal. It does so by checking for the existence of
/proc/pid which is both racy and slow. For example, if a PID is reused
between when LMK sends a kill signal and checks for existence of the
PID, since the wrong PID is now possibly checked for existence.
Using the polling support, LMK will be able to get notified when a process
exists in race-free and fast way, and allows the LMK to do other things
(such as by polling on other fds) while awaiting the process being killed
to die.For notification to polling processes, we follow the same existing
mechanism in the kernel used when the parent of the task group is to be
notified of a child's death (do_notify_parent). This is precisely when the
tasks waiting on a poll of pidfd are also awakened in this patch.We have decided to include the waitqueue in struct pid for the following
reasons:
1. The wait queue has to survive for the lifetime of the poll. Including
it in task_struct would not be option in this case because the task can
be reaped and destroyed before the poll returns.2. By including the struct pid for the waitqueue means that during
de_thread(), the new thread group leader automatically gets the new
waitqueue/pid even though its task_struct is different.Appropriate test cases are added in the second patch to provide coverage of
all the cases the patch is handling.Cc: Andy Lutomirski
Cc: Steven Rostedt
Cc: Daniel Colascione
Cc: Jann Horn
Cc: Tim Murray
Cc: Jonathan Kowalski
Cc: Linus Torvalds
Cc: Al Viro
Cc: Kees Cook
Cc: David Howells
Cc: Oleg Nesterov
Cc: kernel-team@android.com
Reviewed-by: Oleg Nesterov
Co-developed-by: Daniel Colascione
Signed-off-by: Daniel Colascione
Signed-off-by: Joel Fernandes (Google)
Signed-off-by: Christian Brauner
21 May, 2019
1 commit
-
Add SPDX license identifiers to all files which:
- Have no license information of any form
- Have EXPORT_.*_SYMBOL_GPL inside which was used in the
initial scan/conversion to ignore the fileThese files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:GPL-2.0-only
Signed-off-by: Thomas Gleixner
Signed-off-by: Greg Kroah-Hartman
15 May, 2019
1 commit
-
Hash functions are not needed since idr is used now. Let's remove hash
header file for cleanup.Link: http://lkml.kernel.org/r/20190430053319.95913-1-scuttimmy@gmail.com
Signed-off-by: Timmy Li
Cc: "Eric W. Biederman"
Cc: Michal Hocko
Cc: Matthew Wilcox
Cc: Oleg Nesterov
Cc: Mike Rapoport
Cc: KJ Tsanaktsidis
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
29 Dec, 2018
1 commit
-
The failure path removes the allocated PIDs from the wrong namespace.
This could lead to us inadvertently reusing PIDs in the leaf namespace
and leaking PIDs in parent namespaces.Fixes: 95846ecf9dac ("pid: replace pid bitmap implementation with IDR API")
Cc:
Signed-off-by: Matthew Wilcox
Acked-by: "Eric W. Biederman"
Reviewed-by: Oleg Nesterov
Signed-off-by: Linus Torvalds
31 Oct, 2018
1 commit
-
Move remaining definitions and declarations from include/linux/bootmem.h
into include/linux/memblock.h and remove the redundant header.The includes were replaced with the semantic patch below and then
semi-automated removal of duplicated '#include@@
@@
- #include
+ #include[sfr@canb.auug.org.au: dma-direct: fix up for the removal of linux/bootmem.h]
Link: http://lkml.kernel.org/r/20181002185342.133d1680@canb.auug.org.au
[sfr@canb.auug.org.au: powerpc: fix up for removal of linux/bootmem.h]
Link: http://lkml.kernel.org/r/20181005161406.73ef8727@canb.auug.org.au
[sfr@canb.auug.org.au: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal]
Link: http://lkml.kernel.org/r/20181008190341.5e396491@canb.auug.org.au
Link: http://lkml.kernel.org/r/1536927045-23536-30-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport
Signed-off-by: Stephen Rothwell
Acked-by: Michal Hocko
Cc: Catalin Marinas
Cc: Chris Zankel
Cc: "David S. Miller"
Cc: Geert Uytterhoeven
Cc: Greentime Hu
Cc: Greg Kroah-Hartman
Cc: Guan Xuetao
Cc: Ingo Molnar
Cc: "James E.J. Bottomley"
Cc: Jonas Bonn
Cc: Jonathan Corbet
Cc: Ley Foon Tan
Cc: Mark Salter
Cc: Martin Schwidefsky
Cc: Matt Turner
Cc: Michael Ellerman
Cc: Michal Simek
Cc: Palmer Dabbelt
Cc: Paul Burton
Cc: Richard Kuo
Cc: Richard Weinberger
Cc: Rich Felker
Cc: Russell King
Cc: Serge Semin
Cc: Thomas Gleixner
Cc: Tony Luck
Cc: Vineet Gupta
Cc: Yoshinori Sato
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
21 Sep, 2018
1 commit
-
Make the clone and fork syscalls return EAGAIN when the limit on the
number of pids /proc/sys/kernel/pid_max is exceeded.Currently, when the pid_max limit is exceeded, the kernel will return
ENOSPC from the fork and clone syscalls. This is contrary to the
documented behaviour, which explicitly calls out the pid_max case as one
where EAGAIN should be returned. It also leads to really confusing error
messages in userspace programs which will complain about a lack of disk
space when they fail to create processes/threads for this reason.This error is being returned because alloc_pid() uses the idr api to find
a new pid; when there are none available, idr_alloc_cyclic() returns
-ENOSPC, and this is being propagated back to userspace.This behaviour has been broken before, and was explicitly fixed in
commit 35f71bc0a09a ("fork: report pid reservation failure properly"),
so I think -EAGAIN is definitely the right thing to return in this case.
The current behaviour change dates from commit 95846ecf9dac ("pid:
replace pid bitmap implementation with IDR AIP") and was I believe
unintentional.This patch has no impact on the case where allocating a pid fails because
the child reaper for the namespace is dead; that case will still return
-ENOMEM.Link: http://lkml.kernel.org/r/20180903111016.46461-1-ktsanaktsidis@zendesk.com
Fixes: 95846ecf9dac ("pid: replace pid bitmap implementation with IDR AIP")
Signed-off-by: KJ Tsanaktsidis
Reviewed-by: Andrew Morton
Acked-by: Michal Hocko
Cc: Gargi Sharma
Cc: Rik van Riel
Cc: Oleg Nesterov
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Greg Kroah-Hartman
21 Jul, 2018
3 commits
-
Everywhere except in the pid array we distinguish between a tasks pid and
a tasks tgid (thread group id). Even in the enumeration we want that
distinction sometimes so we have added __PIDTYPE_TGID. With leader_pid
we almost have an implementation of PIDTYPE_TGID in struct signal_struct.Add PIDTYPE_TGID as a first class member of the pid_type enumeration and
into the pids array. Then remove the __PIDTYPE_TGID special case and the
leader_pid in signal_struct.The net size increase is just an extra pointer added to struct pid and
an extra pair of pointers of an hlist_node added to task_struct.The effect on code maintenance is the removal of a number of special
cases today and the potential to remove many more special cases as
PIDTYPE_TGID gets used to it's fullest. The long term potential
is allowing zombie thread group leaders to exit, which will remove
a lot more special cases in the code.Signed-off-by: "Eric W. Biederman"
-
To access these fields the code always has to go to group leader so
going to signal struct is no loss and is actually a fundamental simplification.This saves a little bit of memory by only allocating the pid pointer array
once instead of once for every thread, and even better this removes a
few potential races caused by the fact that group_leader can be changed
by de_thread, while signal_struct can not.Signed-off-by: "Eric W. Biederman"
-
The cost is the the same and this removes the need
to worry about complications that come from de_thread
and group_leader changing.__task_pid_nr_ns has been updated to take advantage of this change.
Signed-off-by: "Eric W. Biederman"
12 Apr, 2018
1 commit
-
This results in no change in structure size on 64-bit machines as it
fits in the padding between the gfp_t and the void *. 32-bit machines
will grow the structure from 8 to 12 bytes. Almost all radix trees are
protected with (at least) a spinlock, so as they are converted from
radix trees to xarrays, the data structures will shrink again.Initialising the spinlock requires a name for the benefit of lockdep, so
RADIX_TREE_INIT() now needs to know the name of the radix tree it's
initialising, and so do IDR_INIT() and IDA_INIT().Also add the xa_lock() and xa_unlock() family of wrappers to make it
easier to use the lock. If we could rely on -fplan9-extensions in the
compiler, we could avoid all of this syntactic sugar, but that wasn't
added until gcc 4.6.Link: http://lkml.kernel.org/r/20180313132639.17387-8-willy@infradead.org
Signed-off-by: Matthew Wilcox
Reviewed-by: Jeff Layton
Cc: Darrick J. Wong
Cc: Dave Chinner
Cc: Ryusuke Konishi
Cc: Will Deacon
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
07 Feb, 2018
1 commit
-
There are several functions that do find_task_by_vpid() followed by
get_task_struct(). We can use a helper function instead.Link: http://lkml.kernel.org/r/1509602027-11337-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport
Acked-by: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
30 Jan, 2018
1 commit
-
Pull init_task initializer cleanups from David Howells:
"It doesn't seem useful to have the init_task in a header file rather
than in a normal source file. We could consolidate init_task handling
instead and expand out various macros.Here's a series of patches that consolidate init_task handling:
(1) Make THREAD_SIZE available to vmlinux.lds for cris, hexagon and
openrisc.(2) Alter the INIT_TASK_DATA linker script macro to set
init_thread_union and init_stack rather than defining these in C.Insert init_task and init_thread_into into the init_stack area in
the linker script as appropriate to the configuration, with
different section markers so that they end up correctly ordered.We can then get merge ia64's init_task.c into the main one.
We then have a bunch of single-use INIT_*() macros that seem only
to be macros because they used to be used per-arch. We can then
expand these in place of the user and get rid of a few lines and
a lot of backslashes.(3) Expand INIT_TASK() in place.
(4) Expand in place various small INIT_*() macros that are defined
conditionally. Expand them and surround them by #if[n]def/#endif
in the .c file as it takes fewer lines.(5) Expand INIT_SIGNALS() and INIT_SIGHAND() in place.
(6) Expand INIT_STRUCT_PID in place.
These macros can then be discarded"
* tag 'init_task-20180117' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
Expand INIT_STRUCT_PID and remove
Expand the INIT_SIGNALS and INIT_SIGHAND macros and remove
Expand various INIT_* macros and remove
Expand INIT_TASK() in init/init_task.c and remove
Construct init thread stack in the linker script rather than by union
openrisc: Make THREAD_SIZE available to vmlinux.lds
hexagon: Make THREAD_SIZE available to vmlinux.lds
cris: Make THREAD_SIZE available to vmlinux.lds