19 Oct, 2020

1 commit

  • process_madvise syscall needs pidfd_get_pid function to translate pidfd to
    pid so this patch moves the function to kernel/pid.c.

    Suggested-by: Alexander Duyck
    Signed-off-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Reviewed-by: Suren Baghdasaryan
    Reviewed-by: Alexander Duyck
    Reviewed-by: Vlastimil Babka
    Acked-by: Christian Brauner
    Acked-by: David Rientjes
    Cc: Jens Axboe
    Cc: Jann Horn
    Cc: Brian Geffon
    Cc: Daniel Colascione
    Cc: Joel Fernandes
    Cc: Johannes Weiner
    Cc: John Dias
    Cc: Kirill Tkhai
    Cc: Michal Hocko
    Cc: Oleksandr Natalenko
    Cc: Sandeep Patil
    Cc: SeongJae Park
    Cc: Shakeel Butt
    Cc: Sonny Rao
    Cc: Tim Murray
    Cc: Christian Brauner
    Cc: Florian Weimer
    Link: http://lkml.kernel.org/r/20200302193630.68771-5-minchan@kernel.org
    Link: http://lkml.kernel.org/r/20200622192900.22757-3-minchan@kernel.org
    Link: https://lkml.kernel.org/r/20200901000633.1920247-3-minchan@kernel.org
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

05 Jun, 2020

1 commit

  • Pull proc updates from Eric Biederman:
    "This has four sets of changes:

    - modernize proc to support multiple private instances

    - ensure we see the exit of each process tid exactly once

    - remove has_group_leader_pid

    - use pids not tasks in posix-cpu-timers lookup

    Alexey updated proc so each mount of proc uses a new superblock. This
    allows people to actually use mount options with proc with no fear of
    messing up another mount of proc. Given the kernel's internal mounts
    of proc for things like uml this was a real problem, and resulted in
    Android's hidepid mount options being ignored and introducing security
    issues.

    The rest of the changes are small cleanups and fixes that came out of
    my work to allow this change to proc. In essence it is swapping the
    pids in de_thread during exec which removes a special case the code
    had to handle. Then updating the code to stop handling that special
    case"

    * 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    proc: proc_pid_ns takes super_block as an argument
    remove the no longer needed pid_alive() check in __task_pid_nr_ns()
    posix-cpu-timers: Replace __get_task_for_clock with pid_for_clock
    posix-cpu-timers: Replace cpu_timer_pid_type with clock_pid_type
    posix-cpu-timers: Extend rcu_read_lock removing task_struct references
    signal: Remove has_group_leader_pid
    exec: Remove BUG_ON(has_group_leader_pid)
    posix-cpu-timer: Unify the now redundant code in lookup_task
    posix-cpu-timer: Tidy up group_leader logic in lookup_task
    proc: Ensure we see the exit of each process tid exactly once
    rculist: Add hlists_swap_heads_rcu
    proc: Use PIDTYPE_TGID in next_tgid
    Use proc_pid_ns() to get pid_namespace from the proc superblock
    proc: use named enums for better readability
    proc: use human-readable values for hidepid
    docs: proc: add documentation for "hidepid=4" and "subset=pid" options and new mount behavior
    proc: add option to mount only a pids subset
    proc: instantiate only pids that we can ptrace on 'hidepid=4' mount option
    proc: allow to mount many instances of proc in one pid namespace
    proc: rename struct proc_fs_info to proc_fs_opts

    Linus Torvalds
     

29 Apr, 2020

1 commit

  • When the thread group leader changes during exec and the old leader's
    thread is reaped proc_flush_pid will flush the dentries for the entire
    process because the leader still has its original pid.

    Fix this by exchanging the pids in an rcu safe manner,
    and wrapping the code to do that up in a helper exchange_tids.

    When I removed switch_exec_pids and introduced this behavior
    in d73d65293e3e ("[PATCH] pidhash: kill switch_exec_pids") there
    really was nothing that cared as flushing happened with
    the cached dentry and de_thread flushed both of them on exec.

    This lack of fully exchanging pids became a problem a few months later
    when I introduced 48e6484d4902 ("[PATCH] proc: Rewrite the proc dentry
    flush on exit optimization"), which overlooked that the de_thread case
    was no longer swapping pids while I was looking up proc dentries
    by task->pid.

    The current behavior isn't properly a bug as everything in proc will
    continue to work correctly just a little bit less efficiently. Fix
    this just so there are no little surprise corner cases waiting to bite
    people.

    -- Oleg points out this could be an issue in next_tgid in proc where
    has_group_leader_pid is called, and reordering some of the assignments
    should fix that.

    -- Oleg points out this will break the 10 year old hack in __exit_signal.c
    > /*
    > * This can only happen if the caller is de_thread().
    > * FIXME: this is the temporary hack, we should teach
    > * posix-cpu-timers to handle this case correctly.
    > */
    > if (unlikely(has_group_leader_pid(tsk)))
    > posix_cpu_timers_exit_group(tsk);

    The code in next_tgid has been changed to use PIDTYPE_TGID,
    and the posix cpu timers code has been fixed so it does not
    need the 10 year old hack, so this should be safe to merge
    now.

    Link: https://lore.kernel.org/lkml/87h7x3ajll.fsf_-_@x220.int.ebiederm.org/
    Acked-by: Linus Torvalds
    Acked-by: Oleg Nesterov
    Fixes: 48e6484d4902 ("[PATCH] proc: Rewrite the proc dentry flush on exit optimization").
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

27 Apr, 2020

1 commit


10 Apr, 2020

1 commit

  • syzbot wrote:
    > ========================================================
    > WARNING: possible irq lock inversion dependency detected
    > 5.6.0-syzkaller #0 Not tainted
    > --------------------------------------------------------
    > swapper/1/0 just changed the state of lock:
    > ffffffff898090d8 (tasklist_lock){.+.?}-{2:2}, at: send_sigurg+0x9f/0x320 fs/fcntl.c:840
    > but this lock took another, SOFTIRQ-unsafe lock in the past:
    > (&pid->wait_pidfd){+.+.}-{2:2}
    >
    >
    > and interrupts could create inverse lock ordering between them.
    >
    >
    > other info that might help us debug this:
    > Possible interrupt unsafe locking scenario:
    >
    >        CPU0                    CPU1
    >        ----                    ----
    >   lock(&pid->wait_pidfd);
    >                                local_irq_disable();
    >                                lock(tasklist_lock);
    >                                lock(&pid->wait_pidfd);
    >   <Interrupt>
    >     lock(tasklist_lock);
    >
    > *** DEADLOCK ***
    >
    > 4 locks held by swapper/1/0:

    The problem is that wait_pidfd.lock is taken under the tasklist lock, so
    it must always be taken with irqs disabled: tasklist_lock can be taken
    from interrupt context, and if wait_pidfd.lock was already taken this
    would create a lock order inversion.

    Oleg suggested just disabling irqs where I have added extra calls to
    wait_pidfd.lock. That should be safe and I think the code will eventually
    do that. It was rightly pointed out by Christian that sharing the
    wait_pidfd.lock was a premature optimization.

    It is also true that my pre-merge window testing was insufficient. So
    remove the premature optimization and give struct pid a dedicated lock of
    its own for struct pid things. I have verified that lockdep sees all 3
    paths where we take the new pid->lock and lockdep does not complain.

    It is my current day dream that one day pid->lock can be used to guard the
    task lists as well and then the tasklist_lock won't need to be held to
    deliver signals. That will require taking pid->lock with irqs disabled.

    Acked-by: Christian Brauner
    Link: https://lore.kernel.org/lkml/00000000000011d66805a25cd73f@google.com/
    Cc: Oleg Nesterov
    Cc: Christian Brauner
    Reported-by: syzbot+343f75cdeea091340956@syzkaller.appspotmail.com
    Reported-by: syzbot+832aabf700bc3ec920b9@syzkaller.appspotmail.com
    Reported-by: syzbot+f675f964019f884dbd0f@syzkaller.appspotmail.com
    Reported-by: syzbot+a9fb1457d720a55d6dc5@syzkaller.appspotmail.com
    Fixes: 7bc3e6e55acf ("proc: Use a list of inodes to flush from proc")
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

25 Feb, 2020

1 commit

  • Rework the flushing of proc to use a list of directory inodes that
    need to be flushed.

    The list is kept on struct pid not on struct task_struct, as there is
    a fixed connection between proc inodes and pids but at least for the
    case of de_thread the pid of a task_struct changes.

    This removes the dependency on proc_mnt which allows for different
    mounts of proc having different mount options even in the same pid
    namespace and this allows for the removal of proc_mnt which will
    trivially allow the first mount of proc to honor its mount options.

    This flushing remains an optimization. The functions
    pid_delete_dentry and pid_revalidate ensure that ordinary dcache
    management will not attempt to use dentries past the point their
    respective task has died. When unused the shrinker will
    eventually be able to remove these dentries.

    There is a case in de_thread where proc_flush_pid can be
    called early for a given pid, which winds up being
    safe (if suboptimal) as this is just an optimization.

    Only pid directories are put on the list as the other
    per pid files are children of those directories and
    d_invalidate on the directory will get them as well.

    So that the pid can be used during flushing its reference count is
    taken in release_task and dropped in proc_flush_pid. Further the call
    of proc_flush_pid is moved after the tasklist_lock is released in
    release_task so that it is certain that the pid has already been
    unhashed when flushing is taking place. This removes a small race
    where a dentry could be recreated.

    As struct pid is supposed to be small and I need a per pid lock
    I reuse the only lock that currently exists in struct pid, the
    wait_pidfd.lock.

    The net result is that this adds all of this functionality
    with just a little extra list management overhead and
    a single extra pointer in struct pid.

    v2: Initialize pid->inodes. I somehow failed to get that
    initialization into the initial version of the patch. A boot
    failure was reported by "kernel test robot", and
    failure to initialize that pid->inodes matches all of the reported
    symptoms.

    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

16 Nov, 2019

1 commit

  • The main motivation to add set_tid to clone3() is CRIU.

    To restore a process with the same PID/TID CRIU currently uses
    /proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to
    ns_last_pid and then (quickly) does a clone(). This works most of the
    time, but it is racy. It is also slow as it requires multiple syscalls.

    Extending clone3() to support *set_tid makes it possible to restore a
    process using CRIU without accessing /proc/sys/kernel/ns_last_pid and
    race free (as long as the desired PID/TID is available).

    This clone3() extension places the same restrictions (CAP_SYS_ADMIN)
    on clone3() with *set_tid as they are currently in place for ns_last_pid.

    The original version of this change was using a single value for
    set_tid. At the 2019 LPC, after presenting set_tid, it was, however,
    decided to change set_tid to an array to enable setting the PID of a
    process in multiple PID namespaces at the same time. If a process is
    created in a PID namespace it is possible to influence the PID inside
    and outside of the PID namespace. Details also in the corresponding
    selftest.

    To create a process with the following PIDs:

    PID NS level         Requested PID
      0 (host)              31496
      1                        42
      2                         1

    For that example the two newly introduced parameters to struct
    clone_args (set_tid and set_tid_size) would need to be:

    set_tid[0] = 1;
    set_tid[1] = 42;
    set_tid[2] = 31496;
    set_tid_size = 3;

    If only the PIDs of the two innermost nested PID namespaces should be
    defined it would look like this:

    set_tid[0] = 1;
    set_tid[1] = 42;
    set_tid_size = 2;

    The PID of the newly created process would then be the next available
    free PID in the PID namespace level 0 (host) and 42 in the PID namespace
    at level 1 and the PID of the process in the innermost PID namespace
    would be 1.

    The set_tid array is used to specify the PID of a process starting
    from the innermost nested PID namespaces up to set_tid_size PID namespaces.

    set_tid_size cannot be larger than the current PID namespace level.
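
    The ordering of the set_tid array described above can be sketched in
    plain C. This is only an illustration: fill_set_tid() is a hypothetical
    helper, not part of the kernel or libc, and it assumes the caller keeps
    the requested PIDs in a table indexed by PID namespace level (index 0 is
    the host). It does not call clone3() itself.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * Hypothetical helper: copy the requested PIDs into set_tid[] in the order
 * described in the commit message, i.e. innermost PID namespace first.
 *
 * by_level      requested PIDs indexed by namespace level (0 = host)
 * nlevels       number of entries in by_level
 * set_tid       output array with room for set_tid_size entries
 * set_tid_size  how many of the innermost levels get a fixed PID
 *
 * Returns 0 on success, -1 if more levels are requested than exist.
 */
int fill_set_tid(const pid_t *by_level, size_t nlevels,
		 pid_t *set_tid, size_t set_tid_size)
{
	if (set_tid_size > nlevels)
		return -1;

	/* set_tid[0] is the innermost namespace, so walk by_level backwards. */
	for (size_t i = 0; i < set_tid_size; i++)
		set_tid[i] = by_level[nlevels - 1 - i];

	return 0;
}
```

    With the table { 31496, 42, 1 } and set_tid_size = 3 this yields
    set_tid[0] = 1, set_tid[1] = 42, set_tid[2] = 31496, matching the
    example above; with set_tid_size = 2 only the two innermost entries
    are filled.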

    Signed-off-by: Adrian Reber
    Reviewed-by: Christian Brauner
    Reviewed-by: Oleg Nesterov
    Reviewed-by: Dmitry Safonov
    Acked-by: Andrei Vagin
    Link: https://lore.kernel.org/r/20191115123621.142252-1-areber@redhat.com
    Signed-off-by: Christian Brauner

    Adrian Reber
     

17 Oct, 2019

1 commit

  • Currently, when a task is dead we still print the pid it used to use in
    the fdinfo files of its pidfds. This doesn't make much sense since the
    pid may have already been reused. So verify that the task is still alive
    by introducing the pid_has_task() helper which will be used by other
    callers in follow-up patches.
    If the task is not alive anymore, we will print -1. This allows us to
    differentiate between a task not being present in a given pid namespace
    - in which case we already print 0 - and a task having been reaped.
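
    A userspace reader could tell the three cases apart roughly as follows.
    This is a sketch under assumptions: classify_fdinfo() and the enum are
    hypothetical names, and the function works on text that the caller has
    already read from the pidfd's fdinfo file.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum pidfd_state {
	PIDFD_ALIVE,      /* Pid: > 0, task visible in our pid namespace */
	PIDFD_NOT_IN_NS,  /* Pid: 0, task not present in our pid namespace */
	PIDFD_REAPED,     /* Pid: -1, task has been reaped */
	PIDFD_UNKNOWN     /* no usable "Pid:" line found */
};

/* Hypothetical helper: classify the "Pid:" line of a pidfd's fdinfo text. */
enum pidfd_state classify_fdinfo(const char *fdinfo)
{
	const char *line = fdinfo;

	while (line && *line) {
		int pid;

		/* Only matches when the line starts with "Pid:". */
		if (sscanf(line, "Pid:\t%d", &pid) == 1) {
			if (pid > 0)
				return PIDFD_ALIVE;
			if (pid == 0)
				return PIDFD_NOT_IN_NS;
			if (pid == -1)
				return PIDFD_REAPED;
			return PIDFD_UNKNOWN;
		}

		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return PIDFD_UNKNOWN;
}
```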

    Note that this uses PIDTYPE_PID for the check. Technically, we could've
    checked PIDTYPE_TGID since pidfds currently only refer to thread-group
    leaders but if they won't anymore in the future then this check becomes
    problematic without it being immediately obvious to non-experts imho. If
    a thread is created via clone(CLONE_THREAD) then struct pid has a single
    non-empty list pid->tasks[PIDTYPE_PID] and this pid can't be used as a
    PIDTYPE_TGID meaning pid->tasks[PIDTYPE_TGID] will return NULL even
    though the thread-group leader might still be very much alive. So
    checking PIDTYPE_PID is fine and is easier to maintain should we ever
    allow pidfds to refer to threads.

    Cc: Jann Horn
    Cc: Christian Kellner
    Cc: linux-api@vger.kernel.org
    Signed-off-by: Christian Brauner
    Reviewed-by: Oleg Nesterov
    Link: https://lore.kernel.org/r/20191017101832.5985-1-christian.brauner@ubuntu.com

    Christian Brauner
     

02 Aug, 2019

1 commit

  • This adds the P_PIDFD type to waitid().
    One of the last remaining bits for the pidfd api is to make it possible
    to wait on pidfds. With P_PIDFD added to waitid() the parts of userspace
    that want to use the pidfd api to exclusively manage processes can do so
    now.

    One of the things this will unblock in the future is the ability to make
    it possible to retrieve the exit status via waitid(P_PIDFD) for
    non-parent processes if handed a _suitable_ pidfd that has this feature
    set. This is similar to what you can do on FreeBSD with kqueue(). It
    might even end up being possible to wait on a process as a non-parent if
    an appropriate property is enabled on the pidfd.

    With P_PIDFD no scoping of the process identified by the pidfd is
    possible, i.e. it explicitly blocks things such as wait4(-1), wait4(0),
    waitid(P_ALL), waitid(P_PGID) etc. It only allows for semantics
    equivalent to wait4(pid), waitid(P_PID). Users that need scoping should
    rely on pid-based wait*() syscalls for now.
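
    A minimal userspace sketch of waiting on a child through a pidfd is
    shown below. Assumptions: the pidfd is obtained with the pidfd_open()
    syscall (which is not part of this commit), the fallback syscall number
    434 applies to the running architecture, and reap_child_via_pidfd() is
    a hypothetical name chosen for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434	/* assumption: generic syscall number */
#endif
#ifndef P_PIDFD
#define P_PIDFD 3		/* value from the kernel's uapi linux/wait.h */
#endif

/*
 * Fork a child that exits with exit_code, then reap it via
 * waitid(P_PIDFD). Returns the child's exit status, or -1 on error.
 */
int reap_child_via_pidfd(int exit_code)
{
	siginfo_t info;
	pid_t pid;
	int pidfd;

	memset(&info, 0, sizeof(info));

	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0)
		_exit(exit_code);

	/* The child is not reaped yet, so its pid cannot have been reused. */
	pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
	if (pidfd < 0) {
		waitpid(pid, NULL, 0);
		return -1;
	}

	/* Same semantics as waitid(P_PID, pid, ...), but keyed by the fd. */
	if (waitid(P_PIDFD, (id_t)pidfd, &info, WEXITED) < 0) {
		close(pidfd);
		return -1;
	}
	close(pidfd);

	return info.si_code == CLD_EXITED ? info.si_status : -1;
}
```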

    Signed-off-by: Christian Brauner
    Reviewed-by: Kees Cook
    Reviewed-by: Oleg Nesterov
    Cc: Arnd Bergmann
    Cc: "Eric W. Biederman"
    Cc: Joel Fernandes (Google)
    Cc: Thomas Gleixner
    Cc: David Howells
    Cc: Jann Horn
    Cc: Andy Lutomirski
    Cc: Andrew Morton
    Cc: Aleksa Sarai
    Cc: Linus Torvalds
    Cc: Al Viro
    Link: https://lore.kernel.org/r/20190727222229.6516-2-christian@brauner.io

    Christian Brauner
     

17 Jul, 2019

1 commit

  • struct pid's count is an atomic_t field used as a refcount. Use
    refcount_t for it which is basically atomic_t but does additional
    checking to prevent use-after-free bugs.

    For memory ordering, the only change is with the following:

    - if ((atomic_read(&pid->count) == 1) ||
    - atomic_dec_and_test(&pid->count)) {
    + if (refcount_dec_and_test(&pid->count)) {
    kmem_cache_free(ns->pid_cachep, pid);

    Here the change is from: Fully ordered --> RELEASE + ACQUIRE (as per
    refcount-vs-atomic.rst). This ACQUIRE should take care of making sure the
    free happens after the refcount_dec_and_test().

    The above hunk also removes atomic_read() since it is not needed for the
    code to work and it is unclear how beneficial it is. The removal lets
    refcount_dec_and_test() check for cases where get_pid() happened before
    the object was freed.
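
    The RELEASE + ACQUIRE pattern mentioned above can be illustrated in
    userspace with C11 atomics. This is an analogy only: struct object and
    its helpers are hypothetical, and the sketch omits the extra checking
    that the kernel's refcount_t performs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

struct object {
	atomic_uint count;	/* reference count, starts at 1 */
	int payload;
};

struct object *object_new(int payload)
{
	struct object *obj = malloc(sizeof(*obj));

	if (!obj)
		return NULL;
	atomic_init(&obj->count, 1);
	obj->payload = payload;
	return obj;
}

/* Take an additional reference; the caller must already hold one. */
void object_get(struct object *obj)
{
	atomic_fetch_add_explicit(&obj->count, 1, memory_order_relaxed);
}

/*
 * Drop a reference. Returns true if this call dropped the last one and
 * freed the object.
 */
bool object_put(struct object *obj)
{
	/* RELEASE: our earlier writes are visible before the count drops. */
	if (atomic_fetch_sub_explicit(&obj->count, 1,
				      memory_order_release) != 1)
		return false;

	/* ACQUIRE: the free is ordered after every other put's writes. */
	atomic_thread_fence(memory_order_acquire);
	free(obj);
	return true;
}
```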

    Link: http://lkml.kernel.org/r/20190701183826.191936-1-joel@joelfernandes.org
    Signed-off-by: Joel Fernandes (Google)
    Reviewed-by: Andrea Parri
    Reviewed-by: Kees Cook
    Cc: Mathieu Desnoyers
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Will Deacon
    Cc: Paul E. McKenney
    Cc: Elena Reshetova
    Cc: Jann Horn
    Cc: Eric W. Biederman
    Cc: KJ Tsanaktsidis
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joel Fernandes (Google)
     

28 Jun, 2019

1 commit

  • This patch adds polling support to pidfd.

    Android low memory killer (LMK) needs to know when a process dies once
    it is sent the kill signal. It does so by checking for the existence of
    /proc/pid which is both racy and slow. For example, if a PID is reused
    between when LMK sends a kill signal and when it checks for existence of
    the PID, the wrong process is possibly checked for existence.
    Using the polling support, LMK will be able to get notified when a process
    exits in a race-free and fast way, and the LMK can do other things
    (such as polling on other fds) while waiting for the process being killed
    to die.
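
    From userspace, waiting for a killed process with poll() might look
    like the sketch below. Assumptions: the pidfd comes from the later
    pidfd_open() syscall rather than from this patch, the fallback syscall
    number 434 applies to the running architecture, and
    kill_and_wait_via_poll() is a hypothetical name.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <poll.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434	/* assumption: generic syscall number */
#endif

/*
 * Fork a child, kill it, and wait for its death by polling a pidfd.
 * Returns 1 if poll() reported the pidfd readable, 0 on timeout,
 * -1 on error.
 */
int kill_and_wait_via_poll(int timeout_ms)
{
	struct pollfd pfd;
	pid_t pid;
	int pidfd, ret;

	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		for (;;)
			pause();	/* child idles until it is killed */
	}

	pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
	if (pidfd < 0) {
		kill(pid, SIGKILL);
		waitpid(pid, NULL, 0);
		return -1;
	}

	/* Safe here because pid is our own, still unreaped, child. */
	kill(pid, SIGKILL);

	/* The pidfd becomes readable once the process has exited. */
	pfd.fd = pidfd;
	pfd.events = POLLIN;
	pfd.revents = 0;
	ret = poll(&pfd, 1, timeout_ms);

	waitpid(pid, NULL, 0);	/* reap the zombie */
	close(pidfd);

	if (ret < 0)
		return -1;
	return (ret == 1 && (pfd.revents & POLLIN)) ? 1 : 0;
}
```

    Because the pidfd is an ordinary file descriptor, it can sit in the
    same poll set as other fds, which is the point made above about doing
    other work while waiting.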

    For notification to polling processes, we follow the same existing
    mechanism in the kernel used when the parent of the task group is to be
    notified of a child's death (do_notify_parent). This is precisely when the
    tasks waiting on a poll of pidfd are also awakened in this patch.

    We have decided to include the waitqueue in struct pid for the following
    reasons:
    1. The wait queue has to survive for the lifetime of the poll. Including
    it in task_struct would not be an option in this case because the task can
    be reaped and destroyed before the poll returns.

    2. Including the waitqueue in struct pid means that during
    de_thread(), the new thread group leader automatically gets the new
    waitqueue/pid even though its task_struct is different.

    Appropriate test cases are added in the second patch to provide coverage of
    all the cases the patch is handling.

    Cc: Andy Lutomirski
    Cc: Steven Rostedt
    Cc: Daniel Colascione
    Cc: Jann Horn
    Cc: Tim Murray
    Cc: Jonathan Kowalski
    Cc: Linus Torvalds
    Cc: Al Viro
    Cc: Kees Cook
    Cc: David Howells
    Cc: Oleg Nesterov
    Cc: kernel-team@android.com
    Reviewed-by: Oleg Nesterov
    Co-developed-by: Daniel Colascione
    Signed-off-by: Daniel Colascione
    Signed-off-by: Joel Fernandes (Google)
    Signed-off-by: Christian Brauner

    Joel Fernandes (Google)
     

07 May, 2019

1 commit

  • This patchset makes it possible to retrieve pid file descriptors at
    process creation time by introducing the new flag CLONE_PIDFD to the
    clone() system call. Linus originally suggested to implement this as a
    new flag to clone() instead of making it a separate system call. As
    spotted by Linus, there is exactly one bit for clone() left.

    CLONE_PIDFD creates file descriptors based on the anonymous inode
    implementation in the kernel that will also be used to implement the new
    mount api. They serve as a simple opaque handle on pids. Logically,
    this makes it possible to interpret a pidfd differently, narrowing or
    widening the scope of various operations (e.g. signal sending). Thus, a
    pidfd cannot just refer to a tgid, but also a tid, or in theory - given
    appropriate flag arguments in relevant syscalls - a process group or
    session. A pidfd does not represent a privilege. This does not imply it
    cannot ever be that way but for now this is not the case.

    A pidfd comes with additional information in fdinfo if the kernel supports
    procfs. The fdinfo file contains the pid of the process in the caller's
    pid namespace in the same format as the procfs status file, i.e. "Pid:\t%d".

    As suggested by Oleg, with CLONE_PIDFD the pidfd is returned in the
    parent_tidptr argument of clone. This has the advantage that we can
    give back the associated pid and the pidfd at the same time.

    To remove worries about missing metadata access this patchset comes with
    a sample program that illustrates how a combination of CLONE_PIDFD, and
    pidfd_send_signal() can be used to gain race-free access to process
    metadata through /proc/. The sample program can easily be
    translated into a helper that would be suitable for inclusion in libc so
    that users don't have to worry about writing it themselves.

    Suggested-by: Linus Torvalds
    Signed-off-by: Christian Brauner
    Co-developed-by: Jann Horn
    Signed-off-by: Jann Horn
    Reviewed-by: Oleg Nesterov
    Cc: Arnd Bergmann
    Cc: "Eric W. Biederman"
    Cc: Kees Cook
    Cc: Thomas Gleixner
    Cc: David Howells
    Cc: "Michael Kerrisk (man-pages)"
    Cc: Andy Lutomirski
    Cc: Andrew Morton
    Cc: Aleksa Sarai
    Cc: Linus Torvalds
    Cc: Al Viro

    Christian Brauner
     

08 Mar, 2019

1 commit

  • Commit 95846ecf9dac ("pid: replace pid bitmap implementation with IDR
    API") removed next_pidmap() but left its declaration.

    Remove it. No functional change.

    Link: http://lkml.kernel.org/r/20190213113736.21922-1-namit@vmware.com
    Signed-off-by: Nadav Amit
    Cc: "Eric W. Biederman"
    Cc: Gargi Sharma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nadav Amit
     

21 Jul, 2018

2 commits

  • Everywhere except in the pid array we distinguish between a task's pid and
    a task's tgid (thread group id). Even in the enumeration we want that
    distinction sometimes so we have added __PIDTYPE_TGID. With leader_pid
    we almost have an implementation of PIDTYPE_TGID in struct signal_struct.

    Add PIDTYPE_TGID as a first class member of the pid_type enumeration and
    into the pids array. Then remove the __PIDTYPE_TGID special case and the
    leader_pid in signal_struct.

    The net size increase is just an extra pointer added to struct pid and
    an extra pair of pointers of an hlist_node added to task_struct.

    The effect on code maintenance is the removal of a number of special
    cases today and the potential to remove many more special cases as
    PIDTYPE_TGID gets used to its fullest. The long term potential
    is allowing zombie thread group leaders to exit, which will remove
    a lot more special cases in the code.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • To access these fields the code always has to go to group leader so
    going to signal struct is no loss and is actually a fundamental simplification.

    This saves a little bit of memory by only allocating the pid pointer array
    once instead of once for every thread, and even better this removes a
    few potential races caused by the fact that group_leader can be changed
    by de_thread, while signal_struct can not.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

18 Nov, 2017

1 commit

  • pidhash is no longer required as all the information can be looked up
    from idr tree. nr_hashed represented the number of pids that had been
    hashed. Since nr_hashed and PIDNS_HASH_ADDING are no longer relevant,
    they have been renamed to pid_allocated and PIDNS_ADDING respectively.

    [gs051095@gmail.com: v6]
    Link: http://lkml.kernel.org/r/1507760379-21662-3-git-send-email-gs051095@gmail.com
    Link: http://lkml.kernel.org/r/1507583624-22146-3-git-send-email-gs051095@gmail.com
    Signed-off-by: Gargi Sharma
    Reviewed-by: Rik van Riel
    Tested-by: Tony Luck [ia64]
    Cc: Julia Lawall
    Cc: Ingo Molnar
    Cc: Pavel Tatashin
    Cc: Kirill Tkhai
    Cc: Oleg Nesterov
    Cc: Eric W. Biederman
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gargi Sharma
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

22 Aug, 2017

1 commit

  • This was reported many times, and this was even mentioned in commit
    52ee2dfdd4f5 ("pids: refactor vnr/nr_ns helpers to make them safe") but
    somehow nobody bothered to fix the obvious problem: task_tgid_nr_ns() is
    not safe because task->group_leader points to nowhere after the exiting
    task passes exit_notify(), rcu_read_lock() can not help.

    We really need to change __unhash_process() to nullify group_leader,
    parent, and real_parent, but this needs some cleanups. Until then we
    can turn task_tgid_nr_ns() into another user of __task_pid_nr_ns() and
    fix the problem.

    Reported-by: Troy Kensinger
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

02 Mar, 2017

1 commit


28 Feb, 2017

1 commit

  • while_each_pid_thread() is using while_each_thread(), which is unsafe
    under RCU lock according to commit 0c740d0afc3b ("introduce
    for_each_thread() to replace the buggy while_each_thread()"). Use
    for_each_thread() in do_each_pid_thread() which is safe under RCU lock.

    Link: http://lkml.kernel.org/r/201702011947.DBD56740.OMVHOLOtSJFFFQ@I-love.SAKURA.ne.jp
    Link: http://lkml.kernel.org/r/1486041779-4401-2-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     

04 Jul, 2013

1 commit

  • copy_process() adds the new child to thread_group/init_task.tasks list and
    then does attach_pid(child, PIDTYPE_PID). This means that the lockless
    next_thread() or next_task() can see this thread with the wrong pid. Say,
    "ls /proc/pid/task" can list the same inode twice.

    We could move attach_pid(child, PIDTYPE_PID) up, but in this case
    find_task_by_vpid() can find the new thread before it was fully
    initialized.

    And this is already true for PIDTYPE_PGID/PIDTYPE_SID. With this patch
    copy_process() initializes child->pids[*].pid first, then calls
    attach_pid() to insert the task into the pid->tasks list.

    attach_pid() no longer needs the "struct pid*" argument, it is always
    called after pid_link->pid was already set.

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Michal Hocko
    Cc: Pavel Emelyanov
    Cc: Sergey Dyasly
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

28 Feb, 2013

1 commit

  • I'm not sure why, but the hlist for each entry iterators were conceived
    differently from the list ones, which look like this:

    list_for_each_entry(pos, head, member)

    The hlist ones were greedy and wanted an extra parameter:

    hlist_for_each_entry(tpos, pos, head, member)

    Why did they need an extra pos parameter? I'm not quite sure. Not only
    do they not really need it, it also prevents the iterator from looking
    exactly like the list iterator, which is unfortunate.
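
    To show why the extra cursor is unnecessary, here is a simplified
    userspace sketch of a three-argument hlist iterator. It is not the
    kernel's implementation: the node type is reduced to a single next
    pointer, struct item and sum_items() are hypothetical, and the macro
    relies on the GNU C __typeof__ extension.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified node: the kernel's hlist_node also carries a pprev pointer. */
struct hlist_node { struct hlist_node *next; };
struct hlist_head { struct hlist_node *first; };

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Map a possibly NULL node pointer to its containing entry, or NULL. */
#define hlist_entry_safe(ptr, type, member) \
	((ptr) ? container_of(ptr, type, member) : NULL)

/* Three-argument form: pos is the entry itself, no separate node cursor. */
#define hlist_for_each_entry(pos, head, member)				\
	for (pos = hlist_entry_safe((head)->first,			\
				    __typeof__(*(pos)), member);	\
	     pos;							\
	     pos = hlist_entry_safe((pos)->member.next,			\
				    __typeof__(*(pos)), member))

/* Hypothetical payload type used only for this illustration. */
struct item {
	int value;
	struct hlist_node link;
};

void hlist_add_head(struct hlist_node *n, struct hlist_head *h)
{
	n->next = h->first;
	h->first = n;
}

int sum_items(struct hlist_head *head)
{
	struct item *it;
	int sum = 0;

	hlist_for_each_entry(it, head, link)
		sum += it->value;
	return sum;
}
```

    The next node is always reachable from the entry through
    pos->member.next, so the loop needs no second variable.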

    Besides the semantic patch, there was some manual work required:

    - Fix up the actual hlist iterators in linux/list.h
    - Fix up the declaration of other iterators based on the hlist ones.
    - A very small number of places were using the 'node' parameter; these
    were modified to use 'obj->member' instead.
    - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
    properly, so those had to be fixed up manually.

    The semantic patch which is mostly the work of Peter Senna Tschudin is here:

    @@
    iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

    type T;
    expression a,c,d,e;
    identifier b;
    statement S;
    @@

    -T b;

    [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
    [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
    [akpm@linux-foundation.org: checkpatch fixes]
    [akpm@linux-foundation.org: fix warnings]
    [akpm@linux-foudnation.org: redo intrusive kvm changes]
    Tested-by: Peter Senna Tschudin
    Acked-by: Paul E. McKenney
    Signed-off-by: Sasha Levin
    Cc: Wu Fengguang
    Cc: Marcelo Tosatti
    Cc: Gleb Natapov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

26 Dec, 2012

1 commit

  • Oleg pointed out that in a pid namespace the sequence:
    - pid 1 becomes a zombie
    - setns(thepidns), fork,...
    - reaping pid 1.
    - The injected processes exiting.

    can lead to processes attempting to access their child reaper and
    instead following a stale pointer.

    That waitpid for init can return before all of the processes in
    the pid namespace have exited is also unfortunate.

    Avoid these problems by disabling the allocation of new pids in a pid
    namespace when init dies, instead of when the last process in a pid
    namespace is reaped.

    Pointed-out-by: Oleg Nesterov
    Reviewed-by: Oleg Nesterov
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

27 May, 2011

1 commit


19 Apr, 2011

1 commit

  • next_pidmap() just quietly accepted whatever 'last' pid that was passed
    in, which is not all that safe when one of the users is /proc.

    Admittedly the proc code should do some sanity checking on the range
    (and that will be the next commit), but that doesn't mean that the
    helper functions should just do that pidmap pointer arithmetic without
    checking the range of its arguments.

    So clamp 'last' to PID_MAX_LIMIT. The fact that we then do "last+1"
    doesn't really matter, the for-loop does check against the end of the
    pidmap array properly (it's only the actual pointer arithmetic overflow
    case we need to worry about, and going one bit beyond isn't going to
    overflow).
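
    The range check described above can be sketched in userspace as
    follows. The small limit, the flat bitmap, and the names
    SKETCH_PID_MAX_LIMIT, mark_pid and next_set_pid are assumptions made
    for illustration; this is not the kernel's pidmap code.

```c
#include <assert.h>
#include <limits.h>

#define SKETCH_PID_MAX_LIMIT 4096u  /* assumed small limit for the sketch */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define MAP_WORDS (SKETCH_PID_MAX_LIMIT / BITS_PER_WORD)

/* One bit per pid number; a set bit means "in use". */
static unsigned long pidmap[MAP_WORDS];

static void mark_pid(unsigned int nr)
{
	assert(nr < SKETCH_PID_MAX_LIMIT);
	pidmap[nr / BITS_PER_WORD] |= 1UL << (nr % BITS_PER_WORD);
}

/* Return the next in-use pid strictly after 'last', or -1 if none. */
static int next_set_pid(unsigned int last)
{
	unsigned int nr;

	/* Check the range first so no index is derived from a bogus value. */
	if (last >= SKETCH_PID_MAX_LIMIT)
		return -1;

	/* 'last + 1' may equal the limit; the loop bound handles that case. */
	for (nr = last + 1; nr < SKETCH_PID_MAX_LIMIT; nr++)
		if (pidmap[nr / BITS_PER_WORD] & (1UL << (nr % BITS_PER_WORD)))
			return (int)nr;
	return -1;
}
```

    A caller such as a /proc-style iterator can then pass an arbitrary
    user-controlled value as 'last' and get -1 back instead of an
    out-of-range access.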

    [ Use PID_MAX_LIMIT rather than pid_max as per Eric Biederman ]

    Reported-by: Tavis Ormandy
    Analyzed-by: Robert Święcki
    Cc: Eric W. Biederman
    Cc: Pavel Emelyanov
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

31 Mar, 2011

1 commit


24 Mar, 2011

1 commit

  • This patchset is a cleanup and a preparation to unshare the pid namespace.
    These prerequisites prepare for Eric's patchset to give a file descriptor
    to a namespace and join an existing namespace.

    This patch:

    It turns out that the existing assignment of child_reaper in
    copy_process can handle the initial assignment of child_reaper; we just
    need to generalize the test in kernel/fork.c.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Daniel Lezcano
    Cc: Oleg Nesterov
    Cc: Alexey Dobriyan
    Acked-by: Serge E. Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

09 Jan, 2009

1 commit

  • A current problem with the pid namespace is that it is easy to do pid
    related work after exit_task_namespaces which drops the nsproxy pointer.

    However if we are doing pid namespace related work we are always operating
    on some struct pid which retains the pid_namespace pointer of the pid
    namespace it was allocated in.

    So provide ns_of_pid which allows us to find the pid namespace a pid was
    allocated in.

    Using this we have the needed infrastructure to do pid namespace related
    work at any time we have a struct pid, removing the chance of accidentally
    having a NULL pointer dereference when accessing current->nsproxy.
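
    A simplified sketch of such a helper is shown below. The struct
    layouts are cut down for illustration (the fixed-size numbers[] array
    and SKETCH_MAX_LEVEL are assumptions); only the idea of reading the
    namespace from the deepest level of the struct pid comes from the
    description above.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_LEVEL 4  /* assumed bound, for the sketch only */

struct pid_namespace {
	unsigned int level;            /* nesting depth, 0 for the initial ns */
};

struct upid {
	int nr;                        /* pid value as seen in 'ns' */
	struct pid_namespace *ns;
};

struct pid {
	unsigned int level;            /* depth of the ns the pid was allocated in */
	struct upid numbers[SKETCH_MAX_LEVEL];
};

/* Find the namespace a struct pid was allocated in, without nsproxy. */
static struct pid_namespace *ns_of_pid(struct pid *pid)
{
	struct pid_namespace *ns = NULL;

	if (pid) {
		assert(pid->level < SKETCH_MAX_LEVEL);
		ns = pid->numbers[pid->level].ns;
	}
	return ns;
}
```

    Because the lookup depends only on the struct pid, it keeps working
    after the task has dropped its nsproxy pointer.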

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Sukadev Bhattiprolu
    Cc: Oleg Nesterov
    Cc: Roland McGrath
    Cc: Bastian Blank
    Cc: Pavel Emelyanov
    Cc: Nadia Derbey
    Acked-by: Serge Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

04 Dec, 2008

1 commit

  • Impact: macro side-effects fix

    This patch adds parentheses around 'pid' in the do_each_pid_task
    macro to allow callers to pass in more complex parameters.

    e.g. do_each_pid_task(*pid, type, task)
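
    The precedence problem can be reproduced with a much smaller macro.
    The names struct item, ITEM_VALUE_BAD, ITEM_VALUE and read_through are
    hypothetical and are not taken from the kernel.

```c
#include <assert.h>
#include <stddef.h>

struct item {
	int value;
};

/* Unparenthesized argument: '->' binds tighter than a leading '*'. */
#define ITEM_VALUE_BAD(p)  (p->value)

/* Parenthesized argument: the whole expression is evaluated first. */
#define ITEM_VALUE(p)      ((p)->value)

static int read_through(struct item **pp)
{
	assert(pp != NULL && *pp != NULL);
	/*
	 * ITEM_VALUE_BAD(*pp) would expand to (*pp->value), i.e.
	 * *(pp->value), which does not compile because pp is a
	 * pointer to a pointer. The parenthesized form works.
	 */
	return ITEM_VALUE(*pp);
}
```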

    Signed-off-by: Steven Rostedt
    Signed-off-by: Ingo Molnar

    Steven Rostedt
     

21 Aug, 2008

1 commit

  • When a user calls sys_setpriority(PRIO_PGRP ...) on an NPTL style multi-LWP
    process, only the task leader of the process is affected; the other
    sibling LWP threads do not receive the setting. The problem is that the
    iterator used in sys_setpriority() only iterates over one task for each
    process, ignoring all other sibling threads.

    Introduce a new macro do_each_pid_thread / while_each_pid_thread to walk
    each thread of a process. Convert 4 call sites in {set/get}priority and
    ioprio_{set/get}.
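
    The difference between the two walks can be sketched with plain
    loops instead of macros. The struct task layout and both function
    names are assumptions for illustration; the kernel's macros and list
    types are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

struct task {
	int nice;
	struct task *next_thread;    /* circular list of threads in one process */
	struct task *next_in_group;  /* next process leader in the group, or NULL */
};

/* Old behaviour: visits one task (the leader) per process. */
static void set_nice_leaders_only(struct task *head, int nice)
{
	struct task *leader;

	for (leader = head; leader; leader = leader->next_in_group)
		leader->nice = nice;
}

/* New behaviour: for each leader, also walk every sibling thread. */
static void set_nice_all_threads(struct task *head, int nice)
{
	struct task *leader;

	for (leader = head; leader; leader = leader->next_in_group) {
		struct task *t = leader;

		do {
			assert(t->next_thread != NULL);
			t->nice = nice;
			t = t->next_thread;
		} while (t != leader);
	}
}
```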

    Signed-off-by: Ken Chen
    Cc: Oleg Nesterov
    Cc: Roland McGrath
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ken Chen
     

26 Jul, 2008

3 commits

  • It seems to me that it was a mistake marking this function as deprecated
    and scheduling it for removal, rather than resolutely removing it after
    the last caller's death.

    Anyway - better late than never.

    Signed-off-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • This one had only one user so far, kill_proc, which has been removed, so
    drop this (invalid in a namespaced world) call too.

    And of course, erase all references to it from comments.

    Signed-off-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • When struct pid is built on a 64 bit platform gcc has to insert padding to
    maintain the correct alignment. Simply reordering its members shrinks
    the memory usage from 88 bytes to 80.
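
    The effect of member ordering on padding can be seen with two toy
    structs. These are not the layout of struct pid; the member names and
    types are assumptions chosen only to show the mechanism.

```c
#include <assert.h>
#include <stdio.h>

/* Each int is followed by a pointer, so a 64 bit ABI pads after each int. */
struct padded {
	int count;
	void *first;
	int level;
	void *second;
};

/* Same members, with the two ints placed next to each other. */
struct reordered {
	void *first;
	void *second;
	int count;
	int level;
};

/* Bytes saved by reordering on the ABI this is compiled for. */
static long bytes_saved(void)
{
	/* Holds on common ABIs; the C standard itself does not promise it. */
	assert(sizeof(struct reordered) <= sizeof(struct padded));
	return (long)sizeof(struct padded) - (long)sizeof(struct reordered);
}

static void print_sizes(void)
{
	printf("padded: %zu, reordered: %zu, saved: %ld\n",
	       sizeof(struct padded), sizeof(struct reordered), bytes_saved());
}
```

    On a typical LP64 target the first struct needs padding after each
    int while the second does not; on a 32 bit target both are usually
    the same size.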

    I've successfully run with this patch on my desktop AMD64 machine.

    There are no significant kernel size changes to a default config.X86_64
    on the latest git v2.6.26-rc1

       text    data     bss     dec     hex filename
    5404828  976760  734280 7115868  6c945c vmlinux
    5404811  976760  734280 7115851  6c944b vmlinux.pid-patch

    Acked-by: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Richard Kennedy
     

30 Apr, 2008

2 commits

  • These values represent the nesting level of a namespace and of the pids living
    in it, and they are always non-negative.

    Turning this from int to unsigned int saves some space in pid.c (11 bytes on
    x86 and 64 on ia64) by letting the compiler optimize the pid_nr_ns a bit.
    E.g. on ia64 this removes the sign extension calls, which the compiler adds to
    optimize access to pid->numbers[ns->level].

    Signed-off-by: Pavel Emelyanov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • Based on Eric W. Biederman's idea.

    Without tasklist_lock held task_session()/task_pgrp() can return NULL if the
    caller races with setpgrp()/setsid() which does detach_pid() + attach_pid().
    This can happen even if task == current.

    Introduce the new helper, change_pid(), which should be used instead. This way
    the caller always sees the special pid != NULL, either old or new.

    Also change the prototype of attach_pid(); it always returns 0 and nobody
    checks the returned value.

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

14 Feb, 2008

1 commit


09 Feb, 2008

3 commits

  • There is a window when de_thread() switches the leader and drops
    tasklist_lock. In that window do_each_pid_task(PIDTYPE_PID) finds both new
    and old leaders.

    The problem is pretty much theoretical and probably can be ignored. Currently
    the only users of do_each_pid_task(PIDTYPE_PID) are send_sigio/send_sigurg, so
    they can send the signal to the same process twice.

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Davide Libenzi
    Cc: Pavel Emelyanov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • pid_vnr returns the user space pid with respect to the pid namespace the
    struct pid was allocated in. What we want before we return a pid to user
    space is the user space pid with respect to the pid namespace of current.

    pid_vnr is a very nice optimization but because it isn't quite what we want
    it is easy to use pid_vnr at times when we aren't certain the struct pid
    was allocated in our pid namespace.

    Currently this describes at least tiocgpgrp and tiocgsid in ttyio.c, the
    parent process reported in the core dumps, and the parent process in
    get_signal_to_deliver.

    So unless the performance impact is huge, having an interface that does what
    we want instead of almost what we want should be much more reliable and
    much less error prone.
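
    The namespace-relative lookup discussed above can be sketched as
    follows. The struct layouts are simplified (the fixed-size numbers[]
    array and SKETCH_MAX_LEVEL are assumptions), and the caller passes
    the viewing namespace explicitly instead of reading it from current.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_LEVEL 4  /* assumed bound, for the sketch only */

struct pid_namespace {
	unsigned int level;            /* nesting depth, 0 for the initial ns */
};

struct upid {
	int nr;                        /* pid value as seen in 'ns' */
	struct pid_namespace *ns;
};

struct pid {
	unsigned int level;            /* depth of the ns the pid was allocated in */
	struct upid numbers[SKETCH_MAX_LEVEL];
};

/* Return the pid value as seen from 'ns', or 0 if it is not visible there. */
static int pid_nr_ns(const struct pid *pid, const struct pid_namespace *ns)
{
	assert(ns != NULL);
	if (pid && ns->level <= pid->level) {
		const struct upid *upid = &pid->numbers[ns->level];

		/* Same depth is not enough: it must be the same namespace. */
		if (upid->ns == ns)
			return upid->nr;
	}
	return 0;
}
```

    Passing the namespace of the task that will receive the value, rather
    than the namespace the struct pid was allocated in, gives the answer
    user space expects; an unrelated namespace yields 0.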

    Signed-off-by: Eric W. Biederman
    Cc: Oleg Nesterov
    Acked-by: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • Just like with the user namespaces, move the namespace management code into
    the separate .c file and mark the (already existing) PID_NS option as "depend
    on NAMESPACES"

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Pavel Emelyanov
    Acked-by: Serge Hallyn
    Cc: Cedric Le Goater
    Cc: "Eric W. Biederman"
    Cc: Herbert Poetzl
    Cc: Kirill Korotaev
    Cc: Sukadev Bhattiprolu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     

20 Oct, 2007

1 commit

  • The find_pid/_vpid/_pid_ns functions are used to find the struct pid by its
    id, depending on which id - global or virtual - is used.

    The find_vpid() is a macro that pushes the current->nsproxy->pid_ns on the
    stack to call another function - find_pid_ns(). It turned out, that this
    dereference together with the push itself cause the kernel text size to
    grow too much.

    Move all these out-of-line. Together with the previous patch this saves a
    bit less than 400 bytes from .text section.

    Signed-off-by: Pavel Emelyanov
    Cc: Sukadev Bhattiprolu
    Cc: Oleg Nesterov
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov