17 Jul, 2019

1 commit

  • struct pid's count is an atomic_t field used as a refcount. Use
    refcount_t for it which is basically atomic_t but does additional
    checking to prevent use-after-free bugs.

    For memory ordering, the only change is with the following:

    -   if ((atomic_read(&pid->count) == 1) ||
    -        atomic_dec_and_test(&pid->count)) {
    +   if (refcount_dec_and_test(&pid->count)) {
            kmem_cache_free(ns->pid_cachep, pid);

    Here the change is from: Fully ordered --> RELEASE + ACQUIRE (as per
    refcount-vs-atomic.rst). This ACQUIRE should take care of making sure the
    free happens after the refcount_dec_and_test().

    The above hunk also removes atomic_read() since it is not needed for the
    code to work and it is unclear how beneficial it is. The removal lets
    refcount_dec_and_test() check for cases where get_pid() happened before
    the object was freed.

    Link: http://lkml.kernel.org/r/20190701183826.191936-1-joel@joelfernandes.org
    Signed-off-by: Joel Fernandes (Google)
    Reviewed-by: Andrea Parri
    Reviewed-by: Kees Cook
    Cc: Mathieu Desnoyers
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Will Deacon
    Cc: Paul E. McKenney
    Cc: Elena Reshetova
    Cc: Jann Horn
    Cc: Eric W. Biederman
    Cc: KJ Tsanaktsidis
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joel Fernandes (Google)
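
The ordering argument in the commit above can be illustrated outside the kernel. The following is a minimal userspace sketch using C11 atomics, not the kernel's refcount_t implementation: `struct toy_pid`, `toy_pid_alloc()`, `toy_get_pid()` and `toy_put_pid()` are hypothetical names, and the saturation and warning checks that refcount_t adds are left out. It only shows a decrement with RELEASE ordering followed by an ACQUIRE before the free.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct pid; only the refcount matters here. */
struct toy_pid {
    atomic_int count;
    int nr;
};

/* Allocate an object holding one reference for the caller. */
static struct toy_pid *toy_pid_alloc(int nr)
{
    struct toy_pid *p = malloc(sizeof(*p));

    if (!p)
        return NULL;
    atomic_init(&p->count, 1);
    p->nr = nr;
    return p;
}

/* Take an extra reference; the increment needs no ordering of its own. */
static void toy_get_pid(struct toy_pid *p)
{
    atomic_fetch_add_explicit(&p->count, 1, memory_order_relaxed);
}

/* Drop a reference. Returns true if this call freed the object. */
static bool toy_put_pid(struct toy_pid *p)
{
    /* RELEASE: our earlier accesses to *p happen before the decrement. */
    if (atomic_fetch_sub_explicit(&p->count, 1, memory_order_release) != 1)
        return false;

    /* ACQUIRE: the free below happens after all other decrements. */
    atomic_thread_fence(memory_order_acquire);
    free(p);
    return true;
}
```

There is no "count == 1" shortcut in front of the decrement here, which matches the removal of atomic_read() described in the commit: every put goes through the same checked path.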
     

11 Jul, 2019

1 commit

  • Pull pidfd updates from Christian Brauner:
    "This adds two main features.

    - First, it adds polling support for pidfds. This allows process
    managers to know when a (non-parent) process dies in a race-free
    way.

    The notification mechanism used follows the same logic that is
    currently used when the parent of a task is notified of a child's
    death. With this patchset it is possible to put pidfds in an
    {e}poll loop and get reliable notifications for process (i.e.
    thread-group) exit.

    - The second feature complements the first one by making it possible
    to retrieve pollable pidfds for processes that were not created
    using CLONE_PIDFD.

    A lot of processes get created with traditional PID-based calls
    such as fork() or clone() (without CLONE_PIDFD). For these
    processes a caller can currently not create a pollable pidfd. This
    is a problem for Android's low memory killer (LMK) and service
    managers such as systemd.

    Both patchsets are accompanied by selftests.

    It's perhaps worth noting that the work done so far and the work done
    in this branch for pidfd_open() and polling support do already see
    some adoption:

    - Android is in the process of backporting this work to all their LTS
    kernels [1]

    - Service managers make use of pidfd_send_signal but will need to
    wait until we enable waiting on pidfds for full adoption.

    - And projects I maintain make use of both pidfd_send_signal and
    CLONE_PIDFD [2] and will use polling support and pidfd_open() too"

    [1] https://android-review.googlesource.com/q/topic:%22pidfd+polling+support+4.9+backport%22
    https://android-review.googlesource.com/q/topic:%22pidfd+polling+support+4.14+backport%22
    https://android-review.googlesource.com/q/topic:%22pidfd+polling+support+4.19+backport%22

    [2] https://github.com/lxc/lxc/blob/aab6e3eb73c343231cdde775db938994fc6f2803/src/lxc/start.c#L1753

    * tag 'pidfd-updates-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    tests: add pidfd_open() tests
    arch: wire-up pidfd_open()
    pid: add pidfd_open()
    pidfd: add polling selftests
    pidfd: add polling support

    Linus Torvalds
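
A rough userspace sketch of the polling feature described in this pull request follows. It assumes a kernel that provides pidfd_open() and pidfd polling, and it calls the syscall through syscall(2) because libc may not ship a wrapper. The names `sys_pidfd_open()` and `wait_for_exit()` are hypothetical helpers, and the fallback syscall number 434 is an assumption for headers that predate the syscall; check it for your architecture.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <poll.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434 /* assumption: verify for your architecture */
#endif

/* Thin wrapper around the raw syscall (hypothetical helper name). */
static int sys_pidfd_open(pid_t pid, unsigned int flags)
{
    return (int)syscall(SYS_pidfd_open, pid, flags);
}

/*
 * Wait up to timeout_ms for process `pid` to exit.
 * Returns 1 if it exited, 0 on timeout, -1 on error with errno set.
 */
static int wait_for_exit(pid_t pid, int timeout_ms)
{
    struct pollfd pfd;
    int ret, saved;

    pfd.fd = sys_pidfd_open(pid, 0);
    if (pfd.fd < 0)
        return -1;
    pfd.events = POLLIN; /* the pidfd becomes readable on exit */
    pfd.revents = 0;

    do {
        ret = poll(&pfd, 1, timeout_ms);
    } while (ret < 0 && errno == EINTR);

    saved = errno;
    close(pfd.fd);
    errno = saved;

    if (ret < 0)
        return -1;
    return (ret > 0 && (pfd.revents & POLLIN)) ? 1 : 0;
}
```

Because the pidfd refers to the process rather than to a reusable PID number, the caller does not need to be the parent, and the same descriptor could be added to a larger poll or epoll set next to other fds.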
     

28 Jun, 2019

2 commits

  • This adds the pidfd_open() syscall. It allows a caller to retrieve pollable
    pidfds for a process which did not get created via CLONE_PIDFD, i.e. for a
    process that is created via traditional fork()/clone() calls that is only
    referenced by a PID:

    int pidfd = pidfd_open(1234, 0);
    ret = pidfd_send_signal(pidfd, SIGSTOP, NULL, 0);

    With the introduction of pidfds through CLONE_PIDFD it is possible to
    create pidfds at process creation time.
    However, a lot of processes get created with traditional PID-based calls
    such as fork() or clone() (without CLONE_PIDFD). For these processes a
    caller can currently not create a pollable pidfd. This is a problem for
    Android's low memory killer (LMK) and service managers such as systemd.
    Both are examples of tools that want to make use of pidfds to get reliable
    notification of process exit for non-parents (pidfd polling) and race-free
    signal sending (pidfd_send_signal()). They intend to switch to this API for
    process supervision/management as soon as possible. Having no way to get
    pollable pidfds from PID-only processes is one of the biggest blockers for
    them in adopting this api. With pidfd_open() making it possible to retrieve
    pidfds for PID-based processes we enable them to adopt this api.

    In line with Arnd's recent changes to consolidate syscall numbers across
    architectures, I have added the pidfd_open() syscall to all architectures
    at the same time.

    Signed-off-by: Christian Brauner
    Reviewed-by: David Howells
    Reviewed-by: Oleg Nesterov
    Acked-by: Arnd Bergmann
    Cc: "Eric W. Biederman"
    Cc: Kees Cook
    Cc: Joel Fernandes (Google)
    Cc: Thomas Gleixner
    Cc: Jann Horn
    Cc: Andy Lutomirski
    Cc: Andrew Morton
    Cc: Aleksa Sarai
    Cc: Linus Torvalds
    Cc: Al Viro
    Cc: linux-api@vger.kernel.org

    Christian Brauner
     
  • This patch adds polling support to pidfd.

    Android low memory killer (LMK) needs to know when a process dies once
    it is sent the kill signal. It does so by checking for the existence of
    /proc/pid which is both racy and slow. For example, if a PID is reused
    between when LMK sends a kill signal and when it checks for existence of
    the PID, the wrong process is possibly checked for existence.
    Using the polling support, LMK will be able to get notified when a process
    exits in a race-free and fast way, and it can do other things (such as
    polling on other fds) while waiting for the process being killed to die.

    For notification to polling processes, we follow the same existing
    mechanism in the kernel used when the parent of the task group is to be
    notified of a child's death (do_notify_parent). This is precisely when the
    tasks waiting on a poll of pidfd are also awakened in this patch.

    We have decided to include the waitqueue in struct pid for the following
    reasons:
    1. The wait queue has to survive for the lifetime of the poll. Including
    it in task_struct would not be an option in this case because the task can
    be reaped and destroyed before the poll returns.

    2. Including the waitqueue in struct pid means that during
    de_thread(), the new thread group leader automatically gets the new
    waitqueue/pid even though its task_struct is different.

    Appropriate test cases are added in the second patch to provide coverage of
    all the cases the patch is handling.

    Cc: Andy Lutomirski
    Cc: Steven Rostedt
    Cc: Daniel Colascione
    Cc: Jann Horn
    Cc: Tim Murray
    Cc: Jonathan Kowalski
    Cc: Linus Torvalds
    Cc: Al Viro
    Cc: Kees Cook
    Cc: David Howells
    Cc: Oleg Nesterov
    Cc: kernel-team@android.com
    Reviewed-by: Oleg Nesterov
    Co-developed-by: Daniel Colascione
    Signed-off-by: Daniel Colascione
    Signed-off-by: Joel Fernandes (Google)
    Signed-off-by: Christian Brauner

    Joel Fernandes (Google)
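
The two-line pidfd_open()/pidfd_send_signal() snippet in the pidfd_open() commit above assumes wrappers that libc may not provide. A hedged, self-contained version using raw syscalls might look like the sketch below. `signal_via_pidfd()` is a hypothetical helper name, and the fallback syscall numbers 434 and 424 are assumptions for older headers; verify them for your architecture.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434 /* assumption: verify for your architecture */
#endif
#ifndef SYS_pidfd_send_signal
#define SYS_pidfd_send_signal 424 /* assumption: verify for your architecture */
#endif

/*
 * Open a pidfd for `pid` and send `sig` through it.
 * Returns 0 on success, -1 on failure with errno set.
 */
static int signal_via_pidfd(pid_t pid, int sig)
{
    int pidfd, ret, saved;

    pidfd = (int)syscall(SYS_pidfd_open, pid, 0U);
    if (pidfd < 0)
        return -1;

    /* NULL siginfo and flags == 0, as in the commit's example. */
    ret = (int)syscall(SYS_pidfd_send_signal, pidfd, sig, NULL, 0U);

    saved = errno;
    close(pidfd);
    errno = saved;
    return ret;
}
```

A supervisor would normally keep the pidfd open for as long as it manages the process instead of reopening it per signal; opening and closing it inside one helper only keeps the sketch short.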
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

15 May, 2019

1 commit

  • Hash functions are not needed since idr is used now. Let's remove hash
    header file for cleanup.

    Link: http://lkml.kernel.org/r/20190430053319.95913-1-scuttimmy@gmail.com
    Signed-off-by: Timmy Li
    Cc: "Eric W. Biederman"
    Cc: Michal Hocko
    Cc: Matthew Wilcox
    Cc: Oleg Nesterov
    Cc: Mike Rapoport
    Cc: KJ Tsanaktsidis
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Timmy Li
     

29 Dec, 2018

1 commit

  • The failure path removes the allocated PIDs from the wrong namespace.
    This could lead to us inadvertently reusing PIDs in the leaf namespace
    and leaking PIDs in parent namespaces.

    Fixes: 95846ecf9dac ("pid: replace pid bitmap implementation with IDR API")
    Cc:
    Signed-off-by: Matthew Wilcox
    Acked-by: "Eric W. Biederman"
    Reviewed-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

31 Oct, 2018

1 commit

  • Move remaining definitions and declarations from include/linux/bootmem.h
    into include/linux/memblock.h and remove the redundant header.

    The includes were replaced with the semantic patch below and then
    semi-automated removal of duplicated '#include <linux/memblock.h>'

    @@
    @@
    - #include <linux/bootmem.h>
    + #include <linux/memblock.h>

    [sfr@canb.auug.org.au: dma-direct: fix up for the removal of linux/bootmem.h]
    Link: http://lkml.kernel.org/r/20181002185342.133d1680@canb.auug.org.au
    [sfr@canb.auug.org.au: powerpc: fix up for removal of linux/bootmem.h]
    Link: http://lkml.kernel.org/r/20181005161406.73ef8727@canb.auug.org.au
    [sfr@canb.auug.org.au: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal]
    Link: http://lkml.kernel.org/r/20181008190341.5e396491@canb.auug.org.au
    Link: http://lkml.kernel.org/r/1536927045-23536-30-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Signed-off-by: Stephen Rothwell
    Acked-by: Michal Hocko
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Kroah-Hartman
    Cc: Guan Xuetao
    Cc: Ingo Molnar
    Cc: "James E.J. Bottomley"
    Cc: Jonas Bonn
    Cc: Jonathan Corbet
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Martin Schwidefsky
    Cc: Matt Turner
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Richard Kuo
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Serge Semin
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vineet Gupta
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

21 Sep, 2018

1 commit

  • Make the clone and fork syscalls return EAGAIN when the limit on the
    number of pids /proc/sys/kernel/pid_max is exceeded.

    Currently, when the pid_max limit is exceeded, the kernel will return
    ENOSPC from the fork and clone syscalls. This is contrary to the
    documented behaviour, which explicitly calls out the pid_max case as one
    where EAGAIN should be returned. It also leads to really confusing error
    messages in userspace programs which will complain about a lack of disk
    space when they fail to create processes/threads for this reason.

    This error is being returned because alloc_pid() uses the idr api to find
    a new pid; when there are none available, idr_alloc_cyclic() returns
    -ENOSPC, and this is being propagated back to userspace.

    This behaviour has been broken before, and was explicitly fixed in
    commit 35f71bc0a09a ("fork: report pid reservation failure properly"),
    so I think -EAGAIN is definitely the right thing to return in this case.
    The current behaviour change dates from commit 95846ecf9dac ("pid:
    replace pid bitmap implementation with IDR API") and was, I believe,
    unintentional.

    This patch has no impact on the case where allocating a pid fails because
    the child reaper for the namespace is dead; that case will still return
    -ENOMEM.

    Link: http://lkml.kernel.org/r/20180903111016.46461-1-ktsanaktsidis@zendesk.com
    Fixes: 95846ecf9dac ("pid: replace pid bitmap implementation with IDR API")
    Signed-off-by: KJ Tsanaktsidis
    Reviewed-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Gargi Sharma
    Cc: Rik van Riel
    Cc: Oleg Nesterov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    KJ Tsanaktsidis
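
The error translation described in this commit can be sketched with a toy allocator. This is an illustration under stated assumptions, not the kernel code: `struct toy_pid_ns`, `toy_alloc_cyclic()` and `toy_alloc_pid()` are hypothetical, and `TOY_PID_MAX` is a tiny stand-in for pid_max. The toy cyclic allocator reports exhaustion as -ENOSPC, the way the commit says idr_alloc_cyclic() does, and the caller maps that one case to -EAGAIN before it reaches userspace.

```c
#include <errno.h>
#include <stdbool.h>

#define TOY_PID_MAX 4 /* hypothetical tiny limit standing in for pid_max */

struct toy_pid_ns {
    bool used[TOY_PID_MAX + 1]; /* index 0 unused; ids are 1..TOY_PID_MAX */
    int last;                   /* last id handed out, 0 initially */
};

/* Cyclic search starting after the last allocated id; -ENOSPC when full. */
static int toy_alloc_cyclic(struct toy_pid_ns *ns)
{
    for (int i = 0; i < TOY_PID_MAX; i++) {
        int nr = (ns->last + i) % TOY_PID_MAX + 1;

        if (!ns->used[nr]) {
            ns->used[nr] = true;
            ns->last = nr;
            return nr;
        }
    }
    return -ENOSPC;
}

/* Allocate an id, reporting exhaustion as -EAGAIN instead of -ENOSPC. */
static int toy_alloc_pid(struct toy_pid_ns *ns)
{
    int nr = toy_alloc_cyclic(ns);

    if (nr < 0)
        return nr == -ENOSPC ? -EAGAIN : nr;
    return nr;
}
```

Other negative results pass through unchanged, which mirrors the commit's note that the dead-child-reaper case keeps its own error code.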
     

21 Jul, 2018

3 commits

  • Everywhere except in the pid array we distinguish between a task's pid and
    a task's tgid (thread group id). Even in the enumeration we want that
    distinction sometimes so we have added __PIDTYPE_TGID. With leader_pid
    we almost have an implementation of PIDTYPE_TGID in struct signal_struct.

    Add PIDTYPE_TGID as a first class member of the pid_type enumeration and
    into the pids array. Then remove the __PIDTYPE_TGID special case and the
    leader_pid in signal_struct.

    The net size increase is just an extra pointer added to struct pid and
    an extra pair of pointers of an hlist_node added to task_struct.

    The effect on code maintenance is the removal of a number of special
    cases today and the potential to remove many more special cases as
    PIDTYPE_TGID gets used to its fullest. The long term potential
    is allowing zombie thread group leaders to exit, which will remove
    a lot more special cases in the code.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • To access these fields the code always has to go to group leader so
    going to signal struct is no loss and is actually a fundamental simplification.

    This saves a little bit of memory by only allocating the pid pointer array
    once instead of once for every thread, and even better this removes a
    few potential races caused by the fact that group_leader can be changed
    by de_thread, while signal_struct can not.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • The cost is the same and this removes the need
    to worry about complications that come from de_thread
    and group_leader changing.

    __task_pid_nr_ns has been updated to take advantage of this change.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

12 Apr, 2018

1 commit

  • This results in no change in structure size on 64-bit machines as it
    fits in the padding between the gfp_t and the void *. 32-bit machines
    will grow the structure from 8 to 12 bytes. Almost all radix trees are
    protected with (at least) a spinlock, so as they are converted from
    radix trees to xarrays, the data structures will shrink again.

    Initialising the spinlock requires a name for the benefit of lockdep, so
    RADIX_TREE_INIT() now needs to know the name of the radix tree it's
    initialising, and so do IDR_INIT() and IDA_INIT().

    Also add the xa_lock() and xa_unlock() family of wrappers to make it
    easier to use the lock. If we could rely on -fplan9-extensions in the
    compiler, we could avoid all of this syntactic sugar, but that wasn't
    added until gcc 4.6.

    Link: http://lkml.kernel.org/r/20180313132639.17387-8-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Reviewed-by: Jeff Layton
    Cc: Darrick J. Wong
    Cc: Dave Chinner
    Cc: Ryusuke Konishi
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

07 Feb, 2018

1 commit

  • There are several functions that do find_task_by_vpid() followed by
    get_task_struct(). We can use a helper function instead.

    Link: http://lkml.kernel.org/r/1509602027-11337-1-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Acked-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

30 Jan, 2018

1 commit

  • Pull init_task initializer cleanups from David Howells:
    "It doesn't seem useful to have the init_task in a header file rather
    than in a normal source file. We could consolidate init_task handling
    instead and expand out various macros.

    Here's a series of patches that consolidate init_task handling:

    (1) Make THREAD_SIZE available to vmlinux.lds for cris, hexagon and
    openrisc.

    (2) Alter the INIT_TASK_DATA linker script macro to set
    init_thread_union and init_stack rather than defining these in C.

    Insert init_task and init_thread_info into the init_stack area in
    the linker script as appropriate to the configuration, with
    different section markers so that they end up correctly ordered.

    We can then merge ia64's init_task.c into the main one.

    We then have a bunch of single-use INIT_*() macros that seem only
    to be macros because they used to be used per-arch. We can then
    expand these in place of the user and get rid of a few lines and
    a lot of backslashes.

    (3) Expand INIT_TASK() in place.

    (4) Expand in place various small INIT_*() macros that are defined
    conditionally. Expand them and surround them by #if[n]def/#endif
    in the .c file as it takes fewer lines.

    (5) Expand INIT_SIGNALS() and INIT_SIGHAND() in place.

    (6) Expand INIT_STRUCT_PID in place.

    These macros can then be discarded"

    * tag 'init_task-20180117' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
    Expand INIT_STRUCT_PID and remove
    Expand the INIT_SIGNALS and INIT_SIGHAND macros and remove
    Expand various INIT_* macros and remove
    Expand INIT_TASK() in init/init_task.c and remove
    Construct init thread stack in the linker script rather than by union
    openrisc: Make THREAD_SIZE available to vmlinux.lds
    hexagon: Make THREAD_SIZE available to vmlinux.lds
    cris: Make THREAD_SIZE available to vmlinux.lds

    Linus Torvalds
     

17 Jan, 2018

1 commit

  • Expand INIT_STRUCT_PID in the single place that uses it and then remove it.
    There doesn't seem any point in the macro.

    Signed-off-by: David Howells
    Tested-by: Tony Luck
    Tested-by: Will Deacon (arm64)
    Tested-by: Palmer Dabbelt
    Acked-by: Thomas Gleixner

    David Howells
     

24 Dec, 2017

1 commit

  • With the replacement of the pid bitmap and hashtable with an idr,
    alloc_pid started occasionally failing when allocating the first pid
    in a pid namespace. Things were not completely reset, resulting in
    the first allocated pid getting the number 2 (not 1). This
    further resulted in ns->proc_mnt not getting set and eventually
    caused an oops in proc_flush_task.

    Oops: 0000 [#1] SMP
    CPU: 2 PID: 6743 Comm: trinity-c117 Not tainted 4.15.0-rc4-think+ #2
    RIP: 0010:proc_flush_task+0x8e/0x1b0
    RSP: 0018:ffffc9000bbffc40 EFLAGS: 00010286
    RAX: 0000000000000001 RBX: 0000000000000001 RCX: 00000000fffffffb
    RDX: 0000000000000000 RSI: ffffc9000bbffc50 RDI: 0000000000000000
    RBP: ffffc9000bbffc63 R08: 0000000000000000 R09: 0000000000000002
    R10: ffffc9000bbffb70 R11: ffffc9000bbffc64 R12: 0000000000000003
    R13: 0000000000000000 R14: 0000000000000003 R15: ffff8804c10d7840
    FS: 00007f7cb8965700(0000) GS:ffff88050a200000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000000 CR3: 00000003e21ae003 CR4: 00000000001606e0
    DR0: 00007fb1d6c22000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
    Call Trace:
    ? release_task+0xaf/0x680
    release_task+0xd2/0x680
    ? wait_consider_task+0xb82/0xce0
    wait_consider_task+0xbe9/0xce0
    ? do_wait+0xe1/0x330
    do_wait+0x151/0x330
    kernel_wait4+0x8d/0x150
    ? task_stopped_code+0x50/0x50
    SYSC_wait4+0x95/0xa0
    ? rcu_read_lock_sched_held+0x6c/0x80
    ? syscall_trace_enter+0x2d7/0x340
    ? do_syscall_64+0x60/0x210
    do_syscall_64+0x60/0x210
    entry_SYSCALL64_slow_path+0x25/0x25
    RIP: 0033:0x7f7cb82603aa
    RSP: 002b:00007ffd60770bc8 EFLAGS: 00000246
    ORIG_RAX: 000000000000003d
    RAX: ffffffffffffffda RBX: 00007f7cb6cd4000 RCX: 00007f7cb82603aa
    RDX: 000000000000000b RSI: 00007ffd60770bd0 RDI: 0000000000007cca
    RBP: 0000000000007cca R08: 00007f7cb8965700 R09: 00007ffd607c7080
    R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
    R13: 00007ffd60770bd0 R14: 00007f7cb6cd4058 R15: 00000000cccccccd
    Code: c1 e2 04 44 8b 60 30 48 8b 40 38 44 8b 34 11 48 c7 c2 60 3a f5 81 44 89 e1 4c 8b 68 58 e8 4b b4 77 00 89 44 24 14 48 8d 74 24 10 8b 7d 00 e8 b9 6a f9 ff 48 85 c0 74 1a 48 89 c7 48 89 44 24
    RIP: proc_flush_task+0x8e/0x1b0 RSP: ffffc9000bbffc40
    CR2: 0000000000000000
    ---[ end trace 53d67a6481059862 ]---

    Improve the quality of the implementation by resetting the place to
    start allocating pids on failure to allocate the first pid.

    As improving the quality of the implementation is the goal, remove the now
    unnecessary disable_pid_allocations call when we fail to mount proc.

    Fixes: 95846ecf9dac ("pid: replace pid bitmap implementation with IDR API")
    Fixes: 8ef047aaaeb8 ("pid namespaces: make alloc_pid(), free_pid() and put_pid() work with struct upid")
    Reported-by: Dave Jones
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

18 Nov, 2017

2 commits

  • pidhash is no longer required as all the information can be looked up
    from the idr tree. nr_hashed represented the number of pids that had been
    hashed. Since nr_hashed and PIDNS_HASH_ADDING are no longer relevant,
    they have been renamed to pid_allocated and PIDNS_ADDING respectively.

    [gs051095@gmail.com: v6]
    Link: http://lkml.kernel.org/r/1507760379-21662-3-git-send-email-gs051095@gmail.com
    Link: http://lkml.kernel.org/r/1507583624-22146-3-git-send-email-gs051095@gmail.com
    Signed-off-by: Gargi Sharma
    Reviewed-by: Rik van Riel
    Tested-by: Tony Luck [ia64]
    Cc: Julia Lawall
    Cc: Ingo Molnar
    Cc: Pavel Tatashin
    Cc: Kirill Tkhai
    Cc: Oleg Nesterov
    Cc: Eric W. Biederman
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gargi Sharma
     
  • Patch series "Replacing PID bitmap implementation with IDR API", v4.

    This series replaces kernel bitmap implementation of PID allocation with
    IDR API. These patches are written to simplify the kernel by replacing
    custom code with calls to generic code.

    The following are the stats for pid and pid_namespace object files
    before and after the replacement. There is a noteworthy change between
    the IDR and bitmap implementation.

    Before
       text    data     bss     dec     hex filename
       8447    3894      64   12405    3075 kernel/pid.o
    After
       text    data     bss     dec     hex filename
       3397     304       0    3701     e75 kernel/pid.o

    Before
       text    data     bss     dec     hex filename
       5692    1842     192    7726    1e2e kernel/pid_namespace.o
    After
       text    data     bss     dec     hex filename
       2854     216      16    3086     c0e kernel/pid_namespace.o

    The following are the stats for ps, pstree and calling readdir on /proc
    for 10,000 processes.

    ps:
            With IDR API    With bitmap
    real    0m1.479s        0m2.319s
    user    0m0.070s        0m0.060s
    sys     0m0.289s        0m0.516s

    pstree:
            With IDR API    With bitmap
    real    0m1.024s        0m1.794s
    user    0m0.348s        0m0.612s
    sys     0m0.184s        0m0.264s

    proc:
            With IDR API    With bitmap
    real    0m0.059s        0m0.074s
    user    0m0.000s        0m0.004s
    sys     0m0.016s        0m0.016s

    This patch (of 2):

    Replace the current bitmap implementation for Process ID allocation.
    Functions that are no longer required, for example, free_pidmap(),
    alloc_pidmap(), etc. are removed. The rest of the functions are
    modified to use the IDR API. The change was made to make the PID
    allocation less complex by replacing custom code with calls to generic
    API.

    [gs051095@gmail.com: v6]
    Link: http://lkml.kernel.org/r/1507760379-21662-2-git-send-email-gs051095@gmail.com
    [avagin@openvz.org: restore the old behaviour of the ns_last_pid sysctl]
    Link: http://lkml.kernel.org/r/20171106183144.16368-1-avagin@openvz.org
    Link: http://lkml.kernel.org/r/1507583624-22146-2-git-send-email-gs051095@gmail.com
    Signed-off-by: Gargi Sharma
    Reviewed-by: Rik van Riel
    Acked-by: Oleg Nesterov
    Cc: Julia Lawall
    Cc: Ingo Molnar
    Cc: Pavel Tatashin
    Cc: Kirill Tkhai
    Cc: Eric W. Biederman
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gargi Sharma
     

22 Aug, 2017

1 commit

  • This was reported many times, and this was even mentioned in commit
    52ee2dfdd4f5 ("pids: refactor vnr/nr_ns helpers to make them safe") but
    somehow nobody bothered to fix the obvious problem: task_tgid_nr_ns() is
    not safe because task->group_leader points to nowhere after the exiting
    task passes exit_notify(); rcu_read_lock() cannot help.

    We really need to change __unhash_process() to nullify group_leader,
    parent, and real_parent, but this needs some cleanups. Until then we
    can turn task_tgid_nr_ns() into another user of __task_pid_nr_ns() and
    fix the problem.

    Reported-by: Troy Kensinger
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

03 Aug, 2017

1 commit

  • After commit 3d375d78593c ("mm: update callers to use HASH_ZERO flag"),
    drop unused pidhash_size in pidhash_init().

    Link: http://lkml.kernel.org/r/1500389267-49222-1-git-send-email-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang
    Reviewed-by: Pavel Tatashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kefeng Wang
     

07 Jul, 2017

1 commit

  • Update dcache, inode, pid, mountpoint, and mount hash tables to use
    HASH_ZERO, and remove initialization after allocations. In case of
    places where HASH_EARLY was used such as in __pv_init_lock_hash the
    zeroed hash table was already assumed, because memblock zeroes the
    memory.

    CPU: SPARC M6, Memory: 7T
    Before fix:
    Dentry cache hash table entries: 1073741824
    Inode-cache hash table entries: 536870912
    Mount-cache hash table entries: 16777216
    Mountpoint-cache hash table entries: 16777216
    ftrace: allocating 20414 entries in 40 pages
    Total time: 11.798s

    After fix:
    Dentry cache hash table entries: 1073741824
    Inode-cache hash table entries: 536870912
    Mount-cache hash table entries: 16777216
    Mountpoint-cache hash table entries: 16777216
    ftrace: allocating 20414 entries in 40 pages
    Total time: 3.198s

    CPU: Intel Xeon E5-2630, Memory: 2.2T:
    Before fix:
    Dentry cache hash table entries: 536870912
    Inode-cache hash table entries: 268435456
    Mount-cache hash table entries: 8388608
    Mountpoint-cache hash table entries: 8388608
    CPU: Physical Processor ID: 0
    Total time: 3.245s

    After fix:
    Dentry cache hash table entries: 536870912
    Inode-cache hash table entries: 268435456
    Mount-cache hash table entries: 8388608
    Mountpoint-cache hash table entries: 8388608
    CPU: Physical Processor ID: 0
    Total time: 3.244s

    Link: http://lkml.kernel.org/r/1488432825-92126-4-git-send-email-pasha.tatashin@oracle.com
    Signed-off-by: Pavel Tatashin
    Reviewed-by: Babu Moger
    Cc: David Miller
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin
     

09 May, 2017

1 commit

  • alloc_pidmap() advances pid_namespace::last_pid. When the first pid
    allocation fails, the next created process will have pid 2 and
    pid_ns_prepare_proc() won't be called. So, pid_namespace::proc_mnt will
    never be initialized (not to mention that there won't be a child
    reaper).

    I saw a crash stack for such a case on kernel 3.10:

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: proc_flush_task+0x8f/0x1b0
    Call Trace:
    release_task+0x3f/0x490
    wait_consider_task.part.10+0x7ff/0xb00
    do_wait+0x11f/0x280
    SyS_wait4+0x7d/0x110

    We may fix this by restoring last_pid to 0 or by prohibiting further
    allocations. Since there was a similar issue in Oleg Nesterov's commit
    314a8ad0f18a ("pidns: fix free_pid() to handle the first fork failure"),
    and it was fixed via prohibiting allocation, let's follow this way and
    do the same.

    Link: http://lkml.kernel.org/r/149201021004.4863.6762095011554287922.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Cyrill Gorcunov
    Cc: Andrei Vagin
    Cc: Andreas Gruenbacher
    Cc: Kees Cook
    Cc: Michael Kerrisk
    Cc: Al Viro
    Cc: Oleg Nesterov
    Cc: Paul Moore
    Cc: Eric Biederman
    Cc: Andy Lutomirski
    Cc: Ingo Molnar
    Cc: Serge Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     

02 Mar, 2017

1 commit


14 Jan, 2017

1 commit

  • Since we need to change the implementation, stop exposing internals.

    Provide KREF_INIT() to allow static initialization of struct kref.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Greg Kroah-Hartman
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

28 May, 2016

1 commit

  • Most users of IS_ERR_VALUE() in the kernel are wrong, as they
    pass an 'int' into a function that takes an 'unsigned long'
    argument. This happens to work because the type is sign-extended
    on 64-bit architectures before it gets converted into an
    unsigned type.

    However, anything that passes an 'unsigned short' or 'unsigned int'
    argument into IS_ERR_VALUE() is guaranteed to be broken, as are
    8-bit integers and types that are wider than 'unsigned long'.

    Andrzej Hajda has already fixed a lot of the worst abusers that
    were causing actual bugs, but it would be nice to prevent any
    users that are not passing 'unsigned long' arguments.

    This patch changes all users of IS_ERR_VALUE() that I could find
    on 32-bit ARM randconfig builds and x86 allmodconfig. For the
    moment, this doesn't change the definition of IS_ERR_VALUE()
    because there are probably still architecture specific users
    elsewhere.

    Almost all the warnings I got are for files that are better off
    using 'if (err)' or 'if (err < 0)'.
    The only legitimate user I could find that we get a warning for
    is the (32-bit only) freescale fman driver, so I did not remove
    the IS_ERR_VALUE() there but changed the type to 'unsigned long'.
    For 9pfs, I just worked around one user whose calling conventions
    are so obscure that I did not dare change the behavior.

    I was using this definition for testing:

    #define IS_ERR_VALUE(x) ((unsigned long*)NULL == (typeof (x)*)NULL && \
    unlikely((unsigned long long)(x) >= (unsigned long long)(typeof(x))-MAX_ERRNO))

    which ends up making all 16-bit or wider types work correctly with
    the most plausible interpretation of what IS_ERR_VALUE() was supposed
    to return according to its users, but also causes a compile-time
    warning for any users that do not pass an 'unsigned long' argument.

    I suggested this approach earlier this year, but back then we ended
    up deciding to just fix the users that are obviously broken. After
    the initial warning that caused me to get involved in the discussion
    (fs/gfs2/dir.c) showed up again in the mainline kernel, Linus
    asked me to send the whole thing again.

    [ Updated the 9p parts as per Al Viro - Linus ]

    Signed-off-by: Arnd Bergmann
    Cc: Andrzej Hajda
    Cc: Andrew Morton
    Link: https://lkml.org/lkml/2016/1/7/363
    Link: https://lkml.org/lkml/2016/5/27/486
    Acked-by: Srinivas Kandagatla # For nvmem part
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     

01 Feb, 2016

1 commit


30 Jan, 2016

1 commit

  • Accidentally discovered this typo when I studied this module.

    Signed-off-by: Zhen Lei
    Cc: Hanjun Guo
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Tianhong Ding
    Cc: Xinwei Hu
    Cc: Zefan Li
    Link: http://lkml.kernel.org/r/1454119457-11272-1-git-send-email-thunder.leizhen@huawei.com
    Signed-off-by: Ingo Molnar

    Zhen Lei
     

15 Jan, 2016

1 commit

  • Mark those kmem allocations that are known to be easily triggered from
    userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
    memcg. For the list, see below:

    - threadinfo
    - task_struct
    - task_delay_info
    - pid
    - cred
    - mm_struct
    - vm_area_struct and vm_region (nommu)
    - anon_vma and anon_vma_chain
    - signal_struct
    - sighand_struct
    - fs_struct
    - files_struct
    - fdtable and fdtable->full_fds_bits
    - dentry and external_name
    - inode for all filesystems. This is the most tedious part, because
    most filesystems overwrite the alloc_inode method.

    The list is far from complete, so feel free to add more objects.
    Nevertheless, it should be close to "account everything" approach and
    keep most workloads within bounds. Malevolent users will be able to
    breach the limit, but this was possible even with the former "account
    everything" approach (simply because it did not account everything in
    fact).

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

25 Nov, 2015

1 commit

  • I got a crash during a "perf top" session that was caused by a race in
    __task_pid_nr_ns() :

    pid_nr_ns() was inlined, but apparently the compiler chose to read
    task->pids[type].pid twice, and the pid->level dereference crashed
    because we got a NULL pointer at the second read :

    if (pid && ns->level <= pid->level) { // CRASH

    Just use RCU API properly to solve this race, and not worry about "perf
    top" crashing hosts :(

    get_task_pid() can benefit from same fix.
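
    The core of the fix is to load the shared pointer once and use only
    that local copy for both the NULL check and the dereference. A minimal
    userspace sketch of that pattern follows, using a C11 atomic load as a
    stand-in for the kernel's RCU accessors. All names ending in _sketch
    are hypothetical, and the sketch does not model the lifetime guarantee
    that an RCU read-side critical section provides.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct pid_sketch { unsigned int level; int nr; };
struct ns_sketch { unsigned int level; };

/* The pid pointer may be cleared concurrently by another thread. */
struct task_sketch { _Atomic(struct pid_sketch *) pid; };

/* Returns the pid number, or 0 if the task has no pid visible from ns. */
int task_pid_nr_sketch(struct task_sketch *task, const struct ns_sketch *ns)
{
    /* Single load: the check and the dereference below see the same value. */
    struct pid_sketch *pid =
        atomic_load_explicit(&task->pid, memory_order_acquire);

    assert(ns != NULL);
    if (pid && ns->level <= pid->level)
        return pid->nr;
    return 0;
}
```

    Reading task->pid a second time after the NULL check is what reopens
    the race, so the local copy is the important part.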

    Signed-off-by: Eric Dumazet
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     

23 Jul, 2015

1 commit


17 Apr, 2015

1 commit

  • copy_process will report any failure in alloc_pid as ENOMEM currently
    which is misleading because the pid allocation might fail not only when
    the memory is short but also when the pid space is consumed already.

    The current man page even mentions this case:

    : EAGAIN
    :
    : A system-imposed limit on the number of threads was encountered.
    : There are a number of limits that may trigger this error: the
    : RLIMIT_NPROC soft resource limit (set via setrlimit(2)), which
    : limits the number of processes and threads for a real user ID, was
    : reached; the kernel's system-wide limit on the number of processes
    : and threads, /proc/sys/kernel/threads-max, was reached (see
    : proc(5)); or the maximum number of PIDs, /proc/sys/kernel/pid_max,
    : was reached (see proc(5)).

    so the current behavior is also incorrect wrt. documentation. The POSIX
    man page also suggests returning EAGAIN when the process count limit is
    reached.

    This patch simply propagates error code from alloc_pid and makes sure we
    return -EAGAIN due to reservation failure. This will make behavior of
    fork closer to both our documentation and POSIX.

    alloc_pid might also fail when the reaper in the pid namespace is dead
    (the namespace basically disallows all new processes) and there is no
    good error code which would match documented ones. We have traditionally
    returned ENOMEM for this case which is misleading as well but as per
    Eric W. Biederman this behavior is documented in man pid_namespaces(7)

    : If the "init" process of a PID namespace terminates, the kernel
    : terminates all of the processes in the namespace via a SIGKILL signal.
    : This behavior reflects the fact that the "init" process is essential for
    : the correct operation of a PID namespace. In this case, a subsequent
    : fork(2) into this PID namespace will fail with the error ENOMEM; it is
    : not possible to create a new processes in a PID namespace whose "init"
    : process has terminated.

    and introducing a new error code would be too risky so let's stick to
    ENOMEM for this case.
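
    One way to picture the propagation is with an error-encoded pointer
    return. The sketch below is a userspace model under that assumption.
    The helpers err_ptr_sketch, ptr_err_sketch and is_err_sketch, the
    alloc_outcome enum and both functions are hypothetical names for
    illustration, not the kernel's code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno in a pointer, and decode it again. */
static inline void *err_ptr_sketch(long err) { return (void *)(intptr_t)err; }
static inline long ptr_err_sketch(const void *p) { return (long)(intptr_t)p; }
static inline int is_err_sketch(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

struct pid_sketch { int nr; };

enum alloc_outcome {
    ALLOC_OK, ALLOC_NO_MEMORY, ALLOC_PID_SPACE_FULL, ALLOC_REAPER_DEAD
};

/* The allocator reports why it failed instead of returning a bare NULL. */
struct pid_sketch *alloc_pid_sketch(enum alloc_outcome outcome)
{
    static struct pid_sketch the_pid = { .nr = 1234 };

    switch (outcome) {
    case ALLOC_OK:
        return &the_pid;
    case ALLOC_PID_SPACE_FULL:
        return err_ptr_sketch(-EAGAIN);  /* pid reservation failed */
    case ALLOC_NO_MEMORY:
    case ALLOC_REAPER_DEAD:              /* kept as ENOMEM, as documented */
    default:
        return err_ptr_sketch(-ENOMEM);
    }
}

/* The caller passes the code through rather than hardcoding -ENOMEM. */
long copy_process_sketch(enum alloc_outcome outcome)
{
    struct pid_sketch *pid = alloc_pid_sketch(outcome);

    if (is_err_sketch(pid))
        return ptr_err_sketch(pid);
    assert(pid != NULL);
    return pid->nr;
}
```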

    Signed-off-by: Michal Hocko
    Cc: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

17 Dec, 2014

1 commit

  • Pull vfs pile #2 from Al Viro:
    "Next pile (and there'll be one or two more).

    The large piece in this one is getting rid of /proc/*/ns/* weirdness;
    among other things, it allows to (finally) make nameidata completely
    opaque outside of fs/namei.c, making for easier further cleanups in
    there"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    coda_venus_readdir(): use file_inode()
    fs/namei.c: fold link_path_walk() call into path_init()
    path_init(): don't bother with LOOKUP_PARENT in argument
    fs/namei.c: new helper (path_cleanup())
    path_init(): store the "base" pointer to file in nameidata itself
    make default ->i_fop have ->open() fail with ENXIO
    make nameidata completely opaque outside of fs/namei.c
    kill proc_ns completely
    take the targets of /proc/*/ns/* symlinks to separate fs
    bury struct proc_ns in fs/proc
    copy address of proc_ns_ops into ns_common
    new helpers: ns_alloc_inum/ns_free_inum
    make proc_ns_operations work with struct ns_common * instead of void *
    switch the rest of proc_ns_operations to working with &...->ns
    netns: switch ->get()/->put()/->install()/->inum() to working with &net->ns
    make mntns ->get()/->put()/->install()/->inum() work with &mnt_ns->ns
    common object embedded into various struct ....ns

    Linus Torvalds
     

11 Dec, 2014

1 commit

  • alloc_pid() does get_pid_ns() beforehand but forgets to put_pid_ns() if it
    fails because disable_pid_allocation() was called by the exiting
    child_reaper.

    We could simply move get_pid_ns() down to successful return, but this fix
    tries to be as trivial as possible.
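
    The shape of the leak and of the trivial fix can be sketched in
    userspace with a plain counter standing in for the namespace
    reference count. Every name ending in _sketch is hypothetical and the
    allocation itself is not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct ns_sketch { int refs; bool allocation_disabled; };
struct pid_sketch { struct ns_sketch *ns; };

static void get_ns_sketch(struct ns_sketch *ns) { ns->refs++; }

static void put_ns_sketch(struct ns_sketch *ns)
{
    assert(ns->refs > 0);
    ns->refs--;
}

/* Returns true on success; on failure ns->refs is left as it was on entry. */
bool alloc_pid_sketch(struct ns_sketch *ns, struct pid_sketch *pid)
{
    get_ns_sketch(ns);          /* reference taken beforehand */

    if (ns->allocation_disabled)
        goto out_put;           /* the reaper is exiting: no new pids */

    pid->ns = ns;               /* success: the pid keeps the reference */
    return true;

out_put:
    put_ns_sketch(ns);          /* the drop the failure path had forgotten */
    return false;
}
```

    Moving the get down to the successful return would avoid the extra
    put, which is the alternative the message mentions.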

    Signed-off-by: Oleg Nesterov
    Reviewed-by: "Eric W. Biederman"
    Cc: Aaron Tomlin
    Cc: Pavel Emelyanov
    Cc: Serge Hallyn
    Cc: Sterling Alexander
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

05 Dec, 2014

2 commits


01 Oct, 2013

1 commit

  • "case 0" in free_pid() assumes that disable_pid_allocation() should
    clear PIDNS_HASH_ADDING before the last pid goes away.

    However this doesn't happen if the first fork() fails to create the
    child reaper which should call disable_pid_allocation().

    Signed-off-by: Oleg Nesterov
    Reviewed-by: "Eric W. Biederman"
    Cc: "Serge E. Hallyn"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

31 Aug, 2013

1 commit

  • Serge Hallyn writes:

    > Since commit af4b8a83add95ef40716401395b44a1b579965f4 it's been
    > possible to get into a situation where a pidns reaper is
    > <defunct>, reparented to host pid 1, but never reaped. How to
    > reproduce this is documented at
    >
    > https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1168526
    > (and see
    > https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1168526/comments/13)
    > In short, run repeated starts of a container whose init is
    >
    > Process.exit(0);
    >
    > sysrq-t when such a task is playing zombie shows:
    >
    > [ 131.132978] init x ffff88011fc14580 0 2084 2039 0x00000000
    > [ 131.132978] ffff880116e89ea8 0000000000000002 ffff880116e89fd8 0000000000014580
    > [ 131.132978] ffff880116e89fd8 0000000000014580 ffff8801172a0000 ffff8801172a0000
    > [ 131.132978] ffff8801172a0630 ffff88011729fff0 ffff880116e14650 ffff88011729fff0
    > [ 131.132978] Call Trace:
    > [ 131.132978] [] schedule+0x29/0x70
    > [ 131.132978] [] do_exit+0x6e1/0xa40
    > [ 131.132978] [] ? signal_wake_up_state+0x1e/0x30
    > [ 131.132978] [] do_group_exit+0x3f/0xa0
    > [ 131.132978] [] SyS_exit_group+0x14/0x20
    > [ 131.132978] [] tracesys+0xe1/0xe6
    >
    > Further debugging showed that every time this happened, zap_pid_ns_processes()
    > started with nr_hashed being 3, while we were expecting it to drop to 2.
    > Any time it didn't happen, nr_hashed was 1 or 2. So the reaper was
    > waiting for nr_hashed to become 2, but free_pid() only wakes the reaper
    > if nr_hashed hits 1.

    The issue is that when the task group leader of an init process exits
    before the other tasks of the init process, then by the time the init
    process finally exits it is a secondary task sleeping in
    zap_pid_ns_processes and waiting to be woken when the number of hashed
    pids drops to two. This case waits forever, as free_pid only sends a
    wake up when the number of hashed pids drops to 1.

    To correct this, the simple strategy of sending a possibly unnecessary
    wake up when the number of hashed pids drops to 2 is adopted.

    Sending one extraneous wake up is relatively harmless; at worst we
    waste a little cpu time in the rare case when a pid namespace
    approaches exiting.

    We can detect the case when the pid namespace drops to just two pids
    hashed race free in free_pid.
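
    The decision described above can be sketched as a switch on the
    decremented count. This is a userspace model only: the enum, the
    struct and free_pid_sketch are hypothetical names, locking is omitted,
    and what happens when the count reaches zero is not modeled beyond
    reporting it.

```c
#include <assert.h>

enum free_pid_action {
    ACTION_NONE,          /* plenty of pids left, nothing to do */
    ACTION_WAKE_REAPER,   /* the reaper may be waiting for this count */
    ACTION_LAST_PID_GONE  /* no hashed pids remain in the namespace */
};

struct ns_sketch { unsigned int nr_hashed; };

/* Drops one hashed pid and reports what the caller should do next. */
enum free_pid_action free_pid_sketch(struct ns_sketch *ns)
{
    assert(ns->nr_hashed > 0);

    switch (--ns->nr_hashed) {
    case 2: /* reaper is a non-leader thread, its leader's pid is still hashed */
    case 1: /* reaper is the only task left */
        return ACTION_WAKE_REAPER;
    case 0:
        return ACTION_LAST_PID_GONE;
    default:
        return ACTION_NONE;
    }
}
```

    Waking at both 2 and 1 costs at most one spurious wake up per
    namespace teardown.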

    Dereferencing pid_ns->child_reaper with the pidmap_lock held is safe
    without the tasklist_lock because it is guaranteed that detach_pid
    will be called on the child_reaper before it is freed, and detach_pid
    calls __change_pid, which calls free_pid, which takes the pidmap_lock.
    __change_pid only calls free_pid if this is the last use of the pid.
    For a thread that is not the thread group leader, the thread's pid
    will only ever have one user because a thread's pid is not allowed to
    be the pid of a process, of a process group or of a session. For a
    thread that is a thread group leader, all of the other threads of that
    process will be reaped before the thread group leader is allowed to be
    reaped, ensuring there will only be one user of the thread's pid as a
    process pid. Furthermore, because the thread is the init process of a
    pid namespace, all of the other processes in the pid namespace will
    already have been freed, so the pid will not be used as a session pid
    or a process group pid by any other running process.

    CC: stable@vger.kernel.org
    Acked-by: Serge Hallyn
    Tested-by: Serge Hallyn
    Reported-by: Serge Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

04 Jul, 2013

2 commits

  • Move statement to static initialization of init_pid_ns.

    Signed-off-by: Raphael S. Carvalho
    Cc: "Eric W. Biederman"
    Acked-by: Serge Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Raphael S. Carvalho
     
  • copy_process() adds the new child to thread_group/init_task.tasks list and
    then does attach_pid(child, PIDTYPE_PID). This means that the lockless
    next_thread() or next_task() can see this thread with the wrong pid. Say,
    "ls /proc/pid/task" can list the same inode twice.

    We could move attach_pid(child, PIDTYPE_PID) up, but in this case
    find_task_by_vpid() can find the new thread before it was fully
    initialized.

    And this is already true for PIDTYPE_PGID/PIDTYPE_SID. With this patch
    copy_process() initializes child->pids[*].pid first, then calls
    attach_pid() to insert the task into the pid->tasks list.

    attach_pid() no longer needs the "struct pid*" argument; it is always
    called after pid_link->pid was already set.

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Michal Hocko
    Cc: Pavel Emelyanov
    Cc: Sergey Dyasly
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov