13 Jan, 2019

1 commit

  • commit 1a80dade010c7a7f4885a4c4c2a7ac22cc7b34df upstream.

    The failure path removes the allocated PIDs from the wrong namespace.
    This could lead to us inadvertently reusing PIDs in the leaf namespace
    and leaking PIDs in parent namespaces.

    Fixes: 95846ecf9dac ("pid: replace pid bitmap implementation with IDR API")
    Cc:
    Signed-off-by: Matthew Wilcox
    Acked-by: "Eric W. Biederman"
    Reviewed-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Matthew Wilcox
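
The unwinding rule described in this entry can be sketched in user space. This is a minimal model, not the kernel code: `ns_model`, `upid_model`, `ns_alloc` and `alloc_all_levels` are hypothetical names, and a small boolean table stands in for the per-namespace IDR. The point is that each number is released through the namespace recorded next to it.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_LEVELS 4
#define MAX_IDS    8

/* Hypothetical stand-in for a pid namespace: a tiny id table and a parent link. */
struct ns_model {
	struct ns_model *parent;
	int level;              /* 0 is the root namespace */
	bool used[MAX_IDS];
};

/* Hypothetical stand-in for struct upid: a number plus the namespace that issued it. */
struct upid_model {
	int nr;
	struct ns_model *ns;
};

/* Hand out the lowest free number in one namespace, or -1 if it is full. */
static int ns_alloc(struct ns_model *ns)
{
	for (int i = 1; i < MAX_IDS; i++) {
		if (!ns->used[i]) {
			ns->used[i] = true;
			return i;
		}
	}
	return -1;
}

/*
 * Allocate one number per level, walking from the leaf up to the root.
 * On failure, release every number already taken through the namespace
 * stored alongside it, not through the namespace where the walk stopped.
 */
static int alloc_all_levels(struct ns_model *leaf, struct upid_model *numbers)
{
	struct ns_model *tmp = leaf;
	int i;

	assert(leaf->level < MAX_LEVELS);
	for (i = leaf->level; i >= 0; i--) {
		int nr = ns_alloc(tmp);

		if (nr < 0)
			goto out_free;
		numbers[i].nr = nr;
		numbers[i].ns = tmp;
		tmp = tmp->parent;
	}
	return 0;

out_free:
	while (++i <= leaf->level)
		numbers[i].ns->used[numbers[i].nr] = false;
	return -1;
}
```

If the cleanup loop used the namespace where the walk stopped instead of `numbers[i].ns`, it would clear an entry in a parent table and leave the leaf entry allocated.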
     

21 Sep, 2018

1 commit

  • Make the clone and fork syscalls return EAGAIN when the limit on the
    number of pids /proc/sys/kernel/pid_max is exceeded.

    Currently, when the pid_max limit is exceeded, the kernel will return
    ENOSPC from the fork and clone syscalls. This is contrary to the
    documented behaviour, which explicitly calls out the pid_max case as one
    where EAGAIN should be returned. It also leads to really confusing error
    messages in userspace programs which will complain about a lack of disk
    space when they fail to create processes/threads for this reason.

    This error is being returned because alloc_pid() uses the idr api to find
    a new pid; when there are none available, idr_alloc_cyclic() returns
    -ENOSPC, and this is being propagated back to userspace.

    This behaviour has been broken before, and was explicitly fixed in
    commit 35f71bc0a09a ("fork: report pid reservation failure properly"),
    so I think -EAGAIN is definitely the right thing to return in this case.
    The current behaviour change dates from commit 95846ecf9dac ("pid:
    replace pid bitmap implementation with IDR API") and was, I believe,
    unintentional.

    This patch has no impact on the case where allocating a pid fails because
    the child reaper for the namespace is dead; that case will still return
    -ENOMEM.

    Link: http://lkml.kernel.org/r/20180903111016.46461-1-ktsanaktsidis@zendesk.com
    Fixes: 95846ecf9dac ("pid: replace pid bitmap implementation with IDR API")
    Signed-off-by: KJ Tsanaktsidis
    Reviewed-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Gargi Sharma
    Cc: Rik van Riel
    Cc: Oleg Nesterov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    KJ Tsanaktsidis
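
The error translation this entry describes amounts to one comparison. The helper below is a hypothetical user-space sketch (the name `pid_alloc_result` is an assumption, not a kernel function): an exhausted id space, reported by the allocator as -ENOSPC, is turned into -EAGAIN, and every other result passes through unchanged.

```c
#include <errno.h>

/*
 * Hypothetical helper: map the id allocator's result to the value that
 * should reach the caller of fork()/clone(). Only -ENOSPC is rewritten.
 */
static int pid_alloc_result(int nr)
{
	return nr == -ENOSPC ? -EAGAIN : nr;
}
```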
     

21 Jul, 2018

3 commits

  • Everywhere except in the pid array we distinguish between a task's pid and
    a task's tgid (thread group id). Even in the enumeration we want that
    distinction sometimes so we have added __PIDTYPE_TGID. With leader_pid
    we almost have an implementation of PIDTYPE_TGID in struct signal_struct.

    Add PIDTYPE_TGID as a first class member of the pid_type enumeration and
    into the pids array. Then remove the __PIDTYPE_TGID special case and the
    leader_pid in signal_struct.

    The net size increase is just an extra pointer added to struct pid and
    an extra pair of pointers of an hlist_node added to task_struct.

    The effect on code maintenance is the removal of a number of special
    cases today and the potential to remove many more special cases as
    PIDTYPE_TGID gets used to its fullest. The long term potential
    is allowing zombie thread group leaders to exit, which will remove
    a lot more special cases in the code.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
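
A simplified model of the resulting layout may help. It assumes a per-thread pid pointer kept in the task and the remaining pid types kept in a structure shared by the whole thread group; all `*_model` names are hypothetical and the structures are reduced to the fields needed here.

```c
#include <stddef.h>

/* PIDTYPE_TGID as a first-class member of the enumeration. */
enum pid_type_model {
	PIDTYPE_PID,
	PIDTYPE_TGID,
	PIDTYPE_PGID,
	PIDTYPE_SID,
	PIDTYPE_MAX
};

struct pid_model {
	int nr;
};

/* Shared by every thread in the group; allocated once per process. */
struct signal_model {
	struct pid_model *pids[PIDTYPE_MAX];
};

struct task_model {
	struct pid_model *thread_pid;   /* per-thread pid */
	struct signal_model *signal;    /* shared group state */
};

/* Pick the slot that holds the pid of the requested type for this task. */
static struct pid_model **task_pid_ptr_model(struct task_model *task,
					     enum pid_type_model type)
{
	return type == PIDTYPE_PID ? &task->thread_pid
				   : &task->signal->pids[type];
}

/* Number for the requested pid type, or 0 when no pid is attached. */
static int task_nr_model(struct task_model *task, enum pid_type_model type)
{
	struct pid_model *pid = *task_pid_ptr_model(task, type);

	return pid ? pid->nr : 0;
}
```

In this model a lookup of the tgid never goes through a group leader pointer, so a change of leader cannot affect it.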
     
  • To access these fields the code always has to go to group leader so
    going to signal struct is no loss and is actually a fundamental simplification.

    This saves a little bit of memory by only allocating the pid pointer array
    once instead of once for every thread, and even better this removes a
    few potential races caused by the fact that group_leader can be changed
    by de_thread, while signal_struct can not.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • The cost is the same and this removes the need
    to worry about complications that come from de_thread
    and group_leader changing.

    __task_pid_nr_ns has been updated to take advantage of this change.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

12 Apr, 2018

1 commit

  • This results in no change in structure size on 64-bit machines as it
    fits in the padding between the gfp_t and the void *. 32-bit machines
    will grow the structure from 8 to 12 bytes. Almost all radix trees are
    protected with (at least) a spinlock, so as they are converted from
    radix trees to xarrays, the data structures will shrink again.

    Initialising the spinlock requires a name for the benefit of lockdep, so
    RADIX_TREE_INIT() now needs to know the name of the radix tree it's
    initialising, and so do IDR_INIT() and IDA_INIT().

    Also add the xa_lock() and xa_unlock() family of wrappers to make it
    easier to use the lock. If we could rely on -fplan9-extensions in the
    compiler, we could avoid all of this syntactic sugar, but that wasn't
    added until gcc 4.6.

    Link: http://lkml.kernel.org/r/20180313132639.17387-8-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Reviewed-by: Jeff Layton
    Cc: Darrick J. Wong
    Cc: Dave Chinner
    Cc: Ryusuke Konishi
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

07 Feb, 2018

1 commit

  • There are several functions that do find_task_by_vpid() followed by
    get_task_struct(). We can use a helper function instead.

    Link: http://lkml.kernel.org/r/1509602027-11337-1-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Acked-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
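
The pattern being folded into a helper is "look up, then take a reference". The sketch below models it in user space with hypothetical names (`task_model`, `find_task_model`, `find_get_task_model`); the read-side locking the kernel needs around the lookup is deliberately left out.

```c
#include <stddef.h>

struct task_model {
	int vpid;
	int usage;      /* reference count */
};

/* Plain lookup: returns the task without touching its reference count. */
static struct task_model *find_task_model(struct task_model *tasks, size_t n,
					  int vpid)
{
	for (size_t i = 0; i < n; i++) {
		if (tasks[i].vpid == vpid)
			return &tasks[i];
	}
	return NULL;
}

/* Combined helper: look up the task and, if found, take a reference. */
static struct task_model *find_get_task_model(struct task_model *tasks,
					      size_t n, int vpid)
{
	struct task_model *task = find_task_model(tasks, n, vpid);

	if (task)
		task->usage++;
	return task;
}
```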
     

30 Jan, 2018

1 commit

  • Pull init_task initializer cleanups from David Howells:
    "It doesn't seem useful to have the init_task in a header file rather
    than in a normal source file. We could consolidate init_task handling
    instead and expand out various macros.

    Here's a series of patches that consolidate init_task handling:

    (1) Make THREAD_SIZE available to vmlinux.lds for cris, hexagon and
    openrisc.

    (2) Alter the INIT_TASK_DATA linker script macro to set
    init_thread_union and init_stack rather than defining these in C.

    Insert init_task and init_thread_info into the init_stack area in
    the linker script as appropriate to the configuration, with
    different section markers so that they end up correctly ordered.

    We can then merge ia64's init_task.c into the main one.

    We then have a bunch of single-use INIT_*() macros that seem only
    to be macros because they used to be used per-arch. We can then
    expand these in place of the user and get rid of a few lines and
    a lot of backslashes.

    (3) Expand INIT_TASK() in place.

    (4) Expand in place various small INIT_*() macros that are defined
    conditionally. Expand them and surround them by #if[n]def/#endif
    in the .c file as it takes fewer lines.

    (5) Expand INIT_SIGNALS() and INIT_SIGHAND() in place.

    (6) Expand INIT_STRUCT_PID in place.

    These macros can then be discarded"

    * tag 'init_task-20180117' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
    Expand INIT_STRUCT_PID and remove
    Expand the INIT_SIGNALS and INIT_SIGHAND macros and remove
    Expand various INIT_* macros and remove
    Expand INIT_TASK() in init/init_task.c and remove
    Construct init thread stack in the linker script rather than by union
    openrisc: Make THREAD_SIZE available to vmlinux.lds
    hexagon: Make THREAD_SIZE available to vmlinux.lds
    cris: Make THREAD_SIZE available to vmlinux.lds

    Linus Torvalds
     

17 Jan, 2018

1 commit

  • Expand INIT_STRUCT_PID in the single place that uses it and then remove it.
    There doesn't seem any point in the macro.

    Signed-off-by: David Howells
    Tested-by: Tony Luck
    Tested-by: Will Deacon (arm64)
    Tested-by: Palmer Dabbelt
    Acked-by: Thomas Gleixner

    David Howells
     

24 Dec, 2017

1 commit

  • With the replacement of the pid bitmap and hashtable with an idr,
    alloc_pid started occasionally failing when allocating the first pid
    in a pid namespace. Things were not completely reset, resulting in
    the first allocated pid getting the number 2 (not 1), which
    further resulted in ns->proc_mnt not getting set and eventually
    caused an oops in proc_flush_task.

    Oops: 0000 [#1] SMP
    CPU: 2 PID: 6743 Comm: trinity-c117 Not tainted 4.15.0-rc4-think+ #2
    RIP: 0010:proc_flush_task+0x8e/0x1b0
    RSP: 0018:ffffc9000bbffc40 EFLAGS: 00010286
    RAX: 0000000000000001 RBX: 0000000000000001 RCX: 00000000fffffffb
    RDX: 0000000000000000 RSI: ffffc9000bbffc50 RDI: 0000000000000000
    RBP: ffffc9000bbffc63 R08: 0000000000000000 R09: 0000000000000002
    R10: ffffc9000bbffb70 R11: ffffc9000bbffc64 R12: 0000000000000003
    R13: 0000000000000000 R14: 0000000000000003 R15: ffff8804c10d7840
    FS: 00007f7cb8965700(0000) GS:ffff88050a200000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000000 CR3: 00000003e21ae003 CR4: 00000000001606e0
    DR0: 00007fb1d6c22000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
    Call Trace:
    ? release_task+0xaf/0x680
    release_task+0xd2/0x680
    ? wait_consider_task+0xb82/0xce0
    wait_consider_task+0xbe9/0xce0
    ? do_wait+0xe1/0x330
    do_wait+0x151/0x330
    kernel_wait4+0x8d/0x150
    ? task_stopped_code+0x50/0x50
    SYSC_wait4+0x95/0xa0
    ? rcu_read_lock_sched_held+0x6c/0x80
    ? syscall_trace_enter+0x2d7/0x340
    ? do_syscall_64+0x60/0x210
    do_syscall_64+0x60/0x210
    entry_SYSCALL64_slow_path+0x25/0x25
    RIP: 0033:0x7f7cb82603aa
    RSP: 002b:00007ffd60770bc8 EFLAGS: 00000246
    ORIG_RAX: 000000000000003d
    RAX: ffffffffffffffda RBX: 00007f7cb6cd4000 RCX: 00007f7cb82603aa
    RDX: 000000000000000b RSI: 00007ffd60770bd0 RDI: 0000000000007cca
    RBP: 0000000000007cca R08: 00007f7cb8965700 R09: 00007ffd607c7080
    R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
    R13: 00007ffd60770bd0 R14: 00007f7cb6cd4058 R15: 00000000cccccccd
    Code: c1 e2 04 44 8b 60 30 48 8b 40 38 44 8b 34 11 48 c7 c2 60 3a f5 81 44 89 e1 4c 8b 68 58 e8 4b b4 77 00 89 44 24 14 48 8d 74 24 10 8b 7d 00 e8 b9 6a f9 ff 48 85 c0 74 1a 48 89 c7 48 89 44 24
    RIP: proc_flush_task+0x8e/0x1b0 RSP: ffffc9000bbffc40
    CR2: 0000000000000000
    ---[ end trace 53d67a6481059862 ]---

    Improve the quality of the implementation by resetting the place to
    start allocating pids on failure to allocate the first pid.

    As improving the quality of the implementation is the goal, remove the now
    unnecessary disable_pid_allocations call when we fail to mount proc.

    Fixes: 95846ecf9dac ("pid: replace pid bitmap implementation with IDR API")
    Fixes: 8ef047aaaeb8 ("pid namespaces: make alloc_pid(), free_pid() and put_pid() work with struct upid")
    Reported-by: Dave Jones
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
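
The reset described in this entry can be modelled with two counters. This is a sketch under stated assumptions: `pidns_model`, `reserve_nr`, `commit_nr` and `abort_nr` are hypothetical, and the model only tracks where the next number comes from and how many numbers are live.

```c
struct pidns_model {
	int next;       /* number the next reservation will return */
	int allocated;  /* numbers that were reserved and kept */
};

/* Take the next number and advance the starting point. */
static int reserve_nr(struct pidns_model *ns)
{
	int nr = ns->next;

	ns->next = nr + 1;
	return nr;
}

/* The rest of process creation succeeded: the number stays in use. */
static void commit_nr(struct pidns_model *ns)
{
	ns->allocated++;
}

/*
 * A later step failed. If nothing has ever been kept in this namespace,
 * move the starting point back so that the next attempt gets number 1.
 */
static void abort_nr(struct pidns_model *ns)
{
	if (ns->allocated == 0)
		ns->next = 1;
}
```

Without the reset in `abort_nr`, a failed first attempt would leave `next` at 2 and the first process that does get created would never be number 1.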
     

18 Nov, 2017

2 commits

  • pidhash is no longer required, as all the information can be looked up
    from the idr tree. nr_hashed represented the number of pids that had been
    hashed. Since nr_hashed and PIDNS_HASH_ADDING are no longer relevant,
    they have been renamed to pid_allocated and PIDNS_ADDING respectively.

    [gs051095@gmail.com: v6]
    Link: http://lkml.kernel.org/r/1507760379-21662-3-git-send-email-gs051095@gmail.com
    Link: http://lkml.kernel.org/r/1507583624-22146-3-git-send-email-gs051095@gmail.com
    Signed-off-by: Gargi Sharma
    Reviewed-by: Rik van Riel
    Tested-by: Tony Luck [ia64]
    Cc: Julia Lawall
    Cc: Ingo Molnar
    Cc: Pavel Tatashin
    Cc: Kirill Tkhai
    Cc: Oleg Nesterov
    Cc: Eric W. Biederman
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gargi Sharma
     
  • Patch series "Replacing PID bitmap implementation with IDR API", v4.

    This series replaces kernel bitmap implementation of PID allocation with
    IDR API. These patches are written to simplify the kernel by replacing
    custom code with calls to generic code.

    The following are the stats for pid and pid_namespace object files
    before and after the replacement. There is a noteworthy change between
    the IDR and bitmap implementation.

    Before
       text    data     bss     dec     hex filename
       8447    3894      64   12405    3075 kernel/pid.o
    After
       text    data     bss     dec     hex filename
       3397     304       0    3701     e75 kernel/pid.o

    Before
       text    data     bss     dec     hex filename
       5692    1842     192    7726    1e2e kernel/pid_namespace.o
    After
       text    data     bss     dec     hex filename
       2854     216      16    3086     c0e kernel/pid_namespace.o

    The following are the stats for ps, pstree and calling readdir on /proc
    for 10,000 processes.

    ps:
            With IDR API    With bitmap
    real    0m1.479s        0m2.319s
    user    0m0.070s        0m0.060s
    sys     0m0.289s        0m0.516s

    pstree:
            With IDR API    With bitmap
    real    0m1.024s        0m1.794s
    user    0m0.348s        0m0.612s
    sys     0m0.184s        0m0.264s

    proc:
            With IDR API    With bitmap
    real    0m0.059s        0m0.074s
    user    0m0.000s        0m0.004s
    sys     0m0.016s        0m0.016s

    This patch (of 2):

    Replace the current bitmap implementation for Process ID allocation.
    Functions that are no longer required, for example, free_pidmap(),
    alloc_pidmap(), etc. are removed. The rest of the functions are
    modified to use the IDR API. The change was made to make the PID
    allocation less complex by replacing custom code with calls to generic
    API.

    [gs051095@gmail.com: v6]
    Link: http://lkml.kernel.org/r/1507760379-21662-2-git-send-email-gs051095@gmail.com
    [avagin@openvz.org: restore the old behaviour of the ns_last_pid sysctl]
    Link: http://lkml.kernel.org/r/20171106183144.16368-1-avagin@openvz.org
    Link: http://lkml.kernel.org/r/1507583624-22146-2-git-send-email-gs051095@gmail.com
    Signed-off-by: Gargi Sharma
    Reviewed-by: Rik van Riel
    Acked-by: Oleg Nesterov
    Cc: Julia Lawall
    Cc: Ingo Molnar
    Cc: Pavel Tatashin
    Cc: Kirill Tkhai
    Cc: Eric W. Biederman
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gargi Sharma
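
For readers unfamiliar with cyclic id allocation, here is a small user-space model of the idea. It is not the IDR implementation: `idr_model`, `idr_model_alloc_cyclic` and `idr_model_remove` are hypothetical, the table is a fixed boolean array, and `end` is treated as exclusive.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define ID_LIMIT 8

struct idr_model {
	bool used[ID_LIMIT];
	int next;               /* where the next search starts */
};

/*
 * Return a free id in [start, end). The search begins at the id after the
 * most recently allocated one and wraps around to 'start' once, so freed
 * low numbers are not reused immediately.
 */
static int idr_model_alloc_cyclic(struct idr_model *idr, int start, int end)
{
	int span, first;

	assert(start >= 0 && start < end && end <= ID_LIMIT);
	span = end - start;
	first = idr->next > start ? idr->next : start;
	if (first >= end)
		first = start;

	for (int n = 0; n < span; n++) {
		int id = start + (first - start + n) % span;

		if (!idr->used[id]) {
			idr->used[id] = true;
			idr->next = id + 1;
			return id;
		}
	}
	return -ENOSPC;         /* every id in the range is taken */
}

/* Give an id back to the table. */
static void idr_model_remove(struct idr_model *idr, int id)
{
	assert(id >= 0 && id < ID_LIMIT);
	idr->used[id] = false;
}
```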
     

22 Aug, 2017

1 commit

  • This was reported many times, and this was even mentioned in commit
    52ee2dfdd4f5 ("pids: refactor vnr/nr_ns helpers to make them safe") but
    somehow nobody bothered to fix the obvious problem: task_tgid_nr_ns() is
    not safe because task->group_leader points to nowhere after the exiting
    task passes exit_notify(), rcu_read_lock() can not help.

    We really need to change __unhash_process() to nullify group_leader,
    parent, and real_parent, but this needs some cleanups. Until then we
    can turn task_tgid_nr_ns() into another user of __task_pid_nr_ns() and
    fix the problem.

    Reported-by: Troy Kensinger
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

03 Aug, 2017

1 commit

  • After commit 3d375d78593c ("mm: update callers to use HASH_ZERO flag"),
    drop unused pidhash_size in pidhash_init().

    Link: http://lkml.kernel.org/r/1500389267-49222-1-git-send-email-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang
    Reviewed-by: Pavel Tatashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kefeng Wang
     

07 Jul, 2017

1 commit

  • Update dcache, inode, pid, mountpoint, and mount hash tables to use
    HASH_ZERO, and remove initialization after allocations. In case of
    places where HASH_EARLY was used such as in __pv_init_lock_hash the
    zeroed hash table was already assumed, because memblock zeroes the
    memory.

    CPU: SPARC M6, Memory: 7T
    Before fix:
    Dentry cache hash table entries: 1073741824
    Inode-cache hash table entries: 536870912
    Mount-cache hash table entries: 16777216
    Mountpoint-cache hash table entries: 16777216
    ftrace: allocating 20414 entries in 40 pages
    Total time: 11.798s

    After fix:
    Dentry cache hash table entries: 1073741824
    Inode-cache hash table entries: 536870912
    Mount-cache hash table entries: 16777216
    Mountpoint-cache hash table entries: 16777216
    ftrace: allocating 20414 entries in 40 pages
    Total time: 3.198s

    CPU: Intel Xeon E5-2630, Memory: 2.2T:
    Before fix:
    Dentry cache hash table entries: 536870912
    Inode-cache hash table entries: 268435456
    Mount-cache hash table entries: 8388608
    Mountpoint-cache hash table entries: 8388608
    CPU: Physical Processor ID: 0
    Total time: 3.245s

    After fix:
    Dentry cache hash table entries: 536870912
    Inode-cache hash table entries: 268435456
    Mount-cache hash table entries: 8388608
    Mountpoint-cache hash table entries: 8388608
    CPU: Physical Processor ID: 0
    Total time: 3.244s

    Link: http://lkml.kernel.org/r/1488432825-92126-4-git-send-email-pasha.tatashin@oracle.com
    Signed-off-by: Pavel Tatashin
    Reviewed-by: Babu Moger
    Cc: David Miller
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin
     

09 May, 2017

1 commit

  • alloc_pidmap() advances pid_namespace::last_pid. When the first pid
    allocation fails, the next created process will have pid 2 and
    pid_ns_prepare_proc() won't be called. So, pid_namespace::proc_mnt will
    never be initialized (not to mention that there won't be a child
    reaper).

    I saw a crash stack from such a case on kernel 3.10:

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: proc_flush_task+0x8f/0x1b0
    Call Trace:
    release_task+0x3f/0x490
    wait_consider_task.part.10+0x7ff/0xb00
    do_wait+0x11f/0x280
    SyS_wait4+0x7d/0x110

    We may fix this by restoring last_pid to 0 or by prohibiting further
    allocations. Since there was a similar issue in Oleg Nesterov's commit
    314a8ad0f18a ("pidns: fix free_pid() to handle the first fork failure"),
    and it was fixed by prohibiting allocation, let's follow this way and
    do the same.

    Link: http://lkml.kernel.org/r/149201021004.4863.6762095011554287922.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Cyrill Gorcunov
    Cc: Andrei Vagin
    Cc: Andreas Gruenbacher
    Cc: Kees Cook
    Cc: Michael Kerrisk
    Cc: Al Viro
    Cc: Oleg Nesterov
    Cc: Paul Moore
    Cc: Eric Biederman
    Cc: Andy Lutomirski
    Cc: Ingo Molnar
    Cc: Serge Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     

02 Mar, 2017

1 commit


14 Jan, 2017

1 commit

  • Since we need to change the implementation, stop exposing internals.

    Provide KREF_INIT() to allow static initialization of struct kref.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Greg Kroah-Hartman
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

28 May, 2016

1 commit

  • Most users of IS_ERR_VALUE() in the kernel are wrong, as they
    pass an 'int' into a function that takes an 'unsigned long'
    argument. This happens to work because the type is sign-extended
    on 64-bit architectures before it gets converted into an
    unsigned type.

    However, anything that passes an 'unsigned short' or 'unsigned int'
    argument into IS_ERR_VALUE() is guaranteed to be broken, as are
    8-bit integers and types that are wider than 'unsigned long'.

    Andrzej Hajda has already fixed a lot of the worst abusers that
    were causing actual bugs, but it would be nice to prevent any
    users that are not passing 'unsigned long' arguments.

    This patch changes all users of IS_ERR_VALUE() that I could find
    on 32-bit ARM randconfig builds and x86 allmodconfig. For the
    moment, this doesn't change the definition of IS_ERR_VALUE()
    because there are probably still architecture specific users
    elsewhere.

    Almost all the warnings I got are for files that are better off
    using 'if (err)' or 'if (err < 0)'.
    The only legitimate user I could find that we get a warning for
    is the (32-bit only) freescale fman driver, so I did not remove
    the IS_ERR_VALUE() there but changed the type to 'unsigned long'.
    For 9pfs, I just worked around one user whose calling conventions
    are so obscure that I did not dare change the behavior.

    I was using this definition for testing:

    #define IS_ERR_VALUE(x) ((unsigned long*)NULL == (typeof (x)*)NULL && \
    unlikely((unsigned long long)(x) >= (unsigned long long)(typeof(x))-MAX_ERRNO))

    which ends up making all 16-bit or wider types work correctly with
    the most plausible interpretation of what IS_ERR_VALUE() was supposed
    to return according to its users, but also causes a compile-time
    warning for any users that do not pass an 'unsigned long' argument.

    I suggested this approach earlier this year, but back then we ended
    up deciding to just fix the users that are obviously broken. After
    the initial warning that caused me to get involved in the discussion
    (fs/gfs2/dir.c) showed up again in the mainline kernel, Linus
    asked me to send the whole thing again.

    [ Updated the 9p parts as per Al Viro - Linus ]

    Signed-off-by: Arnd Bergmann
    Cc: Andrzej Hajda
    Cc: Andrew Morton
    Link: https://lkml.org/lkml/2016/1/7/363
    Link: https://lkml.org/lkml/2016/5/27/486
    Acked-by: Srinivas Kandagatla # For nvmem part
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
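
The width problem this entry describes can be reproduced in user space. The macro below is a simplified stand-in (named `IS_ERR_VALUE_MODEL` to make clear it is not the kernel definition) that compares its argument, converted to `unsigned long`, against the top `MAX_ERRNO` values of that type.

```c
#include <errno.h>

#define MAX_ERRNO 4095

/* Simplified stand-in for the check; not the kernel's definition. */
#define IS_ERR_VALUE_MODEL(x) \
	((unsigned long)(x) >= (unsigned long)-MAX_ERRNO)

/* An int is sign-extended when widened, so a negative errno stays "high". */
static int detects_int(int err)
{
	return IS_ERR_VALUE_MODEL(err);
}

/*
 * An unsigned int is zero-extended when widened. Where unsigned long is
 * wider than unsigned int, the value lands far below the error range and
 * the check misses it.
 */
static int detects_uint(unsigned int err)
{
	return IS_ERR_VALUE_MODEL(err);
}

/* The argument type the check was designed for. */
static int detects_ulong(unsigned long err)
{
	return IS_ERR_VALUE_MODEL(err);
}
```

On a platform where `unsigned long` and `unsigned int` have the same width the second helper does detect the error, which is why such bugs can stay hidden on 32-bit builds.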
     

01 Feb, 2016

1 commit


30 Jan, 2016

1 commit

  • Accidentally discovered this typo when I studied this module.

    Signed-off-by: Zhen Lei
    Cc: Hanjun Guo
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Tianhong Ding
    Cc: Xinwei Hu
    Cc: Zefan Li
    Link: http://lkml.kernel.org/r/1454119457-11272-1-git-send-email-thunder.leizhen@huawei.com
    Signed-off-by: Ingo Molnar

    Zhen Lei
     

15 Jan, 2016

1 commit

  • Mark those kmem allocations that are known to be easily triggered from
    userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
    memcg. For the list, see below:

    - threadinfo
    - task_struct
    - task_delay_info
    - pid
    - cred
    - mm_struct
    - vm_area_struct and vm_region (nommu)
    - anon_vma and anon_vma_chain
    - signal_struct
    - sighand_struct
    - fs_struct
    - files_struct
    - fdtable and fdtable->full_fds_bits
    - dentry and external_name
    - inode for all filesystems. This is the most tedious part, because
    most filesystems overwrite the alloc_inode method.

    The list is far from complete, so feel free to add more objects.
    Nevertheless, it should be close to "account everything" approach and
    keep most workloads within bounds. Malevolent users will be able to
    breach the limit, but this was possible even with the former "account
    everything" approach (simply because it did not account everything in
    fact).

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

25 Nov, 2015

1 commit

  • I got a crash during a "perf top" session that was caused by a race in
    __task_pid_nr_ns() :

    pid_nr_ns() was inlined, but apparently compiler chose to read
    task->pids[type].pid twice, and the pid->level dereference crashed
    because we got a NULL pointer at the second read :

    if (pid && ns->level <= pid->level) { // CRASH

    Just use RCU API properly to solve this race, and not worry about "perf
    top" crashing hosts :(

    get_task_pid() can benefit from same fix.

    Signed-off-by: Eric Dumazet
    Signed-off-by: Linus Torvalds

    Eric Dumazet
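
The cure for a double read of a shared pointer is to load it once into a local and use only the local. The sketch below shows that shape with C11 atomics as a user-space analogue; it does not use the kernel's RCU API, the `*_model` names are hypothetical, and the pid is reduced to a level and a single number.

```c
#include <stdatomic.h>
#include <stddef.h>

struct pid_model {
	int level;      /* depth of the namespace that created the pid */
	int nr;
};

struct link_model {
	_Atomic(struct pid_model *) pid;   /* may be cleared concurrently */
};

/*
 * Load the shared pointer exactly once. The NULL check and the
 * dereference both use the same snapshot, so a concurrent clear between
 * them cannot produce a NULL dereference.
 */
static int pid_nr_model(struct link_model *link, int ns_level)
{
	struct pid_model *pid =
		atomic_load_explicit(&link->pid, memory_order_acquire);

	if (pid && ns_level <= pid->level)
		return pid->nr;
	return 0;
}
```

The snapshot only prevents the second read; keeping the pointed-to object alive while it is used is a separate concern that this model does not address.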
     

23 Jul, 2015

1 commit


17 Apr, 2015

1 commit

  • copy_process will report any failure in alloc_pid as ENOMEM currently
    which is misleading because the pid allocation might fail not only when
    the memory is short but also when the pid space is consumed already.

    The current man page even mentions this case:

    : EAGAIN
    :
    : A system-imposed limit on the number of threads was encountered.
    : There are a number of limits that may trigger this error: the
    : RLIMIT_NPROC soft resource limit (set via setrlimit(2)), which
    : limits the number of processes and threads for a real user ID, was
    : reached; the kernel's system-wide limit on the number of processes
    : and threads, /proc/sys/kernel/threads-max, was reached (see
    : proc(5)); or the maximum number of PIDs, /proc/sys/kernel/pid_max,
    : was reached (see proc(5)).

    so the current behavior is also incorrect wrt. documentation. The POSIX man
    page also suggests returning EAGAIN when the process count limit is reached.

    This patch simply propagates error code from alloc_pid and makes sure we
    return -EAGAIN due to reservation failure. This will make behavior of
    fork closer to both our documentation and POSIX.

    alloc_pid might also fail when the reaper in the pid namespace is dead
    (the namespace basically disallows all new processes) and there is no
    good error code which would match documented ones. We have traditionally
    returned ENOMEM for this case which is misleading as well but as per
    Eric W. Biederman this behavior is documented in man pid_namespaces(7)

    : If the "init" process of a PID namespace terminates, the kernel
    : terminates all of the processes in the namespace via a SIGKILL signal.
    : This behavior reflects the fact that the "init" process is essential for
    : the correct operation of a PID namespace. In this case, a subsequent
    : fork(2) into this PID namespace will fail with the error ENOMEM; it is
    : not possible to create a new processes in a PID namespace whose "init"
    : process has terminated.

    and introducing a new error code would be too risky so let's stick to
    ENOMEM for this case.

    Signed-off-by: Michal Hocko
    Cc: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
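
Propagating the allocator's own error code relies on encoding an errno value in the returned pointer. The following user-space sketch shows the idea with hypothetical helpers (`err_ptr`, `ptr_err`, `is_err`, `alloc_pid_model`, `copy_process_model`); the two failure conditions are passed in as plain parameters for illustration.

```c
#include <errno.h>
#include <stdint.h>

#define MAX_ERRNO 4095

struct pid_model {
	int nr;
};

/* Encode a negative errno value in a pointer, and decode it again. */
static void *err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

static long ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static int is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical allocator: reports why it failed instead of returning NULL. */
static struct pid_model *alloc_pid_model(struct pid_model *slot,
					 int free_numbers, int reaper_alive)
{
	if (!reaper_alive)
		return err_ptr(-ENOMEM);   /* namespace accepts no new tasks */
	if (free_numbers == 0)
		return err_ptr(-EAGAIN);   /* number space is exhausted */
	return slot;
}

/* Caller: pass the allocator's reason through rather than assuming ENOMEM. */
static int copy_process_model(struct pid_model *slot, int free_numbers,
			      int reaper_alive)
{
	struct pid_model *pid = alloc_pid_model(slot, free_numbers, reaper_alive);

	if (is_err(pid))
		return (int)ptr_err(pid);
	return 0;
}
```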
     

17 Dec, 2014

1 commit

  • Pull vfs pile #2 from Al Viro:
    "Next pile (and there'll be one or two more).

    The large piece in this one is getting rid of /proc/*/ns/* weirdness;
    among other things, it allows to (finally) make nameidata completely
    opaque outside of fs/namei.c, making for easier further cleanups in
    there"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    coda_venus_readdir(): use file_inode()
    fs/namei.c: fold link_path_walk() call into path_init()
    path_init(): don't bother with LOOKUP_PARENT in argument
    fs/namei.c: new helper (path_cleanup())
    path_init(): store the "base" pointer to file in nameidata itself
    make default ->i_fop have ->open() fail with ENXIO
    make nameidata completely opaque outside of fs/namei.c
    kill proc_ns completely
    take the targets of /proc/*/ns/* symlinks to separate fs
    bury struct proc_ns in fs/proc
    copy address of proc_ns_ops into ns_common
    new helpers: ns_alloc_inum/ns_free_inum
    make proc_ns_operations work with struct ns_common * instead of void *
    switch the rest of proc_ns_operations to working with &...->ns
    netns: switch ->get()/->put()/->install()/->inum() to working with &net->ns
    make mntns ->get()/->put()/->install()/->inum() work with &mnt_ns->ns
    common object embedded into various struct ....ns

    Linus Torvalds
     

11 Dec, 2014

1 commit

  • alloc_pid() does get_pid_ns() beforehand but forgets to put_pid_ns() if it
    fails because disable_pid_allocation() was called by the exiting
    child_reaper.

    We could simply move get_pid_ns() down to successful return, but this fix
    tries to be as trivial as possible.

    Signed-off-by: Oleg Nesterov
    Reviewed-by: "Eric W. Biederman"
    Cc: Aaron Tomlin
    Cc: Pavel Emelyanov
    Cc: Serge Hallyn
    Cc: Sterling Alexander
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
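
The leak and its fix follow a common shape: a reference taken at the top of a function must be dropped on every failure exit. A minimal sketch, with a hypothetical `ns_model` whose `refs` field stands in for the namespace reference count:

```c
#include <errno.h>

struct ns_model {
	int refs;                   /* reference count */
	int allocations_disabled;   /* set once the reaper is exiting */
};

/*
 * Take a reference up front. On the failure path, drop it again before
 * returning so the count is unchanged when the allocation does not happen.
 */
static int alloc_model(struct ns_model *ns)
{
	ns->refs++;                     /* take the reference */
	if (ns->allocations_disabled)
		goto out_put;
	return 0;                       /* success keeps the reference */

out_put:
	ns->refs--;                     /* the step that was missing */
	return -ENOMEM;
}
```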
     

05 Dec, 2014

2 commits


01 Oct, 2013

1 commit

  • "case 0" in free_pid() assumes that disable_pid_allocation() should
    clear PIDNS_HASH_ADDING before the last pid goes away.

    However this doesn't happen if the first fork() fails to create the
    child reaper which should call disable_pid_allocation().

    Signed-off-by: Oleg Nesterov
    Reviewed-by: "Eric W. Biederman"
    Cc: "Serge E. Hallyn"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

31 Aug, 2013

1 commit

  • Serge Hallyn writes:

    > Since commit af4b8a83add95ef40716401395b44a1b579965f4 it's been
    > possible to get into a situation where a pidns reaper is
    > <defunct>, reparented to host pid 1, but never reaped. How to
    > reproduce this is documented at
    >
    > https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1168526
    > (and see
    > https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1168526/comments/13)
    > In short, run repeated starts of a container whose init is
    >
    > Process.exit(0);
    >
    > sysrq-t when such a task is playing zombie shows:
    >
    > [ 131.132978] init x ffff88011fc14580 0 2084 2039 0x00000000
    > [ 131.132978] ffff880116e89ea8 0000000000000002 ffff880116e89fd8 0000000000014580
    > [ 131.132978] ffff880116e89fd8 0000000000014580 ffff8801172a0000 ffff8801172a0000
    > [ 131.132978] ffff8801172a0630 ffff88011729fff0 ffff880116e14650 ffff88011729fff0
    > [ 131.132978] Call Trace:
    > [ 131.132978] [] schedule+0x29/0x70
    > [ 131.132978] [] do_exit+0x6e1/0xa40
    > [ 131.132978] [] ? signal_wake_up_state+0x1e/0x30
    > [ 131.132978] [] do_group_exit+0x3f/0xa0
    > [ 131.132978] [] SyS_exit_group+0x14/0x20
    > [ 131.132978] [] tracesys+0xe1/0xe6
    >
    > Further debugging showed that every time this happened, zap_pid_ns_processes()
    > started with nr_hashed being 3, while we were expecting it to drop to 2.
    > Any time it didn't happen, nr_hashed was 1 or 2. So the reaper was
    > waiting for nr_hashed to become 2, but free_pid() only wakes the reaper
    > if nr_hashed hits 1.

    The issue is that when the task group leader of an init process exits
    before the other tasks of the init process, then when the init process
    finally exits it will be a secondary task sleeping in
    zap_pid_ns_processes and waiting to wake up when the number of hashed
    pids drops to two. This case waits forever, as free_pid only sends a
    wake up when the number of hashed pids drops to 1.

    To correct this, the simple strategy of sending a possibly unnecessary
    wake up when the number of hashed pids drops to 2 is adopted.

    Sending one extraneous wake up is relatively harmless; at worst we
    waste a little cpu time in the rare case when a pid namespace
    approaches exiting.

    We can detect the case when the pid namespace drops to just two pids
    hashed race free in free_pid.

    Dereferencing pid_ns->child_reaper with the pidmap_lock held is safe
    without the tasklist_lock because it is guaranteed that
    detach_pid will be called on the child_reaper before it is freed, and
    detach_pid calls __change_pid, which calls free_pid, which takes the
    pidmap_lock. __change_pid only calls free_pid if this is the
    last use of the pid. For a thread that is not the thread group leader,
    the thread's pid will only ever have one user because a thread's pid
    is not allowed to be the pid of a process, of a process group or
    a session. For a thread that is a thread group leader, all of
    the other threads of that process will be reaped before the thread
    group leader is allowed to be reaped, ensuring there will only
    be one user of the thread's pid as a process pid. Furthermore,
    because the thread is the init process of a pid namespace, all of the
    other processes in the pid namespace will have also been already freed,
    so the pid will not be used as a session pid or
    a process group pid for any other running process.

    CC: stable@vger.kernel.org
    Acked-by: Serge Hallyn
    Tested-by: Serge Hallyn
    Reported-by: Serge Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

04 Jul, 2013

2 commits

  • Move statement to static initialization of init_pid_ns.

    Signed-off-by: Raphael S. Carvalho
    Cc: "Eric W. Biederman"
    Acked-by: Serge Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Raphael S. Carvalho
     
  • copy_process() adds the new child to thread_group/init_task.tasks list and
    then does attach_pid(child, PIDTYPE_PID). This means that the lockless
    next_thread() or next_task() can see this thread with the wrong pid. Say,
    "ls /proc/pid/task" can list the same inode twice.

    We could move attach_pid(child, PIDTYPE_PID) up, but in this case
    find_task_by_vpid() can find the new thread before it was fully
    initialized.

    And this is already true for PIDTYPE_PGID/PIDTYPE_SID. With this patch
    copy_process() initializes child->pids[*].pid first, then calls
    attach_pid() to insert the task into the pid->tasks list.

    attach_pid() no longer needs the "struct pid*" argument; it is always
    called after pid_link->pid has already been set.
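
    The resulting two-step pattern can be sketched in user space as follows.
    This is an illustration only: the toy_* types and functions are
    hypothetical, and each pid tracks a single task per type for brevity,
    whereas the kernel keeps a list.

```c
#include <assert.h>
#include <stddef.h>

enum toy_pid_type {
    TOY_PIDTYPE_PID,
    TOY_PIDTYPE_PGID,
    TOY_PIDTYPE_SID,
    TOY_PIDTYPE_MAX
};

struct toy_task;

struct toy_pid {
    int nr;
    struct toy_task *tasks[TOY_PIDTYPE_MAX]; /* one attached task per type */
};

struct toy_task {
    struct toy_pid *pids[TOY_PIDTYPE_MAX];   /* set before attaching */
};

/* Step 1: record which pid the task will use for the given type. */
static void toy_init_task_pid(struct toy_task *t, enum toy_pid_type type,
                              struct toy_pid *pid)
{
    t->pids[type] = pid;
}

/*
 * Step 2: make the task visible through the pid. No pid argument is
 * needed, because the link was already initialized in step 1.
 */
static void toy_attach_pid(struct toy_task *t, enum toy_pid_type type)
{
    struct toy_pid *pid = t->pids[type];

    assert(pid != NULL); /* attaching before step 1 would be a bug */
    pid->tasks[type] = t;
}
```

    Because the task already points at its pid before it becomes reachable
    through that pid, a reader that finds the task never sees a missing or
    stale pid link.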

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Michal Hocko
    Cc: Pavel Emelyanov
    Cc: Sergey Dyasly
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

02 May, 2013

2 commits

  • Pull VFS updates from Al Viro:

    Misc cleanups all over the place, mainly wrt /proc interfaces (switch
    create_proc_entry to proc_create(), get rid of the deprecated
    create_proc_read_entry() in favor of using proc_create_data() and
    seq_file etc).

    7kloc removed.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
    don't bother with deferred freeing of fdtables
    proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
    proc: Make the PROC_I() and PDE() macros internal to procfs
    proc: Supply a function to remove a proc entry by PDE
    take cgroup_open() and cpuset_open() to fs/proc/base.c
    ppc: Clean up scanlog
    ppc: Clean up rtas_flash driver somewhat
    hostap: proc: Use remove_proc_subtree()
    drm: proc: Use remove_proc_subtree()
    drm: proc: Use minor->index to label things, not PDE->name
    drm: Constify drm_proc_list[]
    zoran: Don't print proc_dir_entry data in debug
    reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
    proc: Supply an accessor for getting the data from a PDE's parent
    airo: Use remove_proc_subtree()
    rtl8192u: Don't need to save device proc dir PDE
    rtl8187se: Use a dir under /proc/net/r8180/
    proc: Add proc_mkdir_data()
    proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
    proc: Move PDE_NET() to fs/proc/proc_net.c
    ...

    Linus Torvalds
     
  • Split the proc namespace stuff out into linux/proc_ns.h.

    Signed-off-by: David Howells
    cc: netdev@vger.kernel.org
    cc: Serge E. Hallyn
    cc: Eric W. Biederman
    Signed-off-by: Al Viro

    David Howells
     

01 May, 2013

2 commits

  • Move BITS_PER_PAGE from pid_namespace.c to pid_namespace.h, since we can
    simplify the define PID_MAP_ENTRIES by using the BITS_PER_PAGE.

    [akpm@linux-foundation.org: kernel/pid.c:54:1: warning: "BITS_PER_PAGE" redefined]
    Signed-off-by: Raphael S.Carvalho
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Raphael S.Carvalho
     
  • find_next_offset() searches for an available clear bit in the
    respective pid bitmap (page). It returns the offset if one is found;
    otherwise it returns a value equal to BITS_PER_PAGE.

    If find_next_offset() does not find an available bit, there is no
    purpose in calling mk_pid(); doing so only wastes CPU cycles.

    It is therefore better to call mk_pid() only after the check
    (offset < BITS_PER_PAGE) has succeeded. Another point: if the check
    (offset < BITS_PER_PAGE) fails, mk_pid() would be called again
    afterwards anyway.
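
    A minimal user-space sketch of that ordering, assuming a single 64-bit
    word stands in for a bitmap page. TOY_BITS_PER_PAGE and the toy_*
    functions are hypothetical stand-ins, not the kernel implementation.

```c
#include <assert.h>

#define TOY_BITS_PER_PAGE 64 /* assumed tiny "page" for illustration */

/* Return the first clear bit at or after start, or TOY_BITS_PER_PAGE. */
static int toy_find_next_offset(unsigned long long map, int start)
{
    for (int off = start; off < TOY_BITS_PER_PAGE; off++)
        if (!(map & (1ULL << off)))
            return off;
    return TOY_BITS_PER_PAGE;
}

/* Build a pid number from a page index and a bit offset. */
static int toy_mk_pid(int page_index, int offset)
{
    return page_index * TOY_BITS_PER_PAGE + offset;
}

/* Allocate a pid from one page; return -1 if the page has no free bit. */
static int toy_alloc_in_page(unsigned long long *map, int page_index,
                             int start)
{
    assert(start >= 0 && start <= TOY_BITS_PER_PAGE);

    int offset = toy_find_next_offset(*map, start);
    if (offset >= TOY_BITS_PER_PAGE)
        return -1;                 /* page full: toy_mk_pid() is skipped */

    *map |= 1ULL << offset;        /* claim the bit */
    return toy_mk_pid(page_index, offset);
}
```

    The pid number is computed only on the success path, so a full page
    costs one scan and nothing more.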

    [akpm@linux-foundation.org: simplify code]
    Signed-off-by: Raphael S. Carvalho
    Cc: "Eric W. Biederman"
    Cc: Serge Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Raphael S. Carvalho
     

28 Feb, 2013

1 commit

  • I'm not sure why, but the hlist for-each-entry iterators were conceived
    differently from the list ones. The list iterators take three parameters:

    list_for_each_entry(pos, head, member)

    The hlist ones were greedy and wanted an extra parameter:

    hlist_for_each_entry(tpos, pos, head, member)

    Why did they need an extra pos parameter? I'm not quite sure. Not only
    do they not really need it, it also prevents the iterator from looking
    exactly like the list iterator, which is unfortunate.
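
    To illustrate that the extra cursor is not needed, here is a user-space
    sketch of a three-parameter hlist-style iterator. The toy_* names are
    hypothetical, the node omits the kernel's pprev pointer, and the macro
    relies on the GNU __typeof__ extension (gcc or clang assumed).

```c
#include <assert.h>
#include <stddef.h>

struct toy_hnode { struct toy_hnode *next; };
struct toy_hhead { struct toy_hnode *first; };

struct toy_item {
    int value;
    struct toy_hnode link; /* embedded list node */
};

/* Map a node pointer back to its containing entry, or NULL at the end. */
#define toy_entry_or_null(ptr, type, member) \
    ((ptr) ? (type *)((char *)(ptr) - offsetof(type, member)) : NULL)

/* Three parameters only: the entry pointer doubles as the cursor. */
#define toy_hlist_for_each_entry(pos, head, member)                           \
    for (pos = toy_entry_or_null((head)->first, __typeof__(*(pos)), member);  \
         pos;                                                                 \
         pos = toy_entry_or_null((pos)->member.next, __typeof__(*(pos)),      \
                                 member))

static void toy_hlist_add_head(struct toy_hnode *n, struct toy_hhead *h)
{
    n->next = h->first;
    h->first = n;
}

/* Sum the values of all items on the list. */
static int toy_sum(struct toy_hhead *head)
{
    struct toy_item *it;
    int sum = 0;

    assert(head != NULL);
    toy_hlist_for_each_entry(it, head, link)
        sum += it->value;
    return sum;
}
```

    The call site now has the same shape as a list_for_each_entry() loop:
    an entry pointer, a head and a member name.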

    Besides the semantic patch, there was some manual work required:

    - Fix up the actual hlist iterators in linux/list.h
    - Fix up the declaration of other iterators based on the hlist ones.
    - A very small number of places were using the 'node' parameter; these
    were modified to use 'obj->member' instead.
    - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
    properly, so those had to be fixed up manually.

    The semantic patch which is mostly the work of Peter Senna Tschudin is here:

    @@
    iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

    type T;
    expression a,c,d,e;
    identifier b;
    statement S;
    @@

    -T b;

    [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
    [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
    [akpm@linux-foundation.org: checkpatch fixes]
    [akpm@linux-foundation.org: fix warnings]
    [akpm@linux-foudnation.org: redo intrusive kvm changes]
    Tested-by: Peter Senna Tschudin
    Acked-by: Paul E. McKenney
    Signed-off-by: Sasha Levin
    Cc: Wu Fengguang
    Cc: Marcelo Tosatti
    Cc: Gleb Natapov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

13 Feb, 2013

1 commit


26 Dec, 2012

1 commit

  • Oleg pointed out that in a pid namespace the sequence:
    - pid 1 becomes a zombie
    - setns(thepidns), fork,...
    - reaping pid 1.
    - The injected processes exiting.

    can lead to processes attempting to access their child reaper and
    instead following a stale pointer.

    That waitpid for init can return before all of the processes in
    the pid namespace have exited is also unfortunate.

    Avoid these problems by disabling the allocation of new pids in a pid
    namespace when init dies, instead of when the last process in a pid
    namespace is reaped.

    Pointed-out-by: Oleg Nesterov
    Reviewed-by: Oleg Nesterov
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman