09 Jul, 2019

3 commits

  • Pull scheduler updates from Ingo Molnar:

    - Remove the unused per rq load array and all its infrastructure, by
    Dietmar Eggemann.

    - Add utilization clamping support by Patrick Bellasi. This is a
    refinement of the energy aware scheduling framework with support for
    boosting of interactive and capping of background workloads: to make
    sure critical GUI threads get maximum frequency ASAP, and to make
    sure background processing doesn't unnecessarily move the cpufreq
    governor to higher frequencies and less energy efficient CPU modes.

    - Add the bare minimum of tracepoints required for LISA EAS regression
    testing, by Qais Yousef - which allows automated testing of various
    power management features, including energy aware scheduling.

    - Restructure the former tsk_nr_cpus_allowed() facility that the -rt
    kernel used to modify the scheduler's CPU affinity logic such as
    migrate_disable() - introduce the task->cpus_ptr value instead of
    taking the address of task->cpus_allowed directly - by Sebastian
    Andrzej Siewior.

    - Misc optimizations, fixes, cleanups and small enhancements - see the
    Git log for details.

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
    sched/uclamp: Add uclamp support to energy_compute()
    sched/uclamp: Add uclamp_util_with()
    sched/cpufreq, sched/uclamp: Add clamps for FAIR and RT tasks
    sched/uclamp: Set default clamps for RT tasks
    sched/uclamp: Reset uclamp values on RESET_ON_FORK
    sched/uclamp: Extend sched_setattr() to support utilization clamping
    sched/core: Allow sched_setattr() to use the current policy
    sched/uclamp: Add system default clamps
    sched/uclamp: Enforce last task's UCLAMP_MAX
    sched/uclamp: Add bucket local max tracking
    sched/uclamp: Add CPU's clamp buckets refcounting
    sched/fair: Rename weighted_cpuload() to cpu_runnable_load()
    sched/debug: Export the newly added tracepoints
    sched/debug: Add sched_overutilized tracepoint
    sched/debug: Add new tracepoint to track PELT at se level
    sched/debug: Add new tracepoints to track PELT at rq level
    sched/debug: Add a new sched_trace_*() helper functions
    sched/autogroup: Make autogroup_path() always available
    sched/wait: Deduplicate code with do-while
    sched/topology: Remove unused 'sd' parameter from arch_scale_cpu_capacity()
    ...

    Linus Torvalds
     
  • Pull locking updates from Ingo Molnar:
    "The main changes in this cycle are:

    - rwsem scalability improvements, phase #2, by Waiman Long, which are
    rather impressive:

    "On a 2-socket 40-core 80-thread Skylake system with 40 reader
    and writer locking threads, the min/mean/max locking operations
    done in a 5-second testing window before the patchset were:

    40 readers, Iterations Min/Mean/Max = 1,807/1,808/1,810
    40 writers, Iterations Min/Mean/Max = 1,807/50,344/151,255

    After the patchset, they became:

    40 readers, Iterations Min/Mean/Max = 30,057/31,359/32,741
    40 writers, Iterations Min/Mean/Max = 94,466/95,845/97,098"

    There's a lot of changes to the locking implementation that makes
    it similar to qrwlock, including owner handoff for more fair
    locking.

    Another microbenchmark shows how across the spectrum the
    improvements are:

    "With a locking microbenchmark running on 5.1 based kernel, the
    total locking rates (in kops/s) on a 2-socket Skylake system
    with equal numbers of readers and writers (mixed) before and
    after this patchset were:

    # of Threads   Before Patch   After Patch
    ------------   ------------   -----------
               2          2,618         4,193
               4          1,202         3,726
               8            802         3,622
              16            729         3,359
              32            319         2,826
              64            102         2,744"

    The changes are extensive and the patch-set has been through
    several iterations addressing various locking workloads. There
    might be more regressions, but unless they are pathological I
    believe we want to use this new implementation as the baseline
    going forward.

    - jump-label optimizations by Daniel Bristot de Oliveira: the primary
    motivation was to remove IPI disturbance of isolated RT-workload
    CPUs, which resulted in the implementation of batched jump-label
    updates. Beyond improving the kernel's real-time characteristics,
    in one test this patchset reduced static key update overhead from
    57 msecs to just 1.4 msecs - which is a nice speedup as well.

    - atomic64_t cross-arch type cleanups by Mark Rutland: over the last
    ~10 years of atomic64_t existence the various types used by the
    APIs only had to be self-consistent within each architecture -
    which means they became wildly inconsistent across architectures.
    Mark puts an end to this by reworking all the atomic64
    implementations to use 's64' as the base type for atomic64_t, and
    to ensure that this type is consistently used for parameters and
    return values in the API, avoiding further problems in this area.

    - A large set of small improvements to lockdep by Yuyang Du: type
    cleanups, output cleanups, function return type and other cleanups
    all around the place.

    - A set of percpu ops cleanups and fixes by Peter Zijlstra.

    - Misc other changes - please see the Git log for more details"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (82 commits)
    locking/lockdep: increase size of counters for lockdep statistics
    locking/atomics: Use sed(1) instead of non-standard head(1) option
    locking/lockdep: Move mark_lock() inside CONFIG_TRACE_IRQFLAGS && CONFIG_PROVE_LOCKING
    x86/jump_label: Make tp_vec_nr static
    x86/percpu: Optimize raw_cpu_xchg()
    x86/percpu, sched/fair: Avoid local_clock()
    x86/percpu, x86/irq: Relax {set,get}_irq_regs()
    x86/percpu: Relax smp_processor_id()
    x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}()
    locking/rwsem: Guard against making count negative
    locking/rwsem: Adaptive disabling of reader optimistic spinning
    locking/rwsem: Enable time-based spinning on reader-owned rwsem
    locking/rwsem: Make rwsem->owner an atomic_long_t
    locking/rwsem: Enable readers spinning on writer
    locking/rwsem: Clarify usage of owner's nonspinnable bit
    locking/rwsem: Wake up almost all readers in wait queue
    locking/rwsem: More optimal RT task handling of null owner
    locking/rwsem: Always release wait_lock before waking up tasks
    locking/rwsem: Implement lock handoff to prevent lock starvation
    locking/rwsem: Make rwsem_spin_on_owner() return owner state
    ...

    Linus Torvalds
     
  • Pull timer updates from Thomas Gleixner:
    "The timer and timekeeping departement delivers:

    Core:

    - The consolidation of the VDSO code into a generic library including
    the conversion of x86 and ARM64. Conversions of ARM and MIPS are en
    route through the relevant maintainer trees and should end up in
    5.4.

    This gets rid of the unnecessary different copies of the same code
    and brings all architectures on the same level of VDSO
    functionality.

    - Make the NTP user space interface more robust by restricting the
    TAI offset to prevent undefined behaviour. Includes a selftest.

    - Validate user input in the compat settimeofday() syscall to catch
    invalid values which would be turned into valid values by a
    multiplication overflow

    - Consolidate the time accessors

    - Small fixes, improvements and cleanups all over the place

    Drivers:

    - Support for the NXP system counter, TI davinci timer

    - Move the Microsoft HyperV clocksource/events code into the
    drivers/clocksource directory so it can be shared between x86 and
    ARM64.

    - Overhaul of the Tegra driver

    - Delay timer support for IXP4xx

    - Small fixes, improvements and cleanups as usual"

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits)
    time: Validate user input in compat_settimeofday()
    timer: Document TIMER_PINNED
    clocksource/drivers: Continue making Hyper-V clocksource ISA agnostic
    clocksource/drivers: Make Hyper-V clocksource ISA agnostic
    MAINTAINERS: Fix Andy's surname and the directory entries of VDSO
    hrtimer: Use a bullet for the returns bullet list
    arm64: vdso: Fix compilation with clang older than 8
    arm64: compat: Fix __arch_get_hw_counter() implementation
    arm64: Fix __arch_get_hw_counter() implementation
    lib/vdso: Make delta calculation work correctly
    MAINTAINERS: Add entry for the generic VDSO library
    arm64: compat: No need for pre-ARMv7 barriers on an ARMv8 system
    arm64: vdso: Remove unnecessary asm-offsets.c definitions
    vdso: Remove superfluous #ifdef __KERNEL__ in vdso/datapage.h
    clocksource/drivers/davinci: Add support for clocksource
    clocksource/drivers/davinci: Add support for clockevents
    clocksource/drivers/tegra: Set up maximum-ticks limit properly
    clocksource/drivers/tegra: Cycles can't be 0
    clocksource/drivers/tegra: Restore base address before cleanup
    clocksource/drivers/tegra: Add verbose definition for 1MHz constant
    ...

    Linus Torvalds
     

01 Jul, 2019

1 commit

  • Make sure to return a proper negative error code from copy_process()
    when anon_inode_getfile() fails with CLONE_PIDFD.
    Otherwise _do_fork() will not detect an error and get_task_pid() will
    operate on a nonsensical pointer:

    R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006dbc2c
    R13: 00007ffc15fbb0ff R14: 00007ff07e47e9c0 R15: 0000000000000000
    kasan: CONFIG_KASAN_INLINE enabled
    kasan: GPF could be caused by NULL-ptr deref or user memory access
    general protection fault: 0000 [#1] PREEMPT SMP KASAN
    CPU: 1 PID: 7990 Comm: syz-executor290 Not tainted 5.2.0-rc6+ #9
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
    Google 01/01/2011
    RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline]
    RIP: 0010:get_task_pid+0xe1/0x210 kernel/pid.c:372
    Code: 89 ff e8 62 27 5f 00 49 8b 07 44 89 f1 4c 8d bc c8 90 01 00 00 eb 0c
    e8 0d fe 25 00 49 81 c7 38 05 00 00 4c 89 f8 48 c1 e8 03 3c 18 00 74
    08 4c 89 ff e8 31 27 5f 00 4d 8b 37 e8 f9 47 12 00
    RSP: 0018:ffff88808a4a7d78 EFLAGS: 00010203
    RAX: 00000000000000a7 RBX: dffffc0000000000 RCX: ffff888088180600
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
    RBP: ffff88808a4a7d90 R08: ffffffff814fb3a8 R09: ffffed1015d66bf8
    R10: ffffed1015d66bf8 R11: 1ffff11015d66bf7 R12: 0000000000041ffc
    R13: 1ffff11011494fbc R14: 0000000000000000 R15: 000000000000053d
    FS: 00007ff07e47e700(0000) GS:ffff8880aeb00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00000000004b5100 CR3: 0000000094df2000 CR4: 00000000001406e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    _do_fork+0x1b9/0x5f0 kernel/fork.c:2360
    __do_sys_clone kernel/fork.c:2454 [inline]
    __se_sys_clone kernel/fork.c:2448 [inline]
    __x64_sys_clone+0xc1/0xd0 kernel/fork.c:2448
    do_syscall_64+0xfe/0x140 arch/x86/entry/common.c:301
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Link: https://lore.kernel.org/lkml/000000000000e0dc0d058c9e7142@google.com
    Reported-and-tested-by: syzbot+002e636502bc4b64eb5c@syzkaller.appspotmail.com
    Fixes: 6fd2fe494b17 ("copy_process(): don't use ksys_close() on cleanups")
    Cc: Jann Horn
    Cc: Al Viro
    Signed-off-by: Christian Brauner

    Christian Brauner
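    The shape of the fix, as a hedged sketch (the identifiers pidfile, pidfd,
    retval and the bad_fork_free_pid label follow the surrounding copy_process()
    code; treat the exact spelling as an assumption):

        if (clone_flags & CLONE_PIDFD) {
                pidfile = anon_inode_getfile("[pidfd]", &pidfd_fops, pid,
                                             O_RDWR | O_CLOEXEC);
                if (IS_ERR(pidfile)) {
                        put_unused_fd(pidfd);
                        /* Propagate a negative error code instead of leaving
                         * retval at 0, which _do_fork() would treat as success. */
                        retval = PTR_ERR(pidfile);
                        goto bad_fork_free_pid;
                }
        }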
     

29 Jun, 2019

1 commit

  • Commit 5eed6f1dff87 ("fork,memcg: fix crash in free_thread_stack on
    memcg charge fail") corrected two instances, but there was a third
    instance of this bug.

    Without setting tsk->stack, if memcg_charge_kernel_stack fails, it'll
    execute free_thread_stack() on a dangling pointer.

    Enterprise kernels are compiled with VMAP_STACK=y so this isn't
    critical, but custom VMAP_STACK=n builds should have some performance
    advantage, with the drawback of risking fork failures when compaction
    doesn't succeed. So as long as VMAP_STACK=n is a supported option it's
    worth fixing upstream.

    Link: http://lkml.kernel.org/r/20190619011450.28048-1-aarcange@redhat.com
    Fixes: 9b6f7e163cd0 ("mm: rework memcg kernel stack accounting")
    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Rik van Riel
    Acked-by: Roman Gushchin
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

27 Jun, 2019

1 commit

  • anon_inode_getfd() should be used *ONLY* in situations when we are
    guaranteed to be past the last failure point (including copying the
    descriptor number to userland, at that). And ksys_close() should
    not be used for cleanups at all.

    anon_inode_getfile() is there for all nontrivial cases like that.
    Just use that...

    Fixes: b3e583825266 ("clone: add CLONE_PIDFD")
    Signed-off-by: Al Viro
    Reviewed-by: Jann Horn
    Signed-off-by: Christian Brauner

    Al Viro
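    A hedged sketch of the pattern advocated above: reserve the descriptor and
    create the file separately, and only publish the fd with fd_install() once no
    further failure is possible (example_fops and priv are illustrative names):

        int fd;
        struct file *file;

        fd = get_unused_fd_flags(O_CLOEXEC);
        if (fd < 0)
                return fd;

        file = anon_inode_getfile("[example]", &example_fops, priv,
                                  O_RDWR | O_CLOEXEC);
        if (IS_ERR(file)) {
                put_unused_fd(fd);      /* cheap cleanup, no ksys_close() needed */
                return PTR_ERR(file);
        }

        /* ... remaining failure points clean up with fput()/put_unused_fd() ... */

        fd_install(fd, file);           /* past the last failure point */
        return fd;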
     

24 Jun, 2019

1 commit

  • Give userspace a cheap and reliable way to tell whether CLONE_PIDFD is
    supported by the kernel or not. The easiest way is to pass an invalid
    file descriptor value in parent_tidptr, perform the syscall and verify
    that parent_tidptr has been changed to a valid file descriptor value.

    CLONE_PIDFD uses parent_tidptr to return pidfds. CLONE_PARENT_SETTID
    will use parent_tidptr to return the tid of the parent. The two flags
    cannot be used together. Old kernels that only support
    CLONE_PARENT_SETTID will not verify the value pointed to by
    parent_tidptr. This behavior is unchanged even with the introduction of
    CLONE_PIDFD.
    However, if CLONE_PIDFD is specified the kernel will currently check the
    value pointed to by parent_tidptr before placing the pidfd in the memory
    pointed to. EINVAL will be returned if the value in parent_tidptr is not
    0.

    If CLONE_PIDFD is supported and fd 0 is closed, then the returned pidfd
    can and likely will be 0 and parent_tidptr will be unchanged. This means
    userspace must either check CLONE_PIDFD support beforehand or check that
    fd 0 is not closed when invoking CLONE_PIDFD.

    The check for pidfd == 0 was introduced during the v5.2 merge window by
    commit b3e583825266 ("clone: add CLONE_PIDFD") to ensure that
    CLONE_PIDFD could be potentially extended by passing in flags through
    the return argument.

    However, that extension would look horrible, and with the upcoming
    introduction of the clone3 syscall in v5.3 there is no need to extend
    legacy clone syscall this way. (Even if it would need to be extended,
    CLONE_DETACHED can be reused with CLONE_PIDFD.)

    So remove the pidfd == 0 check. Userspace that needs to be portable to
    kernels without CLONE_PIDFD support can then be advised to initialize
    pidfd to -1 and check the pidfd value returned by CLONE_PIDFD.

    Fixes: b3e583825266 ("clone: add CLONE_PIDFD")
    Signed-off-by: Dmitry V. Levin
    Signed-off-by: Christian Brauner

    Dmitry V. Levin
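    A minimal user-space sketch of the detection recipe above. It assumes the
    x86_64 raw clone() argument order (flags, child_stack, parent_tidptr,
    child_tidptr, tls) and defines CLONE_PIDFD itself in case the installed
    headers predate v5.2:

        #define _GNU_SOURCE
        #include <sched.h>
        #include <signal.h>
        #include <stdio.h>
        #include <sys/syscall.h>
        #include <sys/wait.h>
        #include <unistd.h>

        #ifndef CLONE_PIDFD
        #define CLONE_PIDFD 0x00001000
        #endif

        int main(void)
        {
                int pidfd = -1;         /* invalid fd value, per the advice above */
                pid_t pid;

                pid = syscall(SYS_clone, CLONE_PIDFD | SIGCHLD, 0, &pidfd, 0, 0);
                if (pid == 0)
                        _exit(0);       /* child */
                if (pid < 0)
                        return 1;
                waitpid(pid, NULL, 0);

                /* On pre-CLONE_PIDFD kernels parent_tidptr is ignored, so pidfd
                 * is still -1; on newer kernels it now holds a valid fd. */
                printf("CLONE_PIDFD %s (pidfd = %d)\n",
                       pidfd >= 0 ? "supported" : "not supported", pidfd);
                if (pidfd >= 0)
                        close(pidfd);
                return 0;
        }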
     

22 Jun, 2019

1 commit


03 Jun, 2019

2 commits

  • Despite the fact that there is a lockdep_init_task() which does nothing,
    lockdep initializes tasks by assigning lockdep fields and does so
    inconsistently. Fix this by using lockdep_init_task().

    Signed-off-by: Yuyang Du
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: bvanassche@acm.org
    Cc: frederic@kernel.org
    Cc: ming.lei@redhat.com
    Cc: will.deacon@arm.com
    Link: https://lkml.kernel.org/r/20190506081939.74287-8-duyuyang@gmail.com
    Signed-off-by: Ingo Molnar

    Yuyang Du
     
  • In commit:

    4b53a3412d66 ("sched/core: Remove the tsk_nr_cpus_allowed() wrapper")

    the tsk_nr_cpus_allowed() wrapper was removed. There was not
    much difference in !RT but in RT we used this to implement
    migrate_disable(). Within a migrate_disable() section the CPU mask is
    restricted to a single CPU while the "normal" CPU mask remains untouched.

    As an alternative implementation Ingo suggested to use:

        struct task_struct {
                const cpumask_t         *cpus_ptr;
                cpumask_t               cpus_mask;
        };

    with

        t->cpus_ptr = &t->cpus_mask;

    In -RT we then can switch the cpus_ptr to:

        t->cpus_ptr = &cpumask_of(task_cpu(p));

    in a migration disabled region. The rules are simple:

    - Code that 'uses' ->cpus_allowed would use the pointer.
    - Code that 'modifies' ->cpus_allowed would use the direct mask.

    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Thomas Gleixner
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Link: https://lkml.kernel.org/r/20190423142636.14347-1-bigeasy@linutronix.de
    Signed-off-by: Ingo Molnar

    Sebastian Andrzej Siewior
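    A brief sketch of the resulting access rule, using the field names from the
    patch (new_mask and the helper call are illustrative):

        /* Code that 'uses' the affinity mask reads through the pointer ... */
        if (cpumask_test_cpu(task_cpu(p), p->cpus_ptr))
                queue_task_here(p);             /* illustrative helper */

        /* ... while code that 'modifies' affinity updates the real mask,
         * which cpus_ptr normally points at. */
        cpumask_copy(&p->cpus_mask, new_mask);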
     

02 Jun, 2019

1 commit

  • Fix a build warning:
    kernel/fork.c:125:5: warning: symbol 'max_threads' was not declared. Should it be static?

    Link: http://lkml.kernel.org/r/20190516015118.140561-1-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang
    Reported-by: Hulk Robot
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kefeng Wang
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

15 May, 2019

2 commits

  • The name clear_all_latency_tracing is misleading: in fact it only
    clears the per-task latency_record[], and we already have another
    function named clear_global_latency_tracing which clears the global
    latency_record[] buffer.

    Link: http://lkml.kernel.org/r/20190226114602.16902-1-linf@wangsu.com
    Signed-off-by: Lin Feng
    Cc: Alexey Dobriyan
    Cc: Fabian Frederick
    Cc: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lin Feng
     
  • The task structure is freed while get_mem_cgroup_from_mm() holds
    rcu_read_lock() and dereferences mm->owner.

    get_mem_cgroup_from_mm()                 failing fork()
    ------------------------                 --------------
    task = mm->owner
                                             mm->owner = NULL;
                                             free(task)
    if (task) *task; /* use after free */

    The fix consists in freeing the task with RCU also in the fork failure
    case, exactly like it always happens for the regular exit(2) path. That
    is enough to make the rcu_read_lock hold in get_mem_cgroup_from_mm()
    (left side above) effective to avoid a use after free when dereferencing
    the task structure.

    An alternate possible fix would be to defer the delivery of the
    userfaultfd contexts to the monitor until after fork() is guaranteed to
    succeed. Such a change would require more changes because it would
    create a strict ordering dependency where the uffd methods would need to
    be called beyond the last potentially failing branch in order to be
    safe. This solution, by contrast, only adds a dependency to common code:
    set mm->owner to NULL and free the task struct that was pointed to by
    mm->owner with RCU, if fork ends up failing. The userfaultfd methods
    can still be called anywhere during the fork runtime and the monitor
    will keep discarding orphaned "mm" coming from failed forks in userland.

    This race condition couldn't trigger if CONFIG_MEMCG was set =n at build
    time.

    [aarcange@redhat.com: improve changelog, reduce #ifdefs per Michal]
    Link: http://lkml.kernel.org/r/20190429035752.4508-1-aarcange@redhat.com
    Link: http://lkml.kernel.org/r/20190325225636.11635-2-aarcange@redhat.com
    Fixes: 893e26e61d04 ("userfaultfd: non-cooperative: Add fork() event")
    Signed-off-by: Andrea Arcangeli
    Tested-by: zhong jiang
    Reported-by: syzbot+cbb52e396df3e565ab02@syzkaller.appspotmail.com
    Cc: Oleg Nesterov
    Cc: Jann Horn
    Cc: Hugh Dickins
    Cc: Mike Rapoport
    Cc: Mike Kravetz
    Cc: Peter Xu
    Cc: Jason Gunthorpe
    Cc: "Kirill A . Shutemov"
    Cc: Michal Hocko
    Cc: zhong jiang
    Cc: syzbot+cbb52e396df3e565ab02@syzkaller.appspotmail.com
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

10 May, 2019

2 commits

  • Avoid calling cgroup_threadgroup_change_end() without having called
    cgroup_threadgroup_change_begin() first.

    During process creation we need to check whether the cgroup we are in
    allows us to fork. To perform this check the cgroup needs to guard itself
    against threadgroup changes and takes a lock.
    Prior to CLONE_PIDFD the cleanup target "bad_fork_free_pid" would also need
    to call cgroup_threadgroup_change_end() because said lock had already been
    taken.
    However, this is not the case anymore with the addition of CLONE_PIDFD. We
    are now allocating a pidfd before we check whether the cgroup we're in can
    fork and thus prior to taking the lock. So when copy_process() fails at the
    right step it would release a lock we haven't taken.
    This bug is not even very subtle to be honest. It's just not very clear
    from the naming of cgroup_threadgroup_change_{begin,end}() that a lock is
    taken.

    Here's the relevant splat:

    entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139
    RIP: 0023:0xf7fec849
    Code: 85 d2 74 02 89 0a 5b 5d c3 8b 04 24 c3 8b 14 24 c3 8b 3c 24 c3 90 90
    90 90 90 90 90 90 90 90 90 90 51 52 55 89 e5 0f 34 cd 80 5a 59 c3 90
    90 90 90 eb 0d 90 90 90 90 90 90 90 90 90 90 90 90
    RSP: 002b:00000000ffed5a8c EFLAGS: 00000246 ORIG_RAX: 0000000000000078
    RAX: ffffffffffffffda RBX: 0000000000003ffc RCX: 0000000000000000
    RDX: 00000000200005c0 RSI: 0000000000000000 RDI: 0000000000000000
    RBP: 0000000000000012 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
    R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
    ------------[ cut here ]------------
    DEBUG_LOCKS_WARN_ON(depth 0b 4c 8b 85
    70 ff ff ff 4c 89 ea 4c 89 e6 4c 89 c7 e8 52 63 ff
    RSP: 0018:ffff888094117b48 EFLAGS: 00010086
    RAX: 0000000000000000 RBX: 1ffff11012822f6f RCX: 0000000000000000
    RDX: 0000000000000000 RSI: ffffffff815af236 RDI: ffffed1012822f5b
    RBP: ffff888094117c00 R08: ffff888092bfc400 R09: fffffbfff113301d
    R10: fffffbfff113301c R11: ffffffff889980e3 R12: ffffffff8a451df8
    R13: ffffffff8142e71f R14: ffffffff8a44cc80 R15: ffff888094117bd8
    percpu_up_read.constprop.0+0xcb/0x110 include/linux/percpu-rwsem.h:92
    cgroup_threadgroup_change_end include/linux/cgroup-defs.h:712 [inline]
    copy_process.part.0+0x47ff/0x6710 kernel/fork.c:2222
    copy_process kernel/fork.c:1772 [inline]
    _do_fork+0x25d/0xfd0 kernel/fork.c:2338
    __do_compat_sys_x86_clone arch/x86/ia32/sys_ia32.c:240 [inline]
    __se_compat_sys_x86_clone arch/x86/ia32/sys_ia32.c:236 [inline]
    __ia32_compat_sys_x86_clone+0xbc/0x140 arch/x86/ia32/sys_ia32.c:236
    do_syscall_32_irqs_on arch/x86/entry/common.c:334 [inline]
    do_fast_syscall_32+0x281/0xd54 arch/x86/entry/common.c:405
    entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139
    RIP: 0023:0xf7fec849
    Code: 85 d2 74 02 89 0a 5b 5d c3 8b 04 24 c3 8b 14 24 c3 8b 3c 24 c3 90 90
    90 90 90 90 90 90 90 90 90 90 51 52 55 89 e5 0f 34 cd 80 5a 59 c3 90
    90 90 90 eb 0d 90 90 90 90 90 90 90 90 90 90 90 90
    RSP: 002b:00000000ffed5a8c EFLAGS: 00000246 ORIG_RAX: 0000000000000078
    RAX: ffffffffffffffda RBX: 0000000000003ffc RCX: 0000000000000000
    RDX: 00000000200005c0 RSI: 0000000000000000 RDI: 0000000000000000
    RBP: 0000000000000012 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
    R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
    Kernel Offset: disabled
    Rebooting in 86400 seconds..

    Reported-and-tested-by: syzbot+3286e58549edc479faae@syzkaller.appspotmail.com
    Fixes: b3e583825266 ("clone: add CLONE_PIDFD")
    Signed-off-by: Christian Brauner

    Christian Brauner
     
  • Pull cgroup updates from Tejun Heo:
    "This includes Roman's cgroup2 freezer implementation.

    It's a separate mechanism from the cgroup1 freezer. Instead of blocking
    user tasks in arbitrary uninterruptible sleeps, the new implementation
    extends jobctl stop - frozen tasks are trapped in jobctl stop until
    thawed and can be killed and ptraced. Lots of thanks to Oleg for
    shepherding the effort.

    Other than that, there are a few trivial changes"

    * 'for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: never call do_group_exit() with task->frozen bit set
    kernel: cgroup: fix misuse of %x
    cgroup: get rid of cgroup_freezer_frozen_exit()
    cgroup: prevent spurious transition into non-frozen state
    cgroup: Remove unused cgrp variable
    cgroup: document cgroup v2 freezer interface
    cgroup: add tracing points for cgroup v2 freezer
    cgroup: make TRACE_CGROUP_PATH irq-safe
    kselftests: cgroup: add freezer controller self-tests
    kselftests: cgroup: don't fail on cg_kill_all() error in cg_destroy()
    cgroup: cgroup v2 freezer
    cgroup: protect cgroup->nr_(dying_)descendants by css_set_lock
    cgroup: implement __cgroup_task_count() helper
    cgroup: rename freezer.c into legacy_freezer.c
    cgroup: remove extra cgroup_migrate_finish() call

    Linus Torvalds
     

08 May, 2019

1 commit

  • Pull pidfd updates from Christian Brauner:
    "This patchset makes it possible to retrieve pidfds at process creation
    time by introducing the new flag CLONE_PIDFD to the clone() system
    call. Linus originally suggested to implement this as a new flag to
    clone() instead of making it a separate system call.

    After a thorough review from Oleg CLONE_PIDFD returns pidfds in the
    parent_tidptr argument. This means we can give back the associated pid
    and the pidfd at the same time. Access to process metadata information
    thus becomes rather trivial.

    As has been agreed, CLONE_PIDFD creates file descriptors based on
    anonymous inodes similar to the new mount api. They are made
    unconditional by this patchset as they are now needed by core kernel
    code (vfs, pidfd) even more than they already were before (timerfd,
    signalfd, io_uring, epoll etc.). The core patchset is rather small.
    The bulky looking changelist is caused by David's very simple changes
    to Kconfig to make anon inodes unconditional.

    A pidfd comes with additional information in fdinfo if the kernel
    supports procfs. The fdinfo file contains the pid of the process in
    the callers pid namespace in the same format as the procfs status
    file, i.e. "Pid:\t%d".

    To remove worries about missing metadata access this patchset comes
    with a sample/test program that illustrates how a combination of
    CLONE_PIDFD and pidfd_send_signal() can be used to gain race-free
    access to process metadata through /proc/.

    Further work based on this patchset has been done by Joel. His work
    makes pidfds pollable. It finished too late for this merge window. I
    would prefer to have it sitting in linux-next for a while and send it
    for inclusion during the 5.3 merge window"

    * tag 'pidfd-v5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    samples: show race-free pidfd metadata access
    signal: support CLONE_PIDFD with pidfd_send_signal
    clone: add CLONE_PIDFD
    Make anon_inodes unconditional

    Linus Torvalds
     

07 May, 2019

1 commit

  • This patchset makes it possible to retrieve pid file descriptors at
    process creation time by introducing the new flag CLONE_PIDFD to the
    clone() system call. Linus originally suggested to implement this as a
    new flag to clone() instead of making it a separate system call. As
    spotted by Linus, there is exactly one bit for clone() left.

    CLONE_PIDFD creates file descriptors based on the anonymous inode
    implementation in the kernel that will also be used to implement the new
    mount api. They serve as a simple opaque handle on pids. Logically,
    this makes it possible to interpret a pidfd differently, narrowing or
    widening the scope of various operations (e.g. signal sending). Thus, a
    pidfd cannot just refer to a tgid, but also a tid, or in theory - given
    appropriate flag arguments in relevant syscalls - a process group or
    session. A pidfd does not represent a privilege. This does not imply it
    cannot ever be that way but for now this is not the case.

    A pidfd comes with additional information in fdinfo if the kernel supports
    procfs. The fdinfo file contains the pid of the process in the caller's
    pid namespace in the same format as the procfs status file, i.e. "Pid:\t%d".

    As suggested by Oleg, with CLONE_PIDFD the pidfd is returned in the
    parent_tidptr argument of clone. This has the advantage that we can
    give back the associated pid and the pidfd at the same time.

    To remove worries about missing metadata access this patchset comes with
    a sample program that illustrates how a combination of CLONE_PIDFD, and
    pidfd_send_signal() can be used to gain race-free access to process
    metadata through /proc/. The sample program can easily be
    translated into a helper that would be suitable for inclusion in libc so
    that users don't have to worry about writing it themselves.

    Suggested-by: Linus Torvalds
    Signed-off-by: Christian Brauner
    Co-developed-by: Jann Horn
    Signed-off-by: Jann Horn
    Reviewed-by: Oleg Nesterov
    Cc: Arnd Bergmann
    Cc: "Eric W. Biederman"
    Cc: Kees Cook
    Cc: Thomas Gleixner
    Cc: David Howells
    Cc: "Michael Kerrisk (man-pages)"
    Cc: Andy Lutomirsky
    Cc: Andrew Morton
    Cc: Aleksa Sarai
    Cc: Linus Torvalds
    Cc: Al Viro

    Christian Brauner
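    A user-space sketch of the race-free pattern the sample program demonstrates:
    get the pidfd at clone() time and signal through it. It assumes the x86_64
    clone() argument order and syscall number 424 for pidfd_send_signal() (v5.1+):

        #define _GNU_SOURCE
        #include <sched.h>
        #include <signal.h>
        #include <stdio.h>
        #include <sys/syscall.h>
        #include <sys/wait.h>
        #include <unistd.h>

        #ifndef CLONE_PIDFD
        #define CLONE_PIDFD 0x00001000
        #endif
        #ifndef __NR_pidfd_send_signal
        #define __NR_pidfd_send_signal 424      /* x86_64 */
        #endif

        int main(void)
        {
                int pidfd = -1;
                pid_t pid = syscall(SYS_clone, CLONE_PIDFD | SIGCHLD, 0, &pidfd, 0, 0);

                if (pid == 0) {                 /* child: wait to be signalled */
                        pause();
                        _exit(0);
                }
                if (pid < 0 || pidfd < 0)
                        return 1;

                /* Signal the child through its pidfd - immune to pid reuse. */
                if (syscall(__NR_pidfd_send_signal, pidfd, SIGTERM, NULL, 0) < 0)
                        perror("pidfd_send_signal");
                waitpid(pid, NULL, 0);
                close(pidfd);
                return 0;
        }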
     

30 Apr, 2019

2 commits

  • Provide a function for copying init_mm. This function will later be used
    for setting up a temporary mm.

    Tested-by: Masami Hiramatsu
    Signed-off-by: Nadav Amit
    Signed-off-by: Rick Edgecombe
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Masami Hiramatsu
    Cc:
    Cc:
    Cc:
    Cc:
    Cc:
    Cc:
    Cc:
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Dave Hansen
    Cc: H. Peter Anvin
    Cc: Kees Cook
    Cc: Linus Torvalds
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20190426001143.4983-6-namit@vmware.com
    Signed-off-by: Ingo Molnar

    Nadav Amit
     
  • In order to have a separate address space for text poking, we need to
    duplicate init_mm early during start_kernel(). This, however, introduces
    a problem since uprobes functions are called from dup_mmap(), but
    uprobes is still not initialized in this early stage.

    Since uprobes initialization is necessary for fork, and since all the
    dependent initialization has been done when fork is initialized (percpu
    and vmalloc), move uprobes initialization to fork_init(). It does not
    seem uprobes introduces any security problem for the poking_mm.

    Crash and burn if uprobes initialization fails, similarly to other early
    initializations. Change the init_probes() name to probes_init() to match
    the naming convention of other early initialization functions.

    Reported-by: kernel test robot
    Signed-off-by: Nadav Amit
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andy Lutomirski
    Cc: Arnaldo Carvalho de Melo
    Cc: Borislav Petkov
    Cc: Dave Hansen
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Rick Edgecombe
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Cc: akpm@linux-foundation.org
    Cc: ard.biesheuvel@linaro.org
    Cc: deneen.t.dock@intel.com
    Cc: kernel-hardening@lists.openwall.com
    Cc: kristen@linux.intel.com
    Cc: linux_dti@icloud.com
    Cc: will.deacon@arm.com
    Link: https://lkml.kernel.org/r/20190426232303.28381-6-nadav.amit@gmail.com
    Signed-off-by: Ingo Molnar

    Nadav Amit
     

20 Apr, 2019

1 commit

  • Cgroup v1 implements the freezer controller, which provides an ability
    to stop the workload in a cgroup and temporarily free up some
    resources (cpu, io, network bandwidth and, potentially, memory)
    for some other tasks. Cgroup v2 lacks this functionality.

    This patch implements freezer for cgroup v2.

    Cgroup v2 freezer tries to put tasks into a state similar to jobctl
    stop. This means that tasks can be killed, ptraced (using
    PTRACE_SEIZE*), and interrupted. It is possible to attach to
    a frozen task, get some information (e.g. read registers) and detach.
    It's also possible to migrate a frozen task to another cgroup.

    This distinguishes the cgroup v2 freezer from the cgroup v1 freezer, which
    mostly tried to imitate the system-wide freezer. While uninterruptible
    sleep is fine when all tasks are going to be frozen (the hibernation case),
    it's not an acceptable state when only a subset of the system is frozen.

    The cgroup v2 freezer does not support freezing kthreads.
    If a non-root cgroup contains a kthread, the cgroup can still be frozen,
    but the kthread will remain running, the cgroup will be shown
    as non-frozen, and the notification will not be delivered.

    * PTRACE_ATTACH does not work because non-fatal signal delivery
    is blocked in the frozen state.

    There are some interface differences between cgroup v1 and cgroup v2
    freezer too, which are required to conform to the cgroup v2 interface
    design principles:
    1) There is no separate controller, which has to be turned on:
    the functionality is always available and is represented by
    cgroup.freeze and cgroup.events cgroup control files.
    2) The desired state is defined by the cgroup.freeze control file.
    Any hierarchical configuration is allowed.
    3) The interface is asynchronous. The actual state is available
    using cgroup.events control file ("frozen" field). There are no
    dedicated transitional states.
    4) It's allowed to make any changes with the cgroup hierarchy
    (create new cgroups, remove old cgroups, move tasks between cgroups)
    no matter if some cgroups are frozen.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Tejun Heo
    No-objection-from-me-by: Oleg Nesterov
    Cc: kernel-team@fb.com

    Roman Gushchin
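    A small user-space sketch of the interface described above: write "1" to
    cgroup.freeze and read the asynchronous state back from the "frozen" field of
    cgroup.events (the cgroup path is illustrative and assumes cgroup2 is mounted
    at /sys/fs/cgroup):

        #include <stdio.h>
        #include <string.h>

        #define CG "/sys/fs/cgroup/test"        /* illustrative cgroup path */

        int main(void)
        {
                char line[128];
                FILE *f = fopen(CG "/cgroup.freeze", "w");

                if (!f)
                        return 1;
                fputs("1\n", f);                /* request the frozen state */
                fclose(f);

                /* The interface is asynchronous: the actual state shows up in
                 * cgroup.events once the tasks have entered the frozen state. */
                f = fopen(CG "/cgroup.events", "r");
                if (!f)
                        return 1;
                while (fgets(line, sizeof(line), f))
                        if (!strncmp(line, "frozen ", 7))
                                printf("frozen: %s", line + 7);
                fclose(f);
                return 0;
        }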
     

10 Mar, 2019

1 commit

  • Pull rdma updates from Jason Gunthorpe:
    "This has been a slightly more active cycle than normal with ongoing
    core changes and quite a lot of collected driver updates.

    - Various driver fixes for bnxt_re, cxgb4, hns, mlx5, pvrdma, rxe

    - A new data transfer mode for HFI1 giving higher performance

    - Significant functional and bug fix update to the mlx5
    On-Demand-Paging MR feature

    - A chip hang reset recovery system for hns

    - Change mm->pinned_vm to an atomic64

    - Update bnxt_re to support a new 57500 chip

    - A sane netlink 'rdma link add' method for creating rxe devices and
    fixing the various unregistration race conditions in rxe's
    unregister flow

    - Allow lookup up objects by an ID over netlink

    - Various reworking of the core to driver interface:
    - drivers should not assume umem SGLs are in PAGE_SIZE chunks
    - ucontext is accessed via udata not other means
    - start to make the core code responsible for object memory
    allocation
    - drivers should convert struct device to struct ib_device via a
    helper
    - drivers have more tools to avoid use after unregister problems"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (280 commits)
    net/mlx5: ODP support for XRC transport is not enabled by default in FW
    IB/hfi1: Close race condition on user context disable and close
    RDMA/umem: Revert broken 'off by one' fix
    RDMA/umem: minor bug fix in error handling path
    RDMA/hns: Use GFP_ATOMIC in hns_roce_v2_modify_qp
    cxgb4: kfree mhp after the debug print
    IB/rdmavt: Fix concurrency panics in QP post_send and modify to error
    IB/rdmavt: Fix loopback send with invalidate ordering
    IB/iser: Fix dma_nents type definition
    IB/mlx5: Set correct write permissions for implicit ODP MR
    bnxt_re: Clean cq for kernel consumers only
    RDMA/uverbs: Don't do double free of allocated PD
    RDMA: Handle ucontext allocations by IB/core
    RDMA/core: Fix a WARN() message
    bnxt_re: fix the regression due to changes in alloc_pbl
    IB/mlx4: Increase the timeout for CM cache
    IB/core: Abort page fault handler silently during owning process exit
    IB/mlx5: Validate correct PD before prefetch MR
    IB/mlx5: Protect against prefetch of invalid MR
    RDMA/uverbs: Store PR pointer before it is overwritten
    ...

    Linus Torvalds
     

08 Mar, 2019

1 commit


08 Feb, 2019

1 commit

  • Taking a sleeping lock to _only_ increment a variable is quite the
    overkill, and pretty much all users do this. Furthermore, some drivers
    (ie: infiniband and scif) that need pinned semantics can go to quite
    some trouble to actually delay via workqueue (un)accounting for pinned
    pages when not possible to acquire it.

    By making the counter atomic we no longer need to hold the mmap_sem and
    can simplify some code around it for pinned_vm users. The counter is 64-bit
    so that we need not worry about overflows, e.g. from rdma user input
    controlled from userspace.

    Reviewed-by: Ira Weiny
    Reviewed-by: Christoph Lameter
    Reviewed-by: Daniel Jordan
    Reviewed-by: Jan Kara
    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Jason Gunthorpe

    Davidlohr Bueso
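    A brief before/after sketch of what the conversion buys pinned_vm users
    (illustrative fragment, not the exact driver code):

        /* Before: take a sleeping lock just to bump a counter. */
        down_write(&mm->mmap_sem);
        mm->pinned_vm += npages;
        up_write(&mm->mmap_sem);

        /* After: mm->pinned_vm is an atomic64_t, no mmap_sem needed here. */
        atomic64_add(npages, &mm->pinned_vm);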
     

04 Feb, 2019

4 commits

  • atomic_t variables are currently used to implement reference
    counters with the following properties:

    - counter is initialized to 1 using atomic_set()
    - a resource is freed upon counter reaching zero
    - once counter reaches zero, its further
    increments aren't allowed
    - counter schema uses basic atomic operations
    (set, inc, inc_not_zero, dec_and_test, etc.)

    Such atomic variables should be converted to a newly provided
    refcount_t type and API that prevents accidental counter overflows
    and underflows. This is important since overflows and underflows
    can lead to use-after-free situation and be exploitable.

    The variable task_struct.stack_refcount is used as pure reference counter.
    Convert it to refcount_t and fix up the operations.

    ** Important note for maintainers:

    Some functions from refcount_t API defined in lib/refcount.c
    have different memory ordering guarantees than their atomic
    counterparts.

    The full comparison can be seen in
    https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon
    in state to be merged to the documentation tree.

    Normally the differences should not matter since refcount_t provides
    enough guarantees to satisfy the refcounting use cases, but in
    some rare cases it might matter.

    Please double check that you don't have some undocumented
    memory guarantees for this variable usage.

    For the task_struct.stack_refcount it might make a difference
    in following places:

    - try_get_task_stack(): increment in refcount_inc_not_zero() only
    guarantees control dependency on success vs. fully ordered
    atomic counterpart
    - put_task_stack(): decrement in refcount_dec_and_test() only
    provides RELEASE ordering and control dependency on success
    vs. fully ordered atomic counterpart

    Suggested-by: Kees Cook
    Signed-off-by: Elena Reshetova
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: David Windsor
    Reviewed-by: Hans Liljestrand
    Reviewed-by: Andrea Parri
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: akpm@linux-foundation.org
    Cc: viro@zeniv.linux.org.uk
    Link: https://lkml.kernel.org/r/1547814450-18902-6-git-send-email-elena.reshetova@intel.com
    Signed-off-by: Ingo Molnar

    Elena Reshetova
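    A condensed sketch of the conversion pattern applied here (the real call
    sites live in kernel/fork.c and include/linux/sched/task_stack.h; this is a
    simplified fragment):

        /* Before: plain atomic_t reference count. */
        atomic_set(&tsk->stack_refcount, 1);
        if (atomic_dec_and_test(&tsk->stack_refcount))
                release_task_stack(tsk);

        /* After: refcount_t traps increment-from-zero, overflow and underflow. */
        refcount_set(&tsk->stack_refcount, 1);
        refcount_inc(&tsk->stack_refcount);     /* try_get_task_stack() uses
                                                   refcount_inc_not_zero() instead */
        if (refcount_dec_and_test(&tsk->stack_refcount))
                release_task_stack(tsk);        /* RELEASE ordering + control dep. */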
     
  • atomic_t variables are currently used to implement reference
    counters with the following properties:

    - counter is initialized to 1 using atomic_set()
    - a resource is freed upon counter reaching zero
    - once counter reaches zero, its further
    increments aren't allowed
    - counter schema uses basic atomic operations
    (set, inc, inc_not_zero, dec_and_test, etc.)

    Such atomic variables should be converted to a newly provided
    refcount_t type and API that prevents accidental counter overflows
    and underflows. This is important since overflows and underflows
    can lead to use-after-free situation and be exploitable.

    The variable task_struct.usage is used as pure reference counter.
    Convert it to refcount_t and fix up the operations.

    ** Important note for maintainers:

    Some functions from refcount_t API defined in lib/refcount.c
    have different memory ordering guarantees than their atomic
    counterparts.

    The full comparison can be seen in
    https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon
    in state to be merged to the documentation tree.

    Normally the differences should not matter since refcount_t provides
    enough guarantees to satisfy the refcounting use cases, but in
    some rare cases it might matter.

    Please double check that you don't have some undocumented
    memory guarantees for this variable usage.

    For the task_struct.usage it might make a difference
    in following places:

    - put_task_struct(): decrement in refcount_dec_and_test() only
    provides RELEASE ordering and control dependency on success
    vs. fully ordered atomic counterpart

    Suggested-by: Kees Cook
    Signed-off-by: Elena Reshetova
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: David Windsor
    Reviewed-by: Hans Liljestrand
    Reviewed-by: Andrea Parri
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: akpm@linux-foundation.org
    Cc: viro@zeniv.linux.org.uk
    Link: https://lkml.kernel.org/r/1547814450-18902-5-git-send-email-elena.reshetova@intel.com
    Signed-off-by: Ingo Molnar

    Elena Reshetova
     
  • atomic_t variables are currently used to implement reference
    counters with the following properties:

    - counter is initialized to 1 using atomic_set()
    - a resource is freed upon counter reaching zero
    - once counter reaches zero, its further
    increments aren't allowed
    - counter schema uses basic atomic operations
    (set, inc, inc_not_zero, dec_and_test, etc.)

    Such atomic variables should be converted to a newly provided
    refcount_t type and API that prevents accidental counter overflows
    and underflows. This is important since overflows and underflows
    can lead to use-after-free situation and be exploitable.

    The variable signal_struct.sigcnt is used as pure reference counter.
    Convert it to refcount_t and fix up the operations.

    ** Important note for maintainers:

    Some functions from refcount_t API defined in lib/refcount.c
    have different memory ordering guarantees than their atomic
    counterparts.

    The full comparison can be seen in
    https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon
    in state to be merged to the documentation tree.

    Normally the differences should not matter since refcount_t provides
    enough guarantees to satisfy the refcounting use cases, but in
    some rare cases it might matter.

    Please double check that you don't have some undocumented
    memory guarantees for this variable usage.

    For the signal_struct.sigcnt it might make a difference
    in following places:

    - put_signal_struct(): decrement in refcount_dec_and_test() only
    provides RELEASE ordering and control dependency on success
    vs. fully ordered atomic counterpart

    Suggested-by: Kees Cook
    Signed-off-by: Elena Reshetova
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: David Windsor
    Reviewed-by: Hans Liljestrand
    Reviewed-by: Andrea Parri
    Reviewed-by: Oleg Nesterov
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: akpm@linux-foundation.org
    Cc: viro@zeniv.linux.org.uk
    Link: https://lkml.kernel.org/r/1547814450-18902-3-git-send-email-elena.reshetova@intel.com
    Signed-off-by: Ingo Molnar

    Elena Reshetova
     
  • atomic_t variables are currently used to implement reference
    counters with the following properties:

    - counter is initialized to 1 using atomic_set()
    - a resource is freed upon counter reaching zero
    - once counter reaches zero, its further
    increments aren't allowed
    - counter schema uses basic atomic operations
    (set, inc, inc_not_zero, dec_and_test, etc.)

    Such atomic variables should be converted to a newly provided
    refcount_t type and API that prevents accidental counter overflows
    and underflows. This is important since overflows and underflows
    can lead to use-after-free situation and be exploitable.

    The variable sighand_struct.count is used as pure reference counter.
    Convert it to refcount_t and fix up the operations.

    ** Important note for maintainers:

    Some functions from refcount_t API defined in lib/refcount.c
    have different memory ordering guarantees than their atomic
    counterparts.

    The full comparison can be seen in
    https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon
    in state to be merged to the documentation tree.

    Normally the differences should not matter since refcount_t provides
    enough guarantees to satisfy the refcounting use cases, but in
    some rare cases it might matter.

    Please double check that you don't have some undocumented
    memory guarantees for this variable usage.

    For the sighand_struct.count it might make a difference
    in following places:

    - __cleanup_sighand: decrement in refcount_dec_and_test() only
    provides RELEASE ordering and control dependency on success
    vs. fully ordered atomic counterpart

    Suggested-by: Kees Cook
    Signed-off-by: Elena Reshetova
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: David Windsor
    Reviewed-by: Hans Liljestrand
    Reviewed-by: Andrea Parri
    Reviewed-by: Oleg Nesterov
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: akpm@linux-foundation.org
    Cc: viro@zeniv.linux.org.uk
    Link: https://lkml.kernel.org/r/1547814450-18902-2-git-send-email-elena.reshetova@intel.com
    Signed-off-by: Ingo Molnar

    Elena Reshetova
     

09 Jan, 2019

3 commits

  • Merge misc fixes from Andrew Morton:
    "14 fixes"

    * emailed patches from Andrew Morton:
    mm, page_alloc: do not wake kswapd with zone lock held
    hugetlbfs: revert "use i_mmap_rwsem for more pmd sharing synchronization"
    hugetlbfs: revert "Use i_mmap_rwsem to fix page fault/truncate race"
    mm: page_mapped: don't assume compound page is huge or THP
    mm/memory.c: initialise mmu_notifier_range correctly
    tools/vm/page_owner: use page_owner_sort in the use example
    kasan: fix krealloc handling for tag-based mode
    kasan: make tag based mode work with CONFIG_HARDENED_USERCOPY
    kasan, arm64: use ARCH_SLAB_MINALIGN instead of manual aligning
    mm, memcg: fix reclaim deadlock with writeback
    mm/usercopy.c: no check page span for stack objects
    slab: alien caches must not be initialized if the allocation of the alien cache failed
    fork, memcg: fix cached_stacks case
    zram: idle writeback fixes and cleanup

    Linus Torvalds
     
  • Commit 5eed6f1dff87 ("fork,memcg: fix crash in free_thread_stack on
    memcg charge fail") fixes a crash caused due to failed memcg charge of
    the kernel stack. However the fix misses the cached_stacks case which
    this patch fixes. So, the same crash can happen if the memcg charge of
    a cached stack is failed.

    Link: http://lkml.kernel.org/r/20190102180145.57406-1-shakeelb@google.com
    Fixes: 5eed6f1dff87 ("fork,memcg: fix crash in free_thread_stack on memcg charge fail")
    Signed-off-by: Shakeel Butt
    Acked-by: Michal Hocko
    Acked-by: Rik van Riel
    Cc: Rik van Riel
    Cc: Roman Gushchin
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     
  • This changes the fork(2) syscall to record the process start_time after
    initializing the basic task structure but still before making the new
    process visible to user-space.

    Technically, we could record the start_time anytime during fork(2). But
    this might lead to scenarios where a start_time is recorded long before
    a process becomes visible to user-space. For instance, with
    userfaultfd(2) and TLS, user-space can delay the execution of fork(2)
    for an indefinite amount of time (and will, if this causes network
    access, or similar).

    By recording the start_time late, it much closer reflects the point in
    time where the process becomes live and can be observed by other
    processes.

    Lastly, this makes it much harder for user-space to predict and control
    the start_time they get assigned. Previously, user-space could fork a
    process and stall it in copy_thread_tls() before its pid is allocated,
    but after its start_time is recorded. This can be misused to later-on
    cycle through PIDs and resume the stalled fork(2) yielding a process
    that has the same pid and start_time as a process that existed before.
    This can be used to circumvent security systems that identify processes
    by their pid+start_time combination.

    Even though user-space was always aware that start_time recording is
    flaky (but several projects are known to still rely on start_time-based
    identification), changing the start_time to be recorded late will help
    mitigate existing attacks and make it much harder for user-space to
    control the start_time a process gets assigned.

    Reported-by: Jann Horn
    Signed-off-by: Tom Gundersen
    Signed-off-by: David Herrmann
    Signed-off-by: Linus Torvalds

    David Herrmann
     

05 Jan, 2019

1 commit

  • We get a warning when building kernel with W=1:

    kernel/fork.c:167:13: warning: no previous prototype for `arch_release_thread_stack' [-Wmissing-prototypes]
    kernel/fork.c:779:13: warning: no previous prototype for `fork_init' [-Wmissing-prototypes]

    Add the missing declaration in head file to fix this.

    Also, remove arch_release_thread_stack() completely because no arch
    seems to implement it since bb9d81264 (arch: remove tile port).

    Link: http://lkml.kernel.org/r/1542170087-23645-1-git-send-email-wang.yi59@zte.com.cn
    Signed-off-by: Yi Wang
    Acked-by: Michal Hocko
    Acked-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yi Wang
     

29 Dec, 2018

3 commits

  • Fixes gcc '-Wunused-but-set-variable' warning when CONFIG_VMAP_STACK is
    not set:

    kernel/fork.c: In function 'dup_task_struct':
    kernel/fork.c:843:20: warning:
    variable 'stack_vm_area' set but not used [-Wunused-but-set-variable]

    Link: http://lkml.kernel.org/r/1545965190-2381-1-git-send-email-yuehaibing@huawei.com
    Signed-off-by: YueHaibing
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    YueHaibing
     
  • totalram_pages and totalhigh_pages are made static inline functions.

    The main motivation was that managed_page_count_lock handling was complicating
    things. It was discussed at length here,
    https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seems
    better to remove the lock and convert the variables to atomic, with prevention
    of potential store-to-read tearing as a bonus.

    [akpm@linux-foundation.org: coding style fixes]
    Link: http://lkml.kernel.org/r/1542090790-21750-4-git-send-email-arunks@codeaurora.org
    Signed-off-by: Arun KS
    Suggested-by: Michal Hocko
    Suggested-by: Vlastimil Babka
    Reviewed-by: Konstantin Khlebnikov
    Reviewed-by: Pavel Tatashin
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: David Hildenbrand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arun KS
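    A hedged sketch of the shape of the conversion: the global becomes an atomic
    counter behind static inline accessors (names follow the series; simplified):

        /* include/linux/mm.h (sketch) */
        extern atomic_long_t _totalram_pages;

        static inline unsigned long totalram_pages(void)
        {
                return (unsigned long)atomic_long_read(&_totalram_pages);
        }

        static inline void totalram_pages_add(long count)
        {
                atomic_long_add(count, &_totalram_pages);
        }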
     
  • Patch series "mm: convert totalram_pages, totalhigh_pages and managed
    pages to atomic", v5.

    This series converts totalram_pages, totalhigh_pages and
    zone->managed_pages to atomic variables.

    totalram_pages, zone->managed_pages and totalhigh_pages updates are
    protected by managed_page_count_lock, but readers never care about it.
    Convert these variables to atomic to avoid readers potentially seeing a
    store tear.

    The main motivation was that managed_page_count_lock handling was complicating
    things. It was discussed at length here,
    https://lore.kernel.org/patchwork/patch/995739/#1181785 It seems better
    to remove the lock and convert the variables to atomic. With the change,
    preventing potential store-to-read tearing comes as a bonus.

    This patch (of 4):

    This is in preparation to a later patch which converts totalram_pages and
    zone->managed_pages to atomic variables. Please note that re-reading the
    value might lead to a different value and as such it could lead to
    unexpected behavior. There are no known bugs as a result of the current
    code but it is better to prevent from them in principle.

    Link: http://lkml.kernel.org/r/1542090790-21750-2-git-send-email-arunks@codeaurora.org
    Signed-off-by: Arun KS
    Reviewed-by: Konstantin Khlebnikov
    Reviewed-by: David Hildenbrand
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Reviewed-by: Pavel Tatashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arun KS
     

22 Dec, 2018

1 commit

  • Commit 9b6f7e163cd0 ("mm: rework memcg kernel stack accounting") will
    result in fork failing if allocating a kernel stack for a task in
    dup_task_struct exceeds the kernel memory allowance for that cgroup.

    Unfortunately, it also results in a crash.

    This is due to the code jumping to free_stack and calling
    free_thread_stack when the memcg kernel stack charge fails, but without
    tsk->stack pointing at the freshly allocated stack.

    This in turn results in the vfree_atomic in free_thread_stack oopsing
    with a backtrace like this:

    #5 [ffffc900244efc88] die at ffffffff8101f0ab
    #6 [ffffc900244efcb8] do_general_protection at ffffffff8101cb86
    #7 [ffffc900244efce0] general_protection at ffffffff818ff082
    [exception RIP: llist_add_batch+7]
    RIP: ffffffff8150d487 RSP: ffffc900244efd98 RFLAGS: 00010282
    RAX: 0000000000000000 RBX: ffff88085ef55980 RCX: 0000000000000000
    RDX: ffff88085ef55980 RSI: 343834343531203a RDI: 343834343531203a
    RBP: ffffc900244efd98 R8: 0000000000000001 R9: ffff8808578c3600
    R10: 0000000000000000 R11: 0000000000000001 R12: ffff88029f6c21c0
    R13: 0000000000000286 R14: ffff880147759b00 R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
    #8 [ffffc900244efda0] vfree_atomic at ffffffff811df2c7
    #9 [ffffc900244efdb8] copy_process at ffffffff81086e37
    #10 [ffffc900244efe98] _do_fork at ffffffff810884e0
    #11 [ffffc900244eff10] sys_vfork at ffffffff810887ff
    #12 [ffffc900244eff20] do_syscall_64 at ffffffff81002a43
    RIP: 000000000049b948 RSP: 00007ffcdb307830 RFLAGS: 00000246
    RAX: ffffffffffffffda RBX: 0000000000896030 RCX: 000000000049b948
    RDX: 0000000000000000 RSI: 00007ffcdb307790 RDI: 00000000005d7421
    RBP: 000000000067370f R8: 00007ffcdb3077b0 R9: 000000000001ed00
    R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000040
    R13: 000000000000000f R14: 0000000000000000 R15: 000000000088d018
    ORIG_RAX: 000000000000003a CS: 0033 SS: 002b

    The simplest fix is to assign tsk->stack right where it is allocated.

    Link: http://lkml.kernel.org/r/20181214231726.7ee4843c@imladris.surriel.com
    Fixes: 9b6f7e163cd0 ("mm: rework memcg kernel stack accounting")
    Signed-off-by: Rik van Riel
    Acked-by: Roman Gushchin
    Acked-by: Michal Hocko
    Cc: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
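    A simplified sketch of how the fix reads in alloc_thread_stack_node()'s
    non-VMAP_STACK branch (reconstructed from the description above, so treat the
    details as an assumption):

        static unsigned long *alloc_thread_stack_node(struct task_struct *tsk,
                                                      int node)
        {
                struct page *page = alloc_pages_node(node, THREADINFO_GFP,
                                                     THREAD_SIZE_ORDER);

                if (likely(page)) {
                        /* Assign right where it is allocated, so a later failure
                         * (e.g. the memcg charge) can safely jump to free_stack
                         * and call free_thread_stack(tsk). */
                        tsk->stack = page_address(page);
                        return tsk->stack;
                }
                return NULL;
        }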
     

02 Nov, 2018

1 commit

  • Pull stackleak gcc plugin from Kees Cook:
    "Please pull this new GCC plugin, stackleak, for v4.20-rc1. This plugin
    was ported from grsecurity by Alexander Popov. It provides efficient
    stack content poisoning at syscall exit. This creates a defense
    against at least two classes of flaws:

    - Uninitialized stack usage. (We continue to work on improving the
    compiler to do this in other ways: e.g. unconditional zero init was
    proposed to GCC and Clang, and more plugin work has started too).

    - Stack content exposure. By greatly reducing the lifetime of valid
    stack contents, exposures via either direct read bugs or unknown
    cache side-channels become much more difficult to exploit. This
    complements the existing buddy and heap poisoning options, but
    provides the coverage for stacks.

    The x86 hooks are included in this series (which have been reviewed by
    Ingo, Dave Hansen, and Thomas Gleixner). The arm64 hooks have already
    been merged through the arm64 tree (written by Laura Abbott and
    reviewed by Mark Rutland and Will Deacon).

    With VLAs having been removed this release, there is no need for
    alloca() protection, so it has been removed from the plugin"

    * tag 'stackleak-v4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    arm64: Drop unneeded stackleak_check_alloca()
    stackleak: Allow runtime disabling of kernel stack erasing
    doc: self-protection: Add information about STACKLEAK feature
    fs/proc: Show STACKLEAK metrics in the /proc file system
    lkdtm: Add a test for STACKLEAK
    gcc-plugins: Add STACKLEAK plugin for tracking the kernel stack
    x86/entry: Add STACKLEAK erasing the kernel stack at the end of syscalls

    Linus Torvalds
     

27 Oct, 2018

2 commits

  • When systems are overcommitted and resources become contended, it's hard
    to tell exactly the impact this has on workload productivity, or how close
    the system is to lockups and OOM kills. In particular, when machines work
    multiple jobs concurrently, the impact of overcommit in terms of latency
    and throughput on the individual job can be enormous.

    In order to maximize hardware utilization without sacrificing individual
    job health or risk complete machine lockups, this patch implements a way
    to quantify resource pressure in the system.

    A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
    expose the percentage of time the system is stalled on CPU, memory, or IO,
    respectively. Stall states are aggregate versions of the per-task delay
    accounting delays:

    cpu: some tasks are runnable but not executing on a CPU
    memory: tasks are reclaiming, or waiting for swapin or thrashing cache
    io: tasks are waiting for io completions

    These percentages of walltime can be thought of as pressure percentages,
    and they give a general sense of system health and productivity loss
    incurred by resource overcommit. They can also indicate when the system
    is approaching lockup scenarios and OOMs.

    To do this, psi keeps track of the task states associated with each CPU
    and samples the time they spend in stall states. Every 2 seconds, the
    samples are averaged across CPUs - weighted by the CPUs' non-idle time to
    eliminate artifacts from unused CPUs - and translated into percentages of
    walltime. A running average of those percentages is maintained over 10s,
    1m, and 5m periods (similar to the loadaverage).

    [hannes@cmpxchg.org: doc fixlet, per Randy]
    Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org
    [hannes@cmpxchg.org: code optimization]
    Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org
    [hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter]
    Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org
    [hannes@cmpxchg.org: fix build]
    Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org
    Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Peter Zijlstra (Intel)
    Tested-by: Daniel Drake
    Tested-by: Suren Baghdasaryan
    Cc: Christopher Lameter
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Mike Galbraith
    Cc: Peter Enderborg
    Cc: Randy Dunlap
    Cc: Shakeel Butt
    Cc: Tejun Heo
    Cc: Vinayak Menon
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
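    A small user-space sketch of consuming the new interface. The exact line
    format of /proc/pressure/* ("some"/"full" records with avg10/avg60/avg300/total
    fields) is assumed from the psi documentation added by this series:

        #include <stdio.h>

        int main(void)
        {
                char line[256];
                FILE *f = fopen("/proc/pressure/memory", "r");

                if (!f)
                        return 1;       /* kernel built without CONFIG_PSI=y */

                /* Lines look roughly like:
                 *   some avg10=0.12 avg60=0.05 avg300=0.01 total=123456
                 *   full avg10=0.00 avg60=0.00 avg300=0.00 total=7890
                 */
                while (fgets(line, sizeof(line), f))
                        fputs(line, stdout);
                fclose(f);
                return 0;
        }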
     
  • If CONFIG_VMAP_STACK is set, kernel stacks are allocated using
    __vmalloc_node_range() with __GFP_ACCOUNT. So kernel stack pages are
    charged against corresponding memory cgroups on allocation and uncharged
    on releasing them.

    The problem is that we do cache kernel stacks in small per-cpu caches and
    do reuse them for new tasks, which can belong to different memory cgroups.

    Each stack page still holds a reference to the original cgroup, so the
    cgroup can't be released until the vmap area is released.

    To make this happen we need more than two subsequent exits without forks
    in between on the current cpu, which makes it very unlikely to happen. As
    a result, I saw a significant number of dying cgroups (in theory, up to 2
    * number_of_cpu + number_of_tasks), which can't be released even by
    significant memory pressure.

    As a cgroup structure can take a significant amount of memory (first of
    all, per-cpu data like memcg statistics), it leads to a noticeable waste
    of memory.

    Link: http://lkml.kernel.org/r/20180827162621.30187-1-guro@fb.com
    Fixes: ac496bf48d97 ("fork: Optimize task creation by caching two thread stacks per CPU if CONFIG_VMAP_STACK=y")
    Signed-off-by: Roman Gushchin
    Reviewed-by: Shakeel Butt
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Andy Lutomirski
    Cc: Konstantin Khlebnikov
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

05 Sep, 2018

1 commit

  • Commit d70f2a14b72a ("include/linux/sched/mm.h: uninline mmdrop_async(),
    etc") ignored the return value of arch_dup_mmap(). As a result, on x86,
    a failure to duplicate the LDT (e.g. due to memory allocation error)
    would leave the duplicated memory mapping in an inconsistent state.

    Fix by using the return value, as it was before the change.

    Link: http://lkml.kernel.org/r/20180823051229.211856-1-namit@vmware.com
    Fixes: d70f2a14b72a4 ("include/linux/sched/mm.h: uninline mmdrop_async(), etc")
    Signed-off-by: Nadav Amit
    Acked-by: Michal Hocko
    Cc:

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nadav Amit