25 Aug, 2017

1 commit

  • commit dd1c1f2f2028a7b851f701fc6a8ebe39dcb95e7c upstream.

    This was reported many times, and this was even mentioned in commit
    52ee2dfdd4f5 ("pids: refactor vnr/nr_ns helpers to make them safe") but
    somehow nobody bothered to fix the obvious problem: task_tgid_nr_ns() is
    not safe because task->group_leader points to nowhere after the exiting
    task passes exit_notify(); rcu_read_lock() cannot help.

    We really need to change __unhash_process() to nullify group_leader,
    parent, and real_parent, but this needs some cleanups. Until then we
    can turn task_tgid_nr_ns() into another user of __task_pid_nr_ns() and
    fix the problem.

    Reported-by: Troy Kensinger
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Oleg Nesterov
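
    A minimal sketch of the resulting helper, following the description above
    (this is how I read the change; __PIDTYPE_TGID is the enum value the
    commit introduces, treat the exact names as illustrative):

    /* include/linux/sched.h (sketch): read the tgid from the task's own pid
     * links under RCU instead of dereferencing task->group_leader, which is
     * unstable once the task has passed exit_notify(). */
    static inline pid_t task_tgid_nr_ns(struct task_struct *tsk,
                                        struct pid_namespace *ns)
    {
            return __task_pid_nr_ns(tsk, __PIDTYPE_TGID, ns);
    }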
     

11 Aug, 2017

1 commit

  • [ Upstream commit 2d39b3cd34e6d323720d4c61bd714f5ae202c022 ]

    Since commit 00cd5c37afd5 ("ptrace: permit ptracing of /sbin/init") we
    can now trace init processes. init is initially protected with
    SIGNAL_UNKILLABLE which will prevent fatal signals such as SIGSTOP, but
    there are a number of paths during tracing where SIGNAL_UNKILLABLE can
    be implicitly cleared.

    This can result in init becoming stoppable/killable after tracing. For
    example, running:

    while true; do kill -STOP 1; done &
    strace -p 1

    and then stopping strace and the kill loop will result in init being
    left in state TASK_STOPPED. Sending SIGCONT to init will resume it, but
    init will now respond to future SIGSTOP signals rather than ignoring
    them.

    Make sure that when setting SIGNAL_STOP_CONTINUED/SIGNAL_STOP_STOPPED we
    don't clear SIGNAL_UNKILLABLE.

    Link: http://lkml.kernel.org/r/20170104122017.25047-1-jamie.iles@oracle.com
    Signed-off-by: Jamie Iles
    Acked-by: Oleg Nesterov
    Cc: Alexander Viro
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Jamie Iles
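
    A rough sketch of the shape of the fix (helper and mask names are
    illustrative, not necessarily the upstream ones): instead of overwriting
    signal->flags wholesale, only the stop-related bits are replaced, so
    SIGNAL_UNKILLABLE survives a tracing-induced stop.

    /* sketch, assuming the existing SIGNAL_* flag definitions */
    #define SIGNAL_STOP_MASK (SIGNAL_CLD_MASK | SIGNAL_STOP_STOPPED | \
                              SIGNAL_STOP_CONTINUED)

    static inline void signal_set_stop_flags(struct signal_struct *sig,
                                             unsigned int flags)
    {
            /* replace only the stop state; keep SIGNAL_UNKILLABLE intact */
            sig->flags = (sig->flags & ~SIGNAL_STOP_MASK) | flags;
    }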
     

21 Apr, 2017

1 commit

  • commit 77f88796cee819b9c4562b0b6b44691b3b7755b1 upstream.

    Creation of a kthread goes through a couple of interlocked stages between
    the kthread itself and its creator. Once the new kthread starts
    running, it initializes itself and wakes up the creator. The creator
    then can further configure the kthread and then let it start doing its
    job by waking it up.

    In this configuration-by-creator stage, the creator is the only one
    that can wake it up but the kthread is visible to userland. When
    altering the kthread's attributes from userland is allowed, this is
    fine; however, for cases where CPU affinity is critical,
    kthread_bind() is used to first disable affinity changes from userland
    and then set the affinity. This also prevents the kthread from being
    migrated into non-root cgroups as that can affect the CPU affinity and
    many other things.

    Unfortunately, the cgroup side of protection is racy. While the
    PF_NO_SETAFFINITY flag prevents further migrations, userland can win
    the race before the creator sets the flag with kthread_bind() and put
    the kthread in a non-root cgroup, which can lead to all sorts of
    problems including incorrect CPU affinity and starvation.

    This bug got triggered by userland which periodically tries to migrate
    all processes in the root cpuset cgroup to a non-root one. Per-cpu
    workqueue workers got caught while being created and ended up with
    incorrect CPU affinity, breaking concurrency management and sometimes
    stalling workqueue execution.

    This patch adds task->no_cgroup_migration, which prevents the task from
    being migrated by userland. kthreadd starts with the flag set, making
    every child kthread start in the root cgroup with migration
    disallowed. The flag is cleared after the kthread finishes
    initialization by which time PF_NO_SETAFFINITY is set if the kthread
    should stay in the root cgroup.

    It'd be better to wait for the initialization instead of failing, but I
    couldn't think of a way of implementing that without adding either a
    new PF flag or sleeping and retrying on the waiting side. Even if
    userland depends on changing cgroup membership of a kthread, it either
    has to be synchronized with kthread_create() or periodically repeat,
    so it's unlikely that this would break anything.

    v2: Switch to a simpler implementation using a new task_struct bit
    field suggested by Oleg.

    Signed-off-by: Tejun Heo
    Suggested-by: Oleg Nesterov
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Peter Zijlstra (Intel)
    Cc: Thomas Gleixner
    Reported-and-debugged-by: Chris Mason
    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
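
    A condensed sketch of the mechanism (the field name comes from the text
    above; the helper names and the exact location of the check in the cgroup
    code are how I recall them and should be treated as illustrative):

    /* kthreadd marks itself non-migratable; children inherit the bit on fork
     * and clear it themselves once their setup is complete. */
    static inline void cgroup_init_kthreadd(void)
    {
            current->no_cgroup_migration = 1;       /* in kthreadd() */
    }

    static inline void cgroup_kthread_ready(void)
    {
            current->no_cgroup_migration = 0;       /* in kthread(), after init */
    }

    /* cgroup migration path (sketch): refuse to move a kthread that is still
     * being configured, keeping it in the root cgroup. */
    if (tsk->no_cgroup_migration)
            return -EINVAL;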
     

06 Jan, 2017

1 commit

  • commit 64b875f7ac8a5d60a4e191479299e931ee949b67 upstream.

    When the flag PT_PTRACE_CAP was added, the PTRACE_TRACEME path was
    overlooked. This can result in incorrect behavior when an application
    like strace traces an exec of a setuid executable.

    Further PT_PTRACE_CAP does not have enough information for making good
    security decisions as it does not report which user namespace the
    capability is in. This has already allowed one mistake through
    insufficient granularity.

    I found this issue when I was testing another corner case of exec and
    discovered that I could not get strace to set PT_PTRACE_CAP even when
    running strace as root with a full set of caps.

    This change fixes the above issue with strace, allowing a setuid
    executable to be straced as root without disabling setuid. More
    fundamentally, this change allows what is allowable at all times, by
    using the correct information in its decision.

    Fixes: 4214e42f96d4 ("v2.4.9.11 -> v2.4.9.12")
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
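
    A sketch of the replacement check, as I understand the change (the
    ptracer_cred field and ptracer_capable() helper are the names I associate
    with it; details are illustrative): the tracer's credentials are captured
    at attach time and consulted at exec time, instead of the namespace-blind
    PT_PTRACE_CAP bit.

    /* ptrace attach path (sketch): remember who is tracing us. */
    task->ptracer_cred = get_cred(current->cred);

    /* later, e.g. on exec of a setuid binary (sketch): was the recorded
     * tracer privileged over @ns? */
    bool ptracer_capable(struct task_struct *tsk, struct user_namespace *ns)
    {
            int ret = 0;    /* an absent tracer adds no restrictions */
            const struct cred *cred;

            rcu_read_lock();
            cred = rcu_dereference(tsk->ptracer_cred);
            if (cred)
                    ret = security_capable_noaudit(cred, ns, CAP_SYS_PTRACE);
            rcu_read_unlock();

            return ret == 0;
    }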
     

22 Nov, 2016

1 commit

  • Exactly because for_each_thread() in autogroup_move_group() can't see it
    and update its ->sched_task_group before _put() and possibly free().

    So the exiting task needs another sched_move_task() before exit_notify()
    and we need to re-introduce the PF_EXITING (or similar) check removed by
    the previous change for another reason.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: hartsjc@redhat.com
    Cc: vbendel@redhat.com
    Cc: vlovejoy@redhat.com
    Link: http://lkml.kernel.org/r/20161114184612.GA15968@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     

08 Oct, 2016

6 commits

  • The global zero page is used to satisfy an anonymous read fault. If
    THP (Transparent HugePage) is enabled then the global huge zero page is
    used. The global huge zero page uses an atomic counter for reference
    counting and is allocated/freed dynamically according to its counter
    value.

    CPU time spent on that counter will greatly increase if there are a lot
    of processes doing anonymous read faults. This patch proposes a way to
    reduce the access to the global counter so that the CPU load can be
    reduced accordingly.

    To do this, a new flag of the mm_struct is introduced:
    MMF_USED_HUGE_ZERO_PAGE. With this flag, the process only needs to touch
    the global counter in two cases:

    1. The first time it uses the global huge zero page;
    2. When mm_users of its mm_struct reaches zero.

    Note that right now, the huge zero page is eligible to be freed as soon
    as its last use goes away. With this patch, the page will not be
    eligible to be freed until the last process that ever used it exits.

    And with the use of mm_users, a kthread is not eligible to use the huge
    zero page either. Since no kthread uses the huge zero page today, there
    is no difference after applying this patch. But if that is not desired,
    I can change it to trigger when mm_count reaches zero.

    Test case used on a Haswell-EP system:

    usemem -n 72 --readonly -j 0x200000 100G

    This spawns 72 processes; each mmaps 100G of anonymous space and then
    does read-only, sequential access to that space with a step of 2MB.

    CPU cycles from perf report for base commit:
    54.03% usemem [kernel.kallsyms] [k] get_huge_zero_page
    CPU cycles from perf report for this commit:
    0.11% usemem [kernel.kallsyms] [k] mm_get_huge_zero_page

    Performance(throughput) of the workload for base commit: 1784430792
    Performance(throughput) of the workload for this commit: 4726928591
    164% increase.

    Runtime of the workload for base commit: 707592 us
    Runtime of the workload for this commit: 303970 us
    50% drop.

    Link: http://lkml.kernel.org/r/fe51a88f-446a-4622-1363-ad1282d71385@intel.com
    Signed-off-by: Aaron Lu
    Cc: Sergey Senozhatsky
    Cc: "Kirill A. Shutemov"
    Cc: Dave Hansen
    Cc: Tim Chen
    Cc: Huang Ying
    Cc: Vlastimil Babka
    Cc: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Ebru Akagunduz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aaron Lu
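
    A sketch of the fast path this enables (the flag name is taken from the
    message above; the function and the get/put helpers are illustrative):
    after the first use, the per-mm bit short-circuits the global atomic
    counter entirely.

    /* mm/huge_memory.c (sketch) */
    struct page *mm_get_huge_zero_page(struct mm_struct *mm)
    {
            /* fast path: this mm already holds a reference, skip the atomic */
            if (test_bit(MMF_USED_HUGE_ZERO_PAGE, &mm->flags))
                    return READ_ONCE(huge_zero_page);

            if (!get_huge_zero_page())
                    return NULL;

            /* lost a race with another thread of this mm: drop the extra ref */
            if (test_and_set_bit(MMF_USED_HUGE_ZERO_PAGE, &mm->flags))
                    put_huge_zero_page();

            return READ_ONCE(huge_zero_page);
    }

    /* the matching put happens once per mm, when mm_users drops to zero */
    void mm_put_huge_zero_page(struct mm_struct *mm)
    {
            if (test_bit(MMF_USED_HUGE_ZERO_PAGE, &mm->flags))
                    put_huge_zero_page();
    }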
     
    There are only a few use_mm() users in the kernel right now. Most of them
    write to the target memory but vhost driver relies on
    copy_from_user/get_user from a kernel thread context. This makes it
    impossible to reap the memory of an oom victim which shares the mm with
    the vhost kernel thread because it could see a zero page unexpectedly
    and theoretically make an incorrect decision visible outside of the
    killed task context.

    To quote Michael S. Tsirkin:
    : Getting an error from __get_user and friends is handled gracefully.
    : Getting zero instead of a real value will cause userspace
    : memory corruption.

    The vhost kernel thread is bound to an open fd of the vhost device which
    is not tied to the mm owner's life cycle in general. The device fd can
    be inherited or passed over to another process which means that we
    really have to be careful about unexpected memory corruption because
    unlike for normal oom victims the result will be visible outside of the
    oom victim context.

    Make sure that no kthread context (users of use_mm) can ever see
    corrupted data because of the oom reaper, by hooking into the page fault
    path and checking the MMF_UNSTABLE mm flag. __oom_reap_task_mm sets the
    flag before it starts unmapping the address space, and the flag is
    checked after the page fault has been handled. If the flag is set then
    SIGBUS is triggered, so any gup user will get an error code.

    Regular tasks do not need this protection because all tasks which share
    the mm are killed when the mm is reaped, so the corruption will not
    outlive them.

    This patch shouldn't have any visible effect at this moment because the
    OOM killer doesn't invoke the oom reaper for tasks whose mm is shared
    with kthreads yet.

    Link: http://lkml.kernel.org/r/1472119394-11342-9-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: "Michael S. Tsirkin"
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
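
    A sketch of the fault-path check described above (placement at the end of
    handle_mm_fault() is how I recall it; treat it as illustrative):

    /* __oom_reap_task_mm() marks the mm before it starts unmapping: */
    set_bit(MMF_UNSTABLE, &mm->flags);

    /* mm/memory.c, after the fault has been handled (sketch): a kthread
     * faulting on an already-reaped mm must not see zero pages, so return
     * SIGBUS and let the use_mm() user see an error instead of corruption. */
    if (unlikely((current->flags & PF_KTHREAD) && !(ret & VM_FAULT_ERROR) &&
                 test_bit(MMF_UNSTABLE, &mm->flags)))
            ret = VM_FAULT_SIGBUS;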
     
  • After "oom: keep mm of the killed task available" we can safely detect
    an oom victim by checking task->signal->oom_mm so we do not need the
    signal_struct counter anymore so let's get rid of it.

    This alone wouldn't be sufficient for nommu archs because
    exit_oom_victim doesn't hide the process from the oom killer anymore.
    We can, however, mark the mm with a MMF flag in __mmput. We can reuse
    MMF_OOM_REAPED and rename it to a more generic MMF_OOM_SKIP.

    Link: http://lkml.kernel.org/r/1472119394-11342-6-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Lockdep complains that __mmdrop is not safe from the softirq context:

    =================================
    [ INFO: inconsistent lock state ]
    4.6.0-oomfortification2-00011-geeb3eadeab96-dirty #949 Tainted: G W
    ---------------------------------
    inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
    swapper/1/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
    (pgd_lock){+.?...}, at: pgd_free+0x19/0x6b
    {SOFTIRQ-ON-W} state was registered at:
    __lock_acquire+0xa06/0x196e
    lock_acquire+0x139/0x1e1
    _raw_spin_lock+0x32/0x41
    __change_page_attr_set_clr+0x2a5/0xacd
    change_page_attr_set_clr+0x16f/0x32c
    set_memory_nx+0x37/0x3a
    free_init_pages+0x9e/0xc7
    alternative_instructions+0xa2/0xb3
    check_bugs+0xe/0x2d
    start_kernel+0x3ce/0x3ea
    x86_64_start_reservations+0x2a/0x2c
    x86_64_start_kernel+0x17a/0x18d
    irq event stamp: 105916
    hardirqs last enabled at (105916): free_hot_cold_page+0x37e/0x390
    hardirqs last disabled at (105915): free_hot_cold_page+0x2c1/0x390
    softirqs last enabled at (105878): _local_bh_enable+0x42/0x44
    softirqs last disabled at (105879): irq_exit+0x6f/0xd1

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(pgd_lock);

    lock(pgd_lock);

    *** DEADLOCK ***

    1 lock held by swapper/1/0:
    #0: (rcu_callback){......}, at: rcu_process_callbacks+0x390/0x800

    stack backtrace:
    CPU: 1 PID: 0 Comm: swapper/1 Tainted: G W 4.6.0-oomfortification2-00011-geeb3eadeab96-dirty #949
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
    Call Trace:

    print_usage_bug.part.25+0x259/0x268
    mark_lock+0x381/0x567
    __lock_acquire+0x993/0x196e
    lock_acquire+0x139/0x1e1
    _raw_spin_lock+0x32/0x41
    pgd_free+0x19/0x6b
    __mmdrop+0x25/0xb9
    __put_task_struct+0x103/0x11e
    delayed_put_task_struct+0x157/0x15e
    rcu_process_callbacks+0x660/0x800
    __do_softirq+0x1ec/0x4d5
    irq_exit+0x6f/0xd1
    smp_apic_timer_interrupt+0x42/0x4d
    apic_timer_interrupt+0x8e/0xa0

    arch_cpu_idle+0xf/0x11
    default_idle_call+0x32/0x34
    cpu_startup_entry+0x20c/0x399
    start_secondary+0xfe/0x101

    Moreover, commit a79e53d85683 ("x86/mm: Fix pgd_lock deadlock") was
    explicit that pgd_lock must not be taken from irq context. This
    means that __mmdrop called from free_signal_struct has to be postponed
    to a user context. We already have a similar mechanism for mmput_async
    so we can use it here as well. This is safe because mm_count is pinned
    by mm_users.

    This fixes a bug introduced by "oom: keep mm of the killed task available".

    Link: http://lkml.kernel.org/r/1472119394-11342-5-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
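
    A sketch of the deferral, mirroring the existing mmput_async pattern (the
    field and function names are how I recall them landing in kernel/fork.c;
    illustrative):

    /* drop the last mm_count reference from process context so __mmdrop()
     * (and pgd_free() under pgd_lock) never runs in irq/softirq context */
    static void mmdrop_async_fn(struct work_struct *work)
    {
            struct mm_struct *mm;

            mm = container_of(work, struct mm_struct, async_put_work);
            __mmdrop(mm);
    }

    static void mmdrop_async(struct mm_struct *mm)
    {
            if (unlikely(atomic_dec_and_test(&mm->mm_count))) {
                    INIT_WORK(&mm->async_put_work, mmdrop_async_fn);
                    schedule_work(&mm->async_put_work);
            }
    }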
     
    oom_reap_task has to call exit_oom_victim in order to make sure that the
    oom victim will not block the oom killer forever. This is, however,
    opening new problems (e.g. oom_killer_disable exclusion - see commit
    74070542099c ("oom, suspend: fix oom_reaper vs. oom_killer_disable
    race")). exit_oom_victim should ideally only be called from the victim's
    context.

    One way to achieve this would be to rely on per mm_struct flags. We
    already have MMF_OOM_REAPED to hide a task from the oom killer since
    "mm, oom: hide mm which is shared with kthread or global init". The
    problem is that the exit path:

    do_exit
      exit_mm
        tsk->mm = NULL;
        mmput
          __mmput
            exit_oom_victim

    doesn't guarantee that exit_oom_victim will get called in a bounded
    amount of time. At least exit_aio depends on IO which might get blocked
    due to lack of memory and who knows what else is lurking there.

    This patch takes a different approach. We remember tsk->mm into the
    signal_struct and bind it to the signal_struct lifetime for all oom
    victims. __oom_reap_task_mm as well as oom_scan_process_thread do not
    have to rely on find_lock_task_mm anymore and they will have a reliable
    reference to the mm struct. As a result all the oom specific
    communication inside the OOM killer can be done via tsk->signal->oom_mm.

    Increasing the signal_struct for something as unlikely as the oom killer
    is far from ideal, but this approach will make the code much more
    reasonable, and in the long term we might even want to move task->mm
    into the signal_struct anyway. In the next step we might want to make
    the oom killer exclusion and access to memory reserves completely
    independent, which would also be nice.

    Link: http://lkml.kernel.org/r/1472119394-11342-4-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
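
    A sketch of how the victim's mm gets pinned to the signal_struct (this is
    my reading of mark_oom_victim() after the change; details illustrative):

    /* mm/oom_kill.c (sketch): the first thread marked as a victim stashes
     * its mm in signal_struct and takes an mm_count reference; the matching
     * mmdrop() happens when the signal_struct itself is freed. */
    static void mark_oom_victim(struct task_struct *tsk)
    {
            struct mm_struct *mm = tsk->mm;

            if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
                    return;

            /* oom_mm is bound to the signal_struct lifetime */
            if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm))
                    atomic_inc(&mm->mm_count);

            /* ... thawing and oom_victims accounting unchanged ... */
    }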
     
  • "mm, oom_reaper: do not attempt to reap a task twice" tried to give the
    OOM reaper one more chance to retry using MMF_OOM_NOT_REAPABLE flag.
    But the usefulness of the flag is rather limited and actually never
    shown in practice. If the flag is set, it means that the holder of
    mm->mmap_sem cannot call up_write() due to presumably being blocked at
    unkillable wait waiting for other thread's memory allocation. But since
    one of threads sharing that mm will queue that mm immediately via
    task_will_free_mem() shortcut (otherwise, oom_badness() will select the
    same mm again due to oom_score_adj value unchanged), retrying
    MMF_OOM_NOT_REAPABLE mm is unlikely helpful.

    Let's always set MMF_OOM_REAPED.

    Link: http://lkml.kernel.org/r/1472119394-11342-3-git-send-email-mhocko@kernel.org
    Signed-off-by: Tetsuo Handa
    Signed-off-by: Michal Hocko
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     

04 Oct, 2016

2 commits

  • Pull low-level x86 updates from Ingo Molnar:
    "In this cycle this topic tree has become one of those 'super topics'
    that accumulated a lot of changes:

    - Add CONFIG_VMAP_STACK=y support to the core kernel and enable it on
    x86 - preceded by an array of changes. v4.8 saw preparatory changes
    in this area already - this is the rest of the work. Includes the
    thread stack caching performance optimization. (Andy Lutomirski)

    - switch_to() cleanups and all around enhancements. (Brian Gerst)

    - A large number of dumpstack infrastructure enhancements and an
    unwinder abstraction. The secret long term plan is safe(r) live
    patching plus maybe another attempt at debuginfo based unwinding -
    but all these current bits are standalone enhancements in a frame
    pointer based debug environment as well. (Josh Poimboeuf)

    - More __ro_after_init and const annotations. (Kees Cook)

    - Enable KASLR for the vmemmap memory region. (Thomas Garnier)"

    [ The virtually mapped stack changes are pretty fundamental, and not
    x86-specific per se, even if they are only used on x86 right now. ]

    * 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (70 commits)
    x86/asm: Get rid of __read_cr4_safe()
    thread_info: Use unsigned long for flags
    x86/alternatives: Add stack frame dependency to alternative_call_2()
    x86/dumpstack: Fix show_stack() task pointer regression
    x86/dumpstack: Remove dump_trace() and related callbacks
    x86/dumpstack: Convert show_trace_log_lvl() to use the new unwinder
    oprofile/x86: Convert x86_backtrace() to use the new unwinder
    x86/stacktrace: Convert save_stack_trace_*() to use the new unwinder
    perf/x86: Convert perf_callchain_kernel() to use the new unwinder
    x86/unwind: Add new unwind interface and implementations
    x86/dumpstack: Remove NULL task pointer convention
    fork: Optimize task creation by caching two thread stacks per CPU if CONFIG_VMAP_STACK=y
    sched/core: Free the stack early if CONFIG_THREAD_INFO_IN_TASK
    lib/syscall: Pin the task stack in collect_syscall()
    x86/process: Pin the target stack in get_wchan()
    x86/dumpstack: Pin the target stack when dumping it
    kthread: Pin the stack via try_get_task_stack()/put_task_stack() in to_live_kthread() function
    sched/core: Add try_get_task_stack() and put_task_stack()
    x86/entry/64: Fix a minor comment rebase error
    iommu/amd: Don't put completion-wait semaphore on stack
    ...

    Linus Torvalds
     
  • Pull scheduler changes from Ingo Molnar:
    "The main changes are:

    - irqtime accounting cleanups and enhancements. (Frederic Weisbecker)

    - schedstat debugging enhancements, make it more broadly runtime
    available. (Josh Poimboeuf)

    - More work on asymmetric topology/capacity scheduling. (Morten
    Rasmussen)

    - sched/wait fixes and cleanups. (Oleg Nesterov)

    - PELT (per entity load tracking) improvements. (Peter Zijlstra)

    - Rewrite and enhance select_idle_siblings(). (Peter Zijlstra)

    - sched/numa enhancements/fixes (Rik van Riel)

    - sched/cputime scalability improvements (Stanislaw Gruszka)

    - Load calculation arithmetics fixes. (Dietmar Eggemann)

    - sched/deadline enhancements (Tommaso Cucinotta)

    - Fix utilization accounting when switching to the SCHED_NORMAL
    policy. (Vincent Guittot)

    - ... plus misc cleanups and enhancements"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits)
    sched/irqtime: Consolidate irqtime flushing code
    sched/irqtime: Consolidate accounting synchronization with u64_stats API
    u64_stats: Introduce IRQs disabled helpers
    sched/irqtime: Remove needless IRQs disablement on kcpustat update
    sched/irqtime: No need for preempt-safe accessors
    sched/fair: Fix min_vruntime tracking
    sched/debug: Add SCHED_WARN_ON()
    sched/core: Fix set_user_nice()
    sched/fair: Introduce set_curr_task() helper
    sched/core, ia64: Rename set_curr_task()
    sched/core: Fix incorrect utilization accounting when switching to fair class
    sched/core: Optimize SCHED_SMT
    sched/core: Rewrite and improve select_idle_siblings()
    sched/core: Replace sd_busy/nr_busy_cpus with sched_domain_shared
    sched/core: Introduce 'struct sched_domain_shared'
    sched/core: Restructure destroy_sched_domain()
    sched/core: Remove unused @cpu argument from destroy_sched_domain*()
    sched/wait: Introduce init_wait_entry()
    sched/wait: Avoid abort_exclusive_wait() in __wait_on_bit_lock()
    sched/wait: Avoid abort_exclusive_wait() in ___wait_event()
    ...

    Linus Torvalds
     

30 Sep, 2016

4 commits

    Rename the ia64-only set_curr_task() function to free up the name.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
    select_idle_siblings() is a known pain point for a number of workloads;
    it either does too much or not enough, and sometimes just gets it plain
    wrong.

    This rewrite attempts to address a number of issues (but sadly not
    all).

    The current code does an unconditional sched_domain iteration, with the
    intent of finding an idle core (on SMT hardware). The problems
    which this patch tries to address are:

    - it's pointless to look for idle cores if the machine is really busy;
    at that point you're just wasting cycles.

    - its behaviour is inconsistent between SMT and !SMT hardware, in that
    !SMT hardware ends up doing a scan for any idle CPU in the LLC domain,
    while SMT hardware does a scan for idle cores and, if that fails, falls
    back to a scan for idle threads on the 'target' core.

    The new code replaces the sched_domain scan with 3 explicit scans:

    1) search for an idle core in the LLC
    2) search for an idle CPU in the LLC
    3) search for an idle thread in the 'target' core

    where 1 and 3 are conditional on SMT support and 1 and 2 have runtime
    heuristics to skip the step.

    Step 1) is conditional on sd_llc_shared->has_idle_cores; when a cpu
    goes idle and sd_llc_shared->has_idle_cores is false, we scan all SMT
    siblings of the CPU going idle. Similarly, we clear
    sd_llc_shared->has_idle_cores when we fail to find an idle core.

    Step 2) tracks the average cost of the scan and compares this to the
    average idle time guesstimate for the CPU doing the wakeup. There is a
    significant fudge factor involved to deal with the variability of the
    averages. Especially hackbench was sensitive to this.

    Step 3) is unconditional; we assume (also per step 1) that scanning
    all SMT siblings in a core is 'cheap'.

    With this; SMT systems gain step 2, which cures a few benchmarks --
    notably one from Facebook.

    One 'feature' of the sched_domain iteration, which we preserve in the
    new code, is that it would start scanning from the 'target' CPU, instead
    of scanning the cpumask in cpu id order. This keeps multiple CPUs in the
    LLC that are scanning for idle from ganging up and finding the same CPU
    quite as much. The down side is that tasks can end up hopping across
    the LLC for no apparent reason.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
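
    The three explicit scans map onto code roughly like this (a sketch of the
    new select_idle_sibling() shape; the select_idle_core/cpu/smt helper names
    follow the description above, details are illustrative):

    /* kernel/sched/fair.c (sketch) */
    static int select_idle_sibling(struct task_struct *p, int prev, int target)
    {
            struct sched_domain *sd;
            int i;

            if (idle_cpu(target))
                    return target;

            sd = rcu_dereference(per_cpu(sd_llc, target));
            if (!sd)
                    return target;

            i = select_idle_core(p, sd, target);    /* 1) idle core in the LLC */
            if ((unsigned)i < nr_cpumask_bits)
                    return i;

            i = select_idle_cpu(p, sd, target);     /* 2) idle CPU in the LLC */
            if ((unsigned)i < nr_cpumask_bits)
                    return i;

            i = select_idle_smt(p, sd, target);     /* 3) idle thread in target core */
            if ((unsigned)i < nr_cpumask_bits)
                    return i;

            return target;
    }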
     
  • Move the nr_busy_cpus thing from its hacky sd->parent->groups->sgc
    location into the much more natural sched_domain_shared location.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
    Since struct sched_domain is strictly per CPU, introduce a structure
    that is shared between all 'identical' sched_domains.

    Limit to SD_SHARE_PKG_RESOURCES domains for now, as we'll only use it
    for shared cache state; if another use comes up later we can easily
    relax this.

    While the sched_groups are normally shared between CPUs, they are not
    natural to use when we need some shared state on a domain level, since
    that would require the domain to have a parent, which is not a given.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
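
    A sketch of the shared structure (the initial commit only needs the
    refcount; nr_busy_cpus and has_idle_cores come from the two related
    commits above, and the field layout here is illustrative):

    struct sched_domain_shared {
            atomic_t        ref;            /* shared by all 'identical' domains */
            atomic_t        nr_busy_cpus;   /* moved from sd->parent->groups->sgc */
            int             has_idle_cores; /* used by the select_idle_* rewrite */
    };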
     

22 Sep, 2016

2 commits

  • On fully preemptible kernels _cond_resched() is pointless, so avoid
    emitting any code for it.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mikulas Patocka
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Oleg noted that by making do_exit() use __schedule() for the TASK_DEAD
    context switch, we can avoid the TASK_DEAD special case currently in
    __schedule() because that avoids the extra preempt_disable() from
    schedule().

    In order to facilitate this, create a do_task_dead() helper which we
    place in the scheduler code, such that it can access __schedule().

    Also add some __noreturn annotations to the functions; there's no
    coming back from do_exit().

    Suggested-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Cheng Chao
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: akpm@linux-foundation.org
    Cc: chris@chris-wilson.co.uk
    Cc: tj@kernel.org
    Link: http://lkml.kernel.org/r/20160913163729.GB5012@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
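
    A sketch of the helper, close to how I recall it landing in
    kernel/sched/core.c (treat as illustrative):

    /* the final "context switch" of an exiting task, done with __schedule()
     * directly so that schedule() no longer needs a TASK_DEAD special case */
    void __noreturn do_task_dead(void)
    {
            /* causes the final put_task_struct() in finish_task_switch() */
            __set_current_state(TASK_DEAD);

            /* tell the freezer to ignore us */
            current->flags |= PF_NOFREEZE;

            __schedule(false);
            BUG();

            /* avoid "noreturn function does return" warnings */
            for (;;)
                    cpu_relax();
    }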
     

16 Sep, 2016

2 commits

  • We currently keep every task's stack around until the task_struct
    itself is freed. This means that we keep the stack allocation alive
    for longer than necessary and that, under load, we free stacks in
    big batches whenever RCU drops the last task reference. Neither of
    these is good for reuse of cache-hot memory, and freeing in batches
    prevents us from usefully caching small numbers of vmalloced stacks.

    On architectures that have thread_info on the stack, we can't easily
    change this, but on architectures that set THREAD_INFO_IN_TASK, we
    can free it as soon as the task is dead.

    Signed-off-by: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Jann Horn
    Cc: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/08ca06cde00ebed0046c5d26cbbf3fbb7ef5b812.1474003868.git.luto@kernel.org
    Signed-off-by: Ingo Molnar

    Andy Lutomirski
     
  • There are a few places in the kernel that access stack memory
    belonging to a different task. Before we can start freeing task
    stacks before the task_struct is freed, we need a way for those code
    paths to pin the stack.

    Signed-off-by: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Jann Horn
    Cc: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/17a434f50ad3d77000104f21666575e10a9c1fbd.1474003868.git.luto@kernel.org
    Signed-off-by: Ingo Molnar

    Andy Lutomirski
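
    In the simple case the new pair is trivial; a sketch (a refcounted
    variant for CONFIG_THREAD_INFO_IN_TASK comes with the later stack-freeing
    changes):

    /* pin/unpin another task's stack around accesses to it */
    static inline void *try_get_task_stack(struct task_struct *tsk)
    {
            return task_stack_page(tsk);
    }

    static inline void put_task_stack(struct task_struct *tsk) { }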
     

15 Sep, 2016

1 commit

  • If an arch opts in by setting CONFIG_THREAD_INFO_IN_TASK_STRUCT,
    then thread_info is defined as a single 'u32 flags' and is the first
    entry of task_struct. thread_info::task is removed (it serves no
    purpose if thread_info is embedded in task_struct), and
    thread_info::cpu gets its own slot in task_struct.

    This is heavily based on a patch written by Linus.

    Originally-from: Linus Torvalds
    Signed-off-by: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Jann Horn
    Cc: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/a0898196f0476195ca02713691a5037a14f2aac5.1473801993.git.luto@kernel.org
    Signed-off-by: Ingo Molnar

    Andy Lutomirski
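
    A sketch of what the opt-in looks like, following the description above
    (layout illustrative; the "thread_info: Use unsigned long for flags"
    commit in the x86 pull above later widens the flags field):

    #ifdef CONFIG_THREAD_INFO_IN_TASK
    struct thread_info {
            u32                     flags;          /* low level flags */
    };
    #define INIT_THREAD_INFO(tsk)   { .flags = 0, }
    #endif

    /* task_struct (sketch): thread_info must stay the first member so that
     * task_thread_info(tsk) can simply point at the task itself */
    struct task_struct {
    #ifdef CONFIG_THREAD_INFO_IN_TASK
            struct thread_info      thread_info;
            unsigned int            cpu;            /* was thread_info::cpu */
    #endif
            volatile long           state;
            /* ... */
    };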
     

14 Sep, 2016

1 commit

    Testing indicates that it is possible to improve performance
    significantly without increasing energy consumption too much by
    teaching cpufreq governors to bump up the CPU performance level if
    the in_iowait flag is set for the task in enqueue_task_fair().

    For this purpose, define a new cpufreq_update_util() flag
    SCHED_CPUFREQ_IOWAIT and modify enqueue_task_fair() to pass that
    flag to cpufreq_update_util() in the in_iowait case. That generally
    requires cpufreq_update_util() to be called directly from there,
    because update_load_avg() may not be invoked in that case.

    Signed-off-by: Rafael J. Wysocki
    Looks-good-to: Steve Muckle
    Acked-by: Peter Zijlstra (Intel)

    Rafael J. Wysocki
     

24 Aug, 2016

1 commit

  • If CONFIG_VMAP_STACK=y is selected, kernel stacks are allocated with
    __vmalloc_node_range().

    Grsecurity has had a similar feature (called GRKERNSEC_KSTACKOVERFLOW=y)
    for a long time.

    Signed-off-by: Andy Lutomirski
    Acked-by: Michal Hocko
    Cc: Alexander Potapenko
    Cc: Andrey Ryabinin
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: Dmitry Vyukov
    Cc: H. Peter Anvin
    Cc: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/14c07d4fd173a5b117f51e8b939f9f4323e39899.1470907718.git.luto@kernel.org
    [ Minor edits. ]
    Signed-off-by: Ingo Molnar

    Andy Lutomirski
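
    A sketch of the allocation path this selects (roughly how I recall
    kernel/fork.c after the change; the gfp flags and the stack_vm_area
    caching are illustrative details):

    static unsigned long *alloc_thread_stack_node(struct task_struct *tsk,
                                                  int node)
    {
    #ifdef CONFIG_VMAP_STACK
            void *stack = __vmalloc_node_range(THREAD_SIZE, THREAD_SIZE,
                                               VMALLOC_START, VMALLOC_END,
                                               THREADINFO_GFP | __GFP_HIGHMEM,
                                               PAGE_KERNEL, 0, node,
                                               __builtin_return_address(0));

            /* cache the vm_area so the free path can find it without
             * calling find_vm_area() from interrupt context */
            if (stack)
                    tsk->stack_vm_area = find_vm_area(stack);
            return stack;
    #else
            struct page *page = alloc_pages_node(node, THREADINFO_GFP,
                                                 THREAD_SIZE_ORDER);

            return page ? page_address(page) : NULL;
    #endif
    }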
     

18 Aug, 2016

1 commit

  • Add a topology flag to the sched_domain hierarchy indicating the lowest
    domain level where the full range of CPU capacities is represented by
    the domain members for asymmetric capacity topologies (e.g. ARM
    big.LITTLE).

    The flag is intended to indicate that extra care should be taken when
    placing tasks on CPUs and this level spans all the different types of
    CPUs found in the system (no need to look further up the domain
    hierarchy). This information is currently only available through
    iterating through the capacities of all the CPUs at parent levels in the
    sched_domain hierarchy.

    SD 2 [ 0 1 2 3] SD_ASYM_CPUCAPACITY

    SD 1 [ 0 1] [ 2 3] !SD_ASYM_CPUCAPACITY

    CPU: 0 1 2 3
    capacity: 756 756 1024 1024

    If the topology in the example above is duplicated to create an eight
    CPU example with third sched_domain level on top (SD 3), this level
    should not have the flag set (!SD_ASYM_CPUCAPACITY) as its two groups
    would both have all CPU capacities represented within them.

    Signed-off-by: Morten Rasmussen
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: dietmar.eggemann@arm.com
    Cc: freedom.tan@mediatek.com
    Cc: keita.kobayashi.ym@renesas.com
    Cc: mgalbraith@suse.de
    Cc: sgurrappadi@nvidia.com
    Cc: vincent.guittot@linaro.org
    Cc: yuyang.du@intel.com
    Link: http://lkml.kernel.org/r/1469453670-2660-6-git-send-email-morten.rasmussen@arm.com
    Signed-off-by: Ingo Molnar

    Morten Rasmussen
     

17 Aug, 2016

1 commit

  • It is useful to know the reason why cpufreq_update_util() has just
    been called and that can be passed as flags to cpufreq_update_util()
    and to the ->func() callback in struct update_util_data. However,
    doing that in addition to passing the util and max arguments they
    already take would be clumsy, so avoid it.

    Instead, use the observation that the schedutil governor is part
    of the scheduler proper, so it can access scheduler data directly.
    This allows the util and max arguments of cpufreq_update_util()
    and the ->func() callback in struct update_util_data to be replaced
    with a flags one, but schedutil has to be modified to follow.

    Thus make the schedutil governor obtain the CFS utilization
    information from the scheduler and use the "RT" and "DL" flags
    instead of the special utilization value of ULONG_MAX to track
    updates from the RT and DL sched classes. Make it non-modular
    too to avoid having to export scheduler variables to modules at
    large.

    Next, update all of the other users of cpufreq_update_util()
    and the ->func() callback in struct update_util_data accordingly.

    Suggested-by: Peter Zijlstra
    Signed-off-by: Rafael J. Wysocki
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Viresh Kumar

    Rafael J. Wysocki
     

10 Aug, 2016

2 commits

  • This message is currently really useless since it always prints a value
    that comes from the printk() we just did, e.g.:

    BUG: sleeping function called from invalid context at mm/slab.h:388
    in_atomic(): 0, irqs_disabled(): 0, pid: 31996, name: trinity-c1
    Preemption disabled at:[] down_trylock+0x13/0x80

    BUG: sleeping function called from invalid context at include/linux/freezer.h:56
    in_atomic(): 0, irqs_disabled(): 0, pid: 31996, name: trinity-c1
    Preemption disabled at:[] console_unlock+0x2f7/0x930

    Here, both down_trylock() and console_unlock() are somewhere in the
    printk() path.

    We should save the value before calling printk() and use the saved value
    instead. That immediately reveals the offending callsite:

    BUG: sleeping function called from invalid context at mm/slab.h:388
    in_atomic(): 0, irqs_disabled(): 0, pid: 14971, name: trinity-c2
    Preemption disabled at:[] rhashtable_walk_start+0x46/0x150

    Bug report:

    http://marc.info/?l=linux-netdev&m=146925979821849&w=2

    Signed-off-by: Vegard Nossum
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rusty Russell
    Cc: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Vegard Nossum
     
    It seems that this one escaped Nico's renaming of cpu_power to
    cpu_capacity a while back.

    Signed-off-by: Morten Rasmussen
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: dietmar.eggemann@arm.com
    Cc: linux-kernel@vger.kernel.org
    Cc: mgalbraith@suse.de
    Cc: vincent.guittot@linaro.org
    Cc: yuyang.du@intel.com
    Link: http://lkml.kernel.org/r/1466615004-3503-2-git-send-email-morten.rasmussen@arm.com
    Signed-off-by: Ingo Molnar

    Morten Rasmussen
     

03 Aug, 2016

1 commit

  • In general, there's no need for the "restore sigmask" flag to live in
    ti->flags. alpha, ia64, microblaze, powerpc, sh, sparc (64-bit only),
    tile, and x86 use essentially identical alternative implementations,
    placing the flag in ti->status.

    Replace those optimized implementations with an equally good common
    implementation that stores it in a bitfield in struct task_struct and
    drop the custom implementations.

    Additional architectures can opt in by removing their
    TIF_RESTORE_SIGMASK defines.

    Link: http://lkml.kernel.org/r/8a14321d64a28e40adfddc90e18a96c086a6d6f9.1468522723.git.luto@kernel.org
    Signed-off-by: Andy Lutomirski
    Tested-by: Michael Ellerman [powerpc]
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Michal Simek
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Yoshinori Sato
    Cc: Rich Felker
    Cc: "David S. Miller"
    Cc: Chris Metcalf
    Cc: Peter Zijlstra
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dmitry Safonov
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
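
    A sketch of the generic replacement described above (one task_struct bit
    plus inline helpers; exact placement illustrative):

    /* in struct task_struct: */
    unsigned                        restore_sigmask:1;

    /* generic helpers, used by architectures without TIF_RESTORE_SIGMASK: */
    static inline void set_restore_sigmask(void)
    {
            current->restore_sigmask = true;
            WARN_ON(!test_thread_flag(TIF_SIGPENDING));
    }

    static inline bool test_restore_sigmask(void)
    {
            return current->restore_sigmask;
    }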
     

29 Jul, 2016

2 commits

  • oom_reaper relies on the mmap_sem for read to do its job. Many places
    which might block readers have been converted to use down_write_killable
    and that has reduced the chances of contention a lot. Some paths where
    the mmap_sem is held for write can take other locks and they might
    either be not prepared to fail due to fatal signal pending or too
    impractical to be changed.

    This patch introduces the MMF_OOM_NOT_REAPABLE flag, which gets set after
    the first attempt to reap a task's mm fails. If the flag is already set
    when a later attempt fails as well, we set MMF_OOM_REAPED to hide this mm
    from the oom killer completely so it can go and choose another victim.

    As a result, the risk of an OOM deadlock, where the oom victim is blocked
    indefinitely and so the oom killer cannot make any progress, should be
    mitigated considerably, while we still try really hard to perform all
    reclaim attempts and stay predictable in behavior.

    Link: http://lkml.kernel.org/r/1466426628-15074-10-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Oleg Nesterov
    Cc: Vladimir Davydov
    Cc: David Rientjes
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    vforked tasks are not really sitting on any memory. They share the mm
    with the parent until they exec into new code; until then they are just
    pinning the address space. The OOM killer will kill the vforked task
    along with its parent, but we can still end up selecting the vforked
    task when the parent wouldn't be selected, e.g. init doing vfork to
    launch a task, or a vforked child of an oom-unkillable task whose
    oom_score_adj was updated to make it killable.

    Add a new helper to check whether a task is in the vfork sharing memory
    with its parent and use it in oom_badness to skip over these tasks.

    Link: http://lkml.kernel.org/r/1466426628-15074-6-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Oleg Nesterov
    Cc: Vladimir Davydov
    Cc: David Rientjes
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
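
    A sketch of the helper, as I recall it being added to
    include/linux/sched.h (illustrative):

    /* true while @tsk is in the middle of vfork(): it still shares the mm
     * with its real parent and is merely pinning the address space */
    static inline bool in_vfork(struct task_struct *tsk)
    {
            bool ret;

            /* ->real_parent needs RCU when CLONE_VM was used with CLONE_PARENT */
            rcu_read_lock();
            ret = tsk->vfork_done && tsk->real_parent->mm == tsk->mm;
            rcu_read_unlock();

            return ret;
    }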
     

27 Jun, 2016

2 commits

  • Vincent and Yuyang found another few scenarios in which entity
    tracking goes wobbly.

    The scenarios are basically due to the fact that new tasks are not
    immediately attached and thereby differ from the normal situation -- a
    task is always attached to a cfs_rq load average (such that it
    includes its blocked contribution) and is explicitly
    detached/attached on migration to another cfs_rq.

    Scenario 1: switch to fair class

    p->sched_class = fair_class;
    if (queued)
      enqueue_task(p);
        ...
          enqueue_entity()
            enqueue_entity_load_avg()
              migrated = !sa->last_update_time (true)
              if (migrated)
                attach_entity_load_avg()
    check_class_changed()
      switched_from() (!fair)
      switched_to()   (fair)
        switched_to_fair()
          attach_entity_load_avg()

    If @p is a new task that hasn't been fair before, it will have
    !last_update_time and, per the above, end up in
    attach_entity_load_avg() _twice_.

    Scenario 2: change between cgroups

    sched_move_group(p)
      if (queued)
        dequeue_task()
      task_move_group_fair()
        detach_task_cfs_rq()
          detach_entity_load_avg()
        set_task_rq()
        attach_task_cfs_rq()
          attach_entity_load_avg()
      if (queued)
        enqueue_task();
          ...
            enqueue_entity()
              enqueue_entity_load_avg()
                migrated = !sa->last_update_time (true)
                if (migrated)
                  attach_entity_load_avg()

    Similar as with scenario 1, if @p is a new task, it will have
    !last_update_time and we'll end up in attach_entity_load_avg()
    _twice_.

    Furthermore, notice how we do a detach_entity_load_avg() on something
    that wasn't attached to begin with.

    As stated above; the problem is that the new task isn't yet attached
    to the load tracking and thereby violates the invariant assumption.

    This patch remedies this by ensuring a new task is indeed properly
    attached to the load tracking on creation, through
    post_init_entity_util_avg().

    Of course, this isn't entirely as straightforward as one might think,
    since the task is hashed before we call wake_up_new_task() and thus
    can be poked at. We avoid this by adding TASK_NEW and teaching
    cpu_cgroup_can_attach() to refuse such tasks.

    Reported-by: Yuyang Du
    Reported-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
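
    A sketch of the TASK_NEW plumbing mentioned at the end (locations as I
    recall them; illustrative):

    /* sched_fork(): the task is already hashed but must not be poked at */
    p->state = TASK_NEW;

    /* wake_up_new_task(): now attach it to the load tracking and let it run */
    p->state = TASK_RUNNING;
    post_init_entity_util_avg(&p->se);

    /* cpu_cgroup_can_attach(): refuse tasks that are not fully set up yet */
    if (task->state == TASK_NEW)
            return -EINVAL;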
     
  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     

25 Jun, 2016

1 commit

  • We've had the thread info allocated together with the thread stack for
    most architectures for a long time (since the thread_info was split off
    from the task struct), but that is about to change.

    But the patches that move the thread info to be off-stack (and a part of
    the task struct instead) made it clear how confused the allocator and
    freeing functions are.

    Because the common case was that we share an allocation with the thread
    stack and the thread_info, the two pointers were identical. That
    identity then meant that we would have things like

    ti = alloc_thread_info_node(tsk, node);
    ...
    tsk->stack = ti;

    which certainly _worked_ (since stack and thread_info have the same
    value), but is rather confusing: why are we assigning a thread_info to
    the stack? And if we move the thread_info away, the "confusing" code
    just gets to be entirely bogus.

    So remove all this confusion, and make it clear that we are doing the
    stack allocation by renaming and clarifying the function names to be
    about the stack. The fact that the thread_info then shares the
    allocation is an implementation detail, and not really about the
    allocation itself.

    This is a pure renaming and type fix: we pass in the same pointer, it's
    just that we clarify what the pointer means.

    The ia64 code that actually only has one single allocation (for all of
    task_struct, thread_info and kernel thread stack) now looks a bit odd,
    but since "tsk->stack" is actually not even used there, that oddity
    doesn't matter. Cleaning that up would be a separate change; I
    intentionally left the ia64 changes as a pure brute-force renaming and
    type change.

    Acked-by: Andy Lutomirski
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

03 Jun, 2016

1 commit

  • Generally task_struct is only protected by RCU if it was found on a
    RCU protected list (say, for_each_process() or find_task_by_vpid()).

    As Kirill pointed out, rq->curr isn't protected by RCU: the scheduler
    drops the (potentially) last reference without an RCU grace period,
    which means that we need to fix the code which uses foreign_rq->curr
    under rcu_read_lock().

    Add a new helper which can be used to dereference rq->curr or any
    other pointer to task_struct assuming that it should be cleared or
    updated before the final put_task_struct(). It returns non-NULL
    only if this task can't go away before rcu_read_unlock().

    ( Also add try_get_task_struct() to make it easier to use this API
    correctly. )

    Suggested-by: Kirill Tkhai
    Signed-off-by: Oleg Nesterov
    [ Updated comments; added try_get_task_struct()]
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Kirill Tkhai
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Vladimir Davydov
    Link: http://lkml.kernel.org/r/20160518170218.GY3192@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
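
    A sketch of the convenience wrapper mentioned in parentheses above, built
    on the new task_rcu_dereference() helper (as I recall it in kernel/exit.c):

    struct task_struct *try_get_task_struct(struct task_struct **ptask)
    {
            struct task_struct *task;

            rcu_read_lock();
            task = task_rcu_dereference(ptask);
            if (task)
                    get_task_struct(task);
            rcu_read_unlock();

            return task;
    }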
     

27 May, 2016

1 commit

  • mmput_async is currently used only from the oom_reaper which is defined
    only for CONFIG_MMU. We can save work_struct in mm_struct for
    !CONFIG_MMU.

    [akpm@linux-foundation.org: fix typo, per Minchan]
    Link: http://lkml.kernel.org/r/20160520061658.GB19172@dhcp22.suse.cz
    Reported-by: Minchan Kim
    Signed-off-by: Michal Hocko
    Acked-by: Minchan Kim
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

26 May, 2016

1 commit


25 May, 2016

1 commit

  • Commit:

    b5179ac70de8 ("sched/fair: Prepare to fix fairness problems on migration")

    ... introduced a bug: Mike Galbraith found that it introduced a
    performance regression, while Paul E. McKenney reported lost
    wakeups and bisected it to this commit.

    The reason is that I mis-read ttwu_queue() such that I assumed any
    wakeup that got a remote queue must have had the task migrated.

    Since this is not so, we need to transfer this information between
    queueing the wakeup and actually doing the wakeup. Use a new
    task_struct::sched_flag for this; we already write to
    sched_contributes_to_load in the wakeup path, so this is a hot and
    already-modified cacheline.

    Reported-by: Paul E. McKenney
    Reported-by: Mike Galbraith
    Tested-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Hunter
    Cc: Andy Lutomirski
    Cc: Ben Segall
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dave Hansen
    Cc: Denys Vlasenko
    Cc: Fenghua Yu
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Matt Fleming
    Cc: Morten Rasmussen
    Cc: Oleg Nesterov
    Cc: Paul Turner
    Cc: Pavan Kondeti
    Cc: Peter Zijlstra
    Cc: Quentin Casasnovas
    Cc: Thomas Gleixner
    Cc: byungchul.park@lge.com
    Fixes: b5179ac70de8 ("sched/fair: Prepare to fix fairness problems on migration")
    Link: http://lkml.kernel.org/r/20160523091907.GD15728@worktop.ger.corp.intel.com
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
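
    A sketch of how the migration information now travels with the remote
    wakeup (the field name comes from the text above; the ttwu call sites and
    the cookie argument are how I recall them, treat as illustrative):

    /* ttwu_queue_remote() (sketch): record whether this wakeup migrated p */
    p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);

    /* sched_ttwu_pending() on the remote CPU (sketch): replay WF_MIGRATED */
    ttwu_do_activate(rq, p, p->sched_remote_wakeup ? WF_MIGRATED : 0, cookie);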
     

24 May, 2016

1 commit

  • Currently the size of "struct signal_struct"->oom_flags member is
    sizeof(unsigned) bytes, but only one flag OOM_FLAG_ORIGIN which is
    updated by current thread is defined. We can convert OOM_FLAG_ORIGIN
    into a bool, and reuse the saved bytes for updating from the OOM killer
    and/or the OOM reaper thread.

    By the way, do we care about a race window between run_store() and
    swapoff() because it would be theoretically possible that two threads
    sharing the "struct signal_struct" concurrently call respective
    functions? If we care, we can make oom_flags an atomic_t.

    Signed-off-by: Tetsuo Handa
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     

21 May, 2016

1 commit

  • Merge more updates from Andrew Morton:

    - the rest of MM

    - KASAN updates

    - procfs updates

    - exit, fork updates

    - printk updates

    - lib/ updates

    - radix-tree testsuite updates

    - checkpatch updates

    - kprobes updates

    - a few other misc bits

    * emailed patches from Andrew Morton : (162 commits)
    samples/kprobes: print out the symbol name for the hooks
    samples/kprobes: add a new module parameter
    kprobes: add the "tls" argument for j_do_fork
    init/main.c: simplify initcall_blacklisted()
    fs/efs/super.c: fix return value
    checkpatch: improve --git shortcut
    checkpatch: reduce number of `git log` calls with --git
    checkpatch: add support to check already applied git commits
    checkpatch: add --list-types to show message types to show or ignore
    checkpatch: advertise the --fix and --fix-inplace options more
    checkpatch: whine about ACCESS_ONCE
    checkpatch: add test for keywords not starting on tabstops
    checkpatch: improve CONSTANT_COMPARISON test for structure members
    checkpatch: add PREFER_IS_ENABLED test
    lib/GCD.c: use binary GCD algorithm instead of Euclidean
    radix-tree: free up the bottom bit of exceptional entries for reuse
    dax: move RADIX_DAX_ definitions to dax.c
    radix-tree: make radix_tree_descend() more useful
    radix-tree: introduce radix_tree_replace_clear_tags()
    radix-tree: tidy up __radix_tree_create()
    ...

    Linus Torvalds