15 Dec, 2016

1 commit

  • Pull namespace updates from Eric Biederman:
    "After a lot of discussion and work we have finally reachanged a basic
    understanding of what is necessary to make unprivileged mounts safe in
    the presence of EVM and IMA xattrs which the last commit in this
    series reflects. While technically it is a revert the comments it adds
    are important for people not getting confused in the future. Clearing
    up that confusion allows us to seriously work on unprivileged mounts
    of fuse in the next development cycle.

    The rest of the fixes in this set are in the intersection of user
    namespaces, ptrace, and exec. I started with the first fix which
    started a feedback cycle of finding additional issues during review
    and fixing them, culminating in a fix for a bug that has been present
    since at least Linux v1.0.

    Potentially these fixes were candidates for being merged during the rc
    cycle, and are certainly backport candidates but enough little things
    turned up during review and testing that I decided they should be
    handled as part of the normal development process just to be certain
    there were not any great surprises when it came time to backport some
    of these fixes"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    Revert "evm: Translate user/group ids relative to s_user_ns when computing HMAC"
    exec: Ensure mm->user_ns contains the execed files
    ptrace: Don't allow accessing an undumpable mm
    ptrace: Capture the ptracer's creds not PT_PTRACE_CAP
    mm: Add a user_ns owner to mm_struct and fix ptrace permission checks

    Linus Torvalds
     

14 Dec, 2016

1 commit

  • Pull power management updates from Rafael Wysocki:
    "Again, cpufreq gets more changes than the other parts this time (one
    new driver, one old driver less, a bunch of enhancements of the
    existing code, new CPU IDs, fixes, cleanups)

    There also are some changes in cpuidle (idle injection rework, a
    couple of new CPU IDs, online/offline rework in intel_idle, fixes and
    cleanups), in the generic power domains framework (mostly related to
    supporting power domains containing CPUs), and in the Operating
    Performance Points (OPP) library (mostly related to supporting devices
    with multiple voltage regulators)

    In addition to that, the system sleep state selection interface is
    modified to make it easier for distributions with unchanged user space
    to support suspend-to-idle as the default system suspend method, some
    issues are fixed in the PM core, the latency tolerance PM QoS
    framework is improved a bit, the Intel RAPL power capping driver is
    cleaned up and there are some fixes and cleanups in the devfreq
    subsystem

    Specifics:

    - New cpufreq driver for Broadcom STB SoCs and a Device Tree binding
    for it (Markus Mayer)

    - Support for ARM Integrator/AP and Integrator/CP in the generic DT
    cpufreq driver and elimination of the old Integrator cpufreq driver
    (Linus Walleij)

    - Support for the zx296718, r8a7743 and r8a7745, Socionext UniPhier,
    and PXA SoCs in the generic DT cpufreq driver (Baoyou Xie,
    Geert Uytterhoeven, Masahiro Yamada, Robert Jarzmik)

    - cpufreq core fix to eliminate races that may lead to using inactive
    policy objects and related cleanups (Rafael Wysocki)

    - cpufreq schedutil governor update to make it use SCHED_FIFO kernel
    threads (instead of regular workqueues) for doing delayed work (to
    reduce the response latency in some cases) and related cleanups
    (Viresh Kumar)

    - New cpufreq sysfs attribute for resetting statistics (Markus Mayer)

    - cpufreq governors fixes and cleanups (Chen Yu, Stratos Karafotis,
    Viresh Kumar)

    - Support for using generic cpufreq governors in the intel_pstate
    driver (Rafael Wysocki)

    - Support for per-logical-CPU P-state limits and the EPP/EPB (Energy
    Performance Preference/Energy Performance Bias) knobs in the
    intel_pstate driver (Srinivas Pandruvada)

    - New CPU ID for Knights Mill in intel_pstate (Piotr Luc)

    - intel_pstate driver modification to use the P-state selection
    algorithm based on CPU load on platforms with the system profile in
    the ACPI tables set to "mobile" (Srinivas Pandruvada)

    - intel_pstate driver cleanups (Arnd Bergmann, Rafael Wysocki,
    Srinivas Pandruvada)

    - cpufreq powernv driver updates including fast switching support
    (for the schedutil governor), fixes and cleanups (Akshay Adiga,
    Andrew Donnellan, Denis Kirjanov)

    - acpi-cpufreq driver rework to switch it over to the new CPU
    offline/online state machine (Sebastian Andrzej Siewior)

    - Assorted cleanups in cpufreq drivers (Wei Yongjun, Prashanth
    Prakash)

    - Idle injection rework (to make it use the regular idle path instead
    of a home-grown custom one) and related powerclamp thermal driver
    updates (Peter Zijlstra, Jacob Pan, Petr Mladek, Sebastian Andrzej
    Siewior)

    - New CPU IDs for Atom Z34xx and Knights Mill in intel_idle (Andy
    Shevchenko, Piotr Luc)

    - intel_idle driver cleanups and switch over to using the new CPU
    offline/online state machine (Anna-Maria Gleixner, Sebastian
    Andrzej Siewior)

    - cpuidle DT driver update to support suspend-to-idle properly
    (Sudeep Holla)

    - cpuidle core cleanups and misc updates (Daniel Lezcano, Pan Bian,
    Rafael Wysocki)

    - Preliminary support for power domains including CPUs in the generic
    power domains (genpd) framework and related DT bindings (Lina Iyer)

    - Assorted fixes and cleanups in the generic power domains (genpd)
    framework (Colin Ian King, Dan Carpenter, Geert Uytterhoeven)

    - Preliminary support for devices with multiple voltage regulators
    and related fixes and cleanups in the Operating Performance Points
    (OPP) library (Viresh Kumar, Masahiro Yamada, Stephen Boyd)

    - System sleep state selection interface rework to make it easier to
    support suspend-to-idle as the default system suspend method
    (Rafael Wysocki)

    - PM core fixes and cleanups, mostly related to the interactions
    between the system suspend and runtime PM frameworks (Ulf Hansson,
    Sahitya Tummala, Tony Lindgren)

    - Latency tolerance PM QoS framework improvements (Andrew Lutomirski)

    - New Knights Mill CPU ID for the Intel RAPL power capping driver
    (Piotr Luc)

    - Intel RAPL power capping driver fixes, cleanups and switch over to
    using the new CPU offline/online state machine (Jacob Pan, Thomas
    Gleixner, Sebastian Andrzej Siewior)

    - Fixes and cleanups in the exynos-ppmu, exynos-nocp, rk3399_dmc,
    rockchip-dfi devfreq drivers and the devfreq core (Axel Lin,
    Chanwoo Choi, Javier Martinez Canillas, MyungJoo Ham, Viresh Kumar)

    - Fix for false-positive KASAN warnings during resume from ACPI S3
    (suspend-to-RAM) on x86 (Josh Poimboeuf)

    - Memory map verification during resume from hibernation on x86 to
    ensure a consistent address space layout (Chen Yu)

    - Wakeup sources debugging enhancement (Xing Wei)

    - rockchip-io AVS driver cleanup (Shawn Lin)"

    * tag 'pm-4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (127 commits)
    devfreq: rk3399_dmc: Don't use OPP structures outside of RCU locks
    devfreq: rk3399_dmc: Remove dangling rcu_read_unlock()
    devfreq: exynos: Don't use OPP structures outside of RCU locks
    Documentation: intel_pstate: Document HWP energy/performance hints
    cpufreq: intel_pstate: Support for energy performance hints with HWP
    cpufreq: intel_pstate: Add locking around HWP requests
    PM / sleep: Print active wakeup sources when blocking on wakeup_count reads
    PM / core: Fix bug in the error handling of async suspend
    PM / wakeirq: Fix dedicated wakeirq for drivers not using autosuspend
    PM / Domains: Fix compatible for domain idle state
    PM / OPP: Don't WARN on multiple calls to dev_pm_opp_set_regulators()
    PM / OPP: Allow platform specific custom set_opp() callbacks
    PM / OPP: Separate out _generic_set_opp()
    PM / OPP: Add infrastructure to manage multiple regulators
    PM / OPP: Pass struct dev_pm_opp_supply to _set_opp_voltage()
    PM / OPP: Manage supply's voltage/current in a separate structure
    PM / OPP: Don't use OPP structure outside of rcu protected section
    PM / OPP: Reword binding supporting multiple regulators per device
    PM / OPP: Fix incorrect cpu-supply property in binding
    cpuidle: Add a kerneldoc comment to cpuidle_use_deepest_state()
    ...

    Linus Torvalds
     

13 Dec, 2016

2 commits

  • This limitation came with the reason to remove "another way for
    malicious code to obscure a compromised program and masquerade as a
    benign process" by allowing "security-concious program can use this
    prctl once during its early initialization to ensure the prctl cannot
    later be abused for this purpose":

    http://marc.info/?l=linux-kernel&m=133160684517468&w=2

    This explanation doesn't look sufficient. The only thing the "exe" link
    indicates is the file that was used for execve, which is basically
    nothing and is not reliable immediately after the process has returned
    from the execve system call.

    Moreover, to use this feature, all the mappings to the previous exe file
    have to be unmapped and all the new exe file's permissions must be
    satisfied.

    This means that changing the exe link is very similar to calling execve
    on the binary.

    The need to remove this limitation comes from migration of an NFS mount
    point, which is not accessible during restore and is replaced by another
    file system. Because of this, the exe link has to be changed twice.
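
    For reference, the exe link is changed via prctl(); a minimal usage
    sketch (the path and the absence of error handling are illustrative,
    not part of the patch):

        #include <sys/prctl.h>
        #include <fcntl.h>
        #include <unistd.h>

        /* Point /proc/self/exe at a different file. With the one-shot
         * limitation removed, this can be done more than once, e.g. twice
         * during checkpoint/restore of an NFS-backed binary. */
        int fd = open("/path/to/new_exe", O_RDONLY);
        prctl(PR_SET_MM, PR_SET_MM_EXE_FILE, (unsigned long)fd, 0, 0);
        close(fd);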

    [akpm@linux-foundation.org: fix up comment]
    Link: http://lkml.kernel.org/r/20160927153755.9337.69650.stgit@localhost.localdomain
    Signed-off-by: Stanislav Kinsburskiy
    Acked-by: Oleg Nesterov
    Acked-by: Cyrill Gorcunov
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Michal Hocko
    Cc: Kees Cook
    Cc: Andy Lutomirski
    Cc: John Stultz
    Cc: Matt Helsley
    Cc: Pavel Emelyanov
    Cc: Vlastimil Babka
    Cc: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stanislav Kinsburskiy
     
  • Pull scheduler updates from Ingo Molnar:
    "The main scheduler changes in this cycle were:

    - support Intel Turbo Boost Max Technology 3.0 (TBM3) by introducing a
    notion of 'better cores', which the scheduler will prefer to
    schedule single threaded workloads on. (Tim Chen, Srinivas
    Pandruvada)

    - enhance the handling of asymmetric capacity CPUs further (Morten
    Rasmussen)

    - improve/fix load handling when moving tasks between task groups
    (Vincent Guittot)

    - simplify and clean up the cputime code (Stanislaw Gruszka)

    - improve mass fork()ed task spread a.k.a. hackbench speedup (Vincent
    Guittot)

    - make struct kthread kmalloc()ed and related fixes (Oleg Nesterov)

    - add uaccess atomicity debugging (when using access_ok() in the
    wrong context), under CONFIG_DEBUG_ATOMIC_SLEEP=y (Peter Zijlstra)

    - implement various fixes, cleanups and other enhancements (Daniel
    Bristot de Oliveira, Martin Schwidefsky, Rafael J. Wysocki)"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (41 commits)
    sched/core: Use load_avg for selecting idlest group
    sched/core: Fix find_idlest_group() for fork
    kthread: Don't abuse kthread_create_on_cpu() in __kthread_create_worker()
    kthread: Don't use to_live_kthread() in kthread_[un]park()
    kthread: Don't use to_live_kthread() in kthread_stop()
    Revert "kthread: Pin the stack via try_get_task_stack()/put_task_stack() in to_live_kthread() function"
    kthread: Make struct kthread kmalloc'ed
    x86/uaccess, sched/preempt: Verify access_ok() context
    sched/x86: Make CONFIG_SCHED_MC_PRIO=y easier to enable
    sched/x86: Change CONFIG_SCHED_ITMT to CONFIG_SCHED_MC_PRIO
    x86/sched: Use #include instead of #include
    cpufreq/intel_pstate: Use CPPC to get max performance
    acpi/bus: Set _OSC for diverse core support
    acpi/bus: Enable HWP CPPC objects
    x86/sched: Add SD_ASYM_PACKING flags to x86 ITMT CPU
    x86/sysctl: Add sysctl for ITMT scheduling feature
    x86: Enable Intel Turbo Boost Max Technology 3.0
    x86/topology: Define x86's arch_update_cpu_topology
    sched: Extend scheduler's asym packing
    sched/fair: Clean up the tunable parameter definitions
    ...

    Linus Torvalds
     

29 Nov, 2016

1 commit

  • Idle injection drivers such as Intel powerclamp and ACPI PAD drivers use
    realtime tasks to take control of CPU then inject idle. There are two
    issues with this approach:

    1. Low efficiency: the injected idle task is treated as busy, so sched
    ticks do not stop during the injected idle period; these unwanted
    wakeups can cost ~20% of the power savings.

    2. Idle accounting: injected idle time is presented to the user as busy.

    This patch addresses the issues by introducing a new PF_IDLE flag which
    allows any given task to be treated as an idle task while the flag is set.
    Therefore, idle injection tasks can run through the normal flow of NOHZ
    idle enter/exit to get the correct accounting as well as tick stop when
    possible.

    The implication is that the idle task is then no longer limited to PID == 0.
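
    A minimal sketch of the flag-based test this enables (assuming PF_IDLE
    lives in task_struct::flags, as described above):

        /* A task counts as an idle task whenever PF_IDLE is set, rather
         * than only when PID == 0. */
        static inline bool is_idle_task(const struct task_struct *p)
        {
                return !!(p->flags & PF_IDLE);
        }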

    Acked-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Jacob Pan
    Signed-off-by: Rafael J. Wysocki

    Peter Zijlstra
     

24 Nov, 2016

1 commit

  • We generalize the scheduler's asym packing to provide an ordering of
    the cpus beyond just the cpu number. This allows the ASYM_PACKING
    scheduler machinery to move loads to the preferred cpu in a sched
    domain. The preference is defined by the cpu priority given by
    arch_asym_cpu_priority(cpu).

    We also record the most preferred cpu in a sched group when we build
    the cpu's capacity, for fast lookup of the preferred cpu during load
    balancing.
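
    One plausible default for the new hook, sketched here on the assumption
    that architectures which do not override it should keep the historical
    "lower cpu number wins" ordering:

        /* Weak default: architectures override this to express their own
         * CPU priority ordering (e.g. ITMT on x86). */
        int __weak arch_asym_cpu_priority(int cpu)
        {
                return -cpu;
        }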

    Co-developed-by: Peter Zijlstra (Intel)
    Signed-off-by: Tim Chen
    Acked-by: Peter Zijlstra (Intel)
    Cc: linux-pm@vger.kernel.org
    Cc: jolsa@redhat.com
    Cc: rjw@rjwysocki.net
    Cc: linux-acpi@vger.kernel.org
    Cc: Srinivas Pandruvada
    Cc: bp@suse.de
    Link: http://lkml.kernel.org/r/0e73ae12737dfaafa46c07066cc7c5d3f1675e46.1479844244.git.tim.c.chen@linux.intel.com
    Signed-off-by: Thomas Gleixner

    Tim Chen
     

23 Nov, 2016

2 commits

  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • When the flag PT_PTRACE_CAP was added the PTRACE_TRACEME path was
    overlooked. This can result in incorrect behavior when an application
    like strace traces an exec of a setuid executable.

    Further PT_PTRACE_CAP does not have enough information for making good
    security decisions as it does not report which user namespace the
    capability is in. This has already allowed one mistake through
    insufficient granularity.

    I found this issue when I was testing another corner case of exec and
    discovered that I could not get strace to set PT_PTRACE_CAP even when
    running strace as root with a full set of caps.

    This change fixes the above issue with strace, allowing stracing a
    setuid executable as root without disabling setuid. More fundamentally,
    this change allows what is allowable at all times, by using the correct
    information in its decision.

    Cc: stable@vger.kernel.org
    Fixes: 4214e42f96d4 ("v2.4.9.11 -> v2.4.9.12")
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

22 Nov, 2016

2 commits

  • This patch is the first step in adding support to improve lock holder
    preemption behaviour.

    vcpu_is_preempted(cpu) does the obvious thing: it tells us whether a
    vCPU is preempted or not.

    It defaults to false on architectures that don't support it.
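
    A minimal sketch of that "defaults to false" fallback (the exact form in
    the tree may differ):

        /* Architectures that can tell whether a vCPU has been preempted
         * define their own vcpu_is_preempted(); everyone else gets a
         * conservative "no" answer. */
        #ifndef vcpu_is_preempted
        #define vcpu_is_preempted(cpu)  false
        #endif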

    Suggested-by: Peter Zijlstra (Intel)
    Tested-by: Juergen Gross
    Signed-off-by: Pan Xinhui
    Signed-off-by: Peter Zijlstra (Intel)
    [ Translated the changelog to English. ]
    Acked-by: Christian Borntraeger
    Acked-by: Paolo Bonzini
    Cc: David.Laight@ACULAB.COM
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: benh@kernel.crashing.org
    Cc: boqun.feng@gmail.com
    Cc: bsingharora@gmail.com
    Cc: dave@stgolabs.net
    Cc: kernellwp@gmail.com
    Cc: konrad.wilk@oracle.com
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: mpe@ellerman.id.au
    Cc: paulmck@linux.vnet.ibm.com
    Cc: paulus@samba.org
    Cc: rkrcmar@redhat.com
    Cc: virtualization@lists.linux-foundation.org
    Cc: will.deacon@arm.com
    Cc: xen-devel-request@lists.xenproject.org
    Cc: xen-devel@lists.xenproject.org
    Link: http://lkml.kernel.org/r/1478077718-37424-2-git-send-email-xinhui.pan@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Pan Xinhui
     
  • Exactly because for_each_thread() in autogroup_move_group() can't see it
    and update its ->sched_task_group before _put() and possibly free().

    So the exiting task needs another sched_move_task() before exit_notify()
    and we need to re-introduce the PF_EXITING (or similar) check removed by
    the previous change for another reason.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: hartsjc@redhat.com
    Cc: vbendel@redhat.com
    Cc: vlovejoy@redhat.com
    Link: http://lkml.kernel.org/r/20161114184612.GA15968@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     

21 Nov, 2016

1 commit

  • Currently the wake_q data structure is defined by the WAKE_Q() macro.
    This macro, however, looks like a function doing something, as "wake" is
    a verb. Even checkpatch.pl was confused as it reported warnings like

    WARNING: Missing a blank line after declarations
    #548: FILE: kernel/futex.c:3665:
    + int ret;
    + WAKE_Q(wake_q);

    This patch renames the WAKE_Q() macro to DEFINE_WAKE_Q() which clarifies
    what the macro is doing and eliminates the checkpatch.pl warnings.
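
    The rename in use, assuming the macro body itself is unchanged:

        /* Before: reads like a function call and confuses checkpatch. */
        WAKE_Q(wake_q);

        /* After: clearly a declaration, matching DEFINE_MUTEX() and friends. */
        DEFINE_WAKE_Q(wake_q);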

    Signed-off-by: Waiman Long
    Acked-by: Davidlohr Bueso
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1479401198-1765-1-git-send-email-longman@redhat.com
    [ Resolved conflict and added missing rename. ]
    Signed-off-by: Ingo Molnar

    Waiman Long
     

17 Nov, 2016

1 commit

  • No need to duplicate the same define everywhere. Since
    the only user is stop-machine and the only provider is
    s390, we can use a default implementation of cpu_relax_yield()
    in sched.h.
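
    A minimal sketch of such a default, assuming the fallback simply maps to
    cpu_relax() for every architecture that does not provide its own (s390
    being the only one that does):

        #ifndef cpu_relax_yield
        #define cpu_relax_yield() cpu_relax()
        #endif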

    Suggested-by: Russell King
    Signed-off-by: Christian Borntraeger
    Reviewed-by: David Hildenbrand
    Acked-by: Russell King
    Cc: Andrew Morton
    Cc: Catalin Marinas
    Cc: Heiko Carstens
    Cc: Linus Torvalds
    Cc: Martin Schwidefsky
    Cc: Nicholas Piggin
    Cc: Noam Camus
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Cc: kvm@vger.kernel.org
    Cc: linux-arch@vger.kernel.org
    Cc: linux-s390
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: sparclinux@vger.kernel.org
    Cc: virtualization@lists.linux-foundation.org
    Cc: xen-devel@lists.xenproject.org
    Link: http://lkml.kernel.org/r/1479298985-191589-1-git-send-email-borntraeger@de.ibm.com
    Signed-off-by: Ingo Molnar

    Christian Borntraeger
     

15 Nov, 2016

2 commits

  • Now since fetch_task_cputime() has no other users than task_cputime(),
    its code could be used directly in task_cputime().

    Moreover, since only 2 of the 17 task_cputime() calls use a NULL
    argument, we can add dummy variables to those calls and remove the NULL
    checks from task_cputime().

    Also remove the NULL checks from task_cputime_scaled().

    Signed-off-by: Stanislaw Gruszka
    Signed-off-by: Frederic Weisbecker
    Cc: Benjamin Herrenschmidt
    Cc: Heiko Carstens
    Cc: Linus Torvalds
    Cc: Martin Schwidefsky
    Cc: Michael Neuling
    Cc: Paul Mackerras
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1479175612-14718-5-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Stanislaw Gruszka
     
  • Only s390 and powerpc have hardware facilities that allow measuring
    cputime scaled by frequency. On all other architectures
    utimescaled/stimescaled are equal to utime/stime (however they are
    accounted separately).

    Remove {u,s}timescaled accounting on all architectures except
    powerpc and s390, where those values are explicitly accounted
    in the proper places.

    Signed-off-by: Stanislaw Gruszka
    Signed-off-by: Frederic Weisbecker
    Cc: Benjamin Herrenschmidt
    Cc: Heiko Carstens
    Cc: Linus Torvalds
    Cc: Martin Schwidefsky
    Cc: Michael Neuling
    Cc: Paul Mackerras
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20161031162143.GB12646@redhat.com
    Signed-off-by: Ingo Molnar

    Stanislaw Gruszka
     

08 Oct, 2016

6 commits

  • The global zero page is used to satisfy an anonymous read fault. If
    THP (Transparent HugePage) is enabled then the global huge zero page is
    used. The global huge zero page uses an atomic counter for reference
    counting and is allocated/freed dynamically according to its counter
    value.

    CPU time spent on that counter will greatly increase if there are a lot
    of processes doing anonymous read faults. This patch proposes a way to
    reduce the access to the global counter so that the CPU load can be
    reduced accordingly.

    To do this, a new flag of the mm_struct is introduced:
    MMF_USED_HUGE_ZERO_PAGE. With this flag, the process only needs to touch
    the global counter in two cases:

    1 The first time it uses the global huge zero page;
    2 The time when mm_user of its mm_struct reaches zero.

    Note that right now, the huge zero page is eligible to be freed as soon
    as its last use goes away. With this patch, the page will not be
    eligible to be freed until the exit of the last process from which it
    was ever used.

    And with the use of mm_user, a kthread is not eligible to use the huge
    zero page either. Since no kthread is using the huge zero page today, there
    is no difference after applying this patch. But if that is not desired,
    I can change it to when mm_count reaches zero.
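
    A sketch of the resulting fast path (flag and helper names follow the
    description above and the perf output below; treat it as an
    illustration rather than the exact patch):

        struct page *mm_get_huge_zero_page(struct mm_struct *mm)
        {
                /* Fast path: this mm has already taken its reference. */
                if (test_bit(MMF_USED_HUGE_ZERO_PAGE, &mm->flags))
                        return READ_ONCE(huge_zero_page);

                if (!get_huge_zero_page())
                        return NULL;

                /* Raced with another thread of this mm: drop the extra ref. */
                if (test_and_set_bit(MMF_USED_HUGE_ZERO_PAGE, &mm->flags))
                        put_huge_zero_page();

                return READ_ONCE(huge_zero_page);
        }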

    Case used for test on Haswell EP:

    usemem -n 72 --readonly -j 0x200000 100G

    Which spawns 72 processes and each will mmap 100G anonymous space and
    then do read only access to that space sequentially with a step of 2MB.

    CPU cycles from perf report for base commit:
    54.03% usemem [kernel.kallsyms] [k] get_huge_zero_page
    CPU cycles from perf report for this commit:
    0.11% usemem [kernel.kallsyms] [k] mm_get_huge_zero_page

    Performance (throughput) of the workload for base commit: 1784430792
    Performance (throughput) of the workload for this commit: 4726928591
    164% increase.

    Runtime of the workload for base commit: 707592 us
    Runtime of the workload for this commit: 303970 us
    50% drop.

    Link: http://lkml.kernel.org/r/fe51a88f-446a-4622-1363-ad1282d71385@intel.com
    Signed-off-by: Aaron Lu
    Cc: Sergey Senozhatsky
    Cc: "Kirill A. Shutemov"
    Cc: Dave Hansen
    Cc: Tim Chen
    Cc: Huang Ying
    Cc: Vlastimil Babka
    Cc: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Ebru Akagunduz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aaron Lu
     
  • There are only a few use_mm() users in the kernel right now. Most of them
    write to the target memory, but the vhost driver relies on
    copy_from_user/get_user from a kernel thread context. This makes it
    impossible to reap the memory of an oom victim which shares the mm with
    the vhost kernel thread because it could see a zero page unexpectedly
    and theoretically make an incorrect decision visible outside of the
    killed task context.

    To quote Michael S. Tsirkin:
    : Getting an error from __get_user and friends is handled gracefully.
    : Getting zero instead of a real value will cause userspace
    : memory corruption.

    The vhost kernel thread is bound to an open fd of the vhost device which
    is not tied to the mm owner life cycle in general. The device fd can
    be inherited or passed over to another process which means that we
    really have to be careful about unexpected memory corruption because
    unlike for normal oom victims the result will be visible outside of the
    oom victim context.

    Make sure that no kthread context (users of use_mm) can ever see
    corrupted data because of the oom reaper and hook into the page fault
    path by checking the MMF_UNSTABLE mm flag. __oom_reap_task_mm will set the
    flag before it starts unmapping the address space while the flag is
    checked after the page fault has been handled. If the flag is set then
    SIGBUS is triggered so any g-u-p user will get an error code.

    Regular tasks do not need this protection because all which share the mm
    are killed when the mm is reaped and so the corruption will not outlive
    them.

    This patch shouldn't have any visible effect at this moment because the
    OOM killer doesn't invoke oom reaper for tasks with mm shared with
    kthreads yet.
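
    A hedged sketch of the fault-path check described above, as it might sit
    at the end of handle_mm_fault() (exact placement and form may differ):

        /* A kthread faulting on an mm that the OOM reaper has started to
         * unmap must see SIGBUS, never a silently zero-filled page. */
        if (unlikely((current->flags & PF_KTHREAD) && !(ret & VM_FAULT_ERROR) &&
                     test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)))
                ret = VM_FAULT_SIGBUS;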

    Link: http://lkml.kernel.org/r/1472119394-11342-9-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: "Michael S. Tsirkin"
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • After "oom: keep mm of the killed task available" we can safely detect
    an oom victim by checking task->signal->oom_mm so we do not need the
    signal_struct counter anymore so let's get rid of it.

    This alone wouldn't be sufficient for nommu archs because
    exit_oom_victim doesn't hide the process from the oom killer anymore.
    We can, however, mark the mm with an MMF flag in __mmput. We can reuse
    MMF_OOM_REAPED and rename it to a more generic MMF_OOM_SKIP.

    Link: http://lkml.kernel.org/r/1472119394-11342-6-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Lockdep complains that __mmdrop is not safe from the softirq context:

    =================================
    [ INFO: inconsistent lock state ]
    4.6.0-oomfortification2-00011-geeb3eadeab96-dirty #949 Tainted: G W
    ---------------------------------
    inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
    swapper/1/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
    (pgd_lock){+.?...}, at: pgd_free+0x19/0x6b
    {SOFTIRQ-ON-W} state was registered at:
    __lock_acquire+0xa06/0x196e
    lock_acquire+0x139/0x1e1
    _raw_spin_lock+0x32/0x41
    __change_page_attr_set_clr+0x2a5/0xacd
    change_page_attr_set_clr+0x16f/0x32c
    set_memory_nx+0x37/0x3a
    free_init_pages+0x9e/0xc7
    alternative_instructions+0xa2/0xb3
    check_bugs+0xe/0x2d
    start_kernel+0x3ce/0x3ea
    x86_64_start_reservations+0x2a/0x2c
    x86_64_start_kernel+0x17a/0x18d
    irq event stamp: 105916
    hardirqs last enabled at (105916): free_hot_cold_page+0x37e/0x390
    hardirqs last disabled at (105915): free_hot_cold_page+0x2c1/0x390
    softirqs last enabled at (105878): _local_bh_enable+0x42/0x44
    softirqs last disabled at (105879): irq_exit+0x6f/0xd1

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(pgd_lock);

    lock(pgd_lock);

    *** DEADLOCK ***

    1 lock held by swapper/1/0:
    #0: (rcu_callback){......}, at: rcu_process_callbacks+0x390/0x800

    stack backtrace:
    CPU: 1 PID: 0 Comm: swapper/1 Tainted: G W 4.6.0-oomfortification2-00011-geeb3eadeab96-dirty #949
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
    Call Trace:

    print_usage_bug.part.25+0x259/0x268
    mark_lock+0x381/0x567
    __lock_acquire+0x993/0x196e
    lock_acquire+0x139/0x1e1
    _raw_spin_lock+0x32/0x41
    pgd_free+0x19/0x6b
    __mmdrop+0x25/0xb9
    __put_task_struct+0x103/0x11e
    delayed_put_task_struct+0x157/0x15e
    rcu_process_callbacks+0x660/0x800
    __do_softirq+0x1ec/0x4d5
    irq_exit+0x6f/0xd1
    smp_apic_timer_interrupt+0x42/0x4d
    apic_timer_interrupt+0x8e/0xa0

    arch_cpu_idle+0xf/0x11
    default_idle_call+0x32/0x34
    cpu_startup_entry+0x20c/0x399
    start_secondary+0xfe/0x101

    Moreover, commit a79e53d85683 ("x86/mm: Fix pgd_lock deadlock") was
    explicit that pgd_lock must not be taken from irq context. This
    means that __mmdrop called from free_signal_struct has to be postponed
    to a user context. We already have a similar mechanism for mmput_async
    so we can use it here as well. This is safe because mm_count is pinned
    by mm_users.

    This fixes a bug introduced by "oom: keep mm of the killed task available"
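
    A sketch of how the deferral could look, modelled on the existing
    mmput_async mechanism mentioned above (field and function names are
    illustrative):

        static void mmdrop_async_fn(struct work_struct *work)
        {
                struct mm_struct *mm =
                        container_of(work, struct mm_struct, async_put_work);

                __mmdrop(mm);
        }

        /* Called from free_signal_struct(): defer the pgd teardown to
         * process context instead of running it from the RCU softirq. */
        static void mmdrop_async(struct mm_struct *mm)
        {
                if (unlikely(atomic_dec_and_test(&mm->mm_count))) {
                        INIT_WORK(&mm->async_put_work, mmdrop_async_fn);
                        schedule_work(&mm->async_put_work);
                }
        }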

    Link: http://lkml.kernel.org/r/1472119394-11342-5-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • oom_reap_task has to call exit_oom_victim in order to make sure that the
    oom victim will not block the oom killer forever. This, however,
    opens new problems (e.g. oom_killer_disable exclusion - see commit
    74070542099c ("oom, suspend: fix oom_reaper vs. oom_killer_disable
    race")). Ideally, exit_oom_victim should only be called from the
    victim's context.

    One way to achieve this would be to rely on per mm_struct flags. We
    already have MMF_OOM_REAPED to hide a task from the oom killer since
    "mm, oom: hide mm which is shared with kthread or global init". The
    problem is that the exit path:

    do_exit
    exit_mm
    tsk->mm = NULL;
    mmput
    __mmput
    exit_oom_victim

    doesn't guarantee that exit_oom_victim will get called in a bounded
    amount of time. At least exit_aio depends on IO which might get blocked
    due to lack of memory and who knows what else is lurking there.

    This patch takes a different approach. We remember tsk->mm into the
    signal_struct and bind it to the signal struct life time for all oom
    victims. __oom_reap_task_mm as well as oom_scan_process_thread do not
    have to rely on find_lock_task_mm anymore and they will have a reliable
    reference to the mm struct. As a result all the oom specific
    communication inside the OOM killer can be done via tsk->signal->oom_mm.

    Increasing the signal_struct for something as unlikely as the oom killer
    is far from ideal, but this approach will make the code much more
    reasonable, and long term we might even want to move task->mm into the
    signal_struct anyway. In the next step we might want to make the oom
    killer exclusion and access to memory reserves completely independent,
    which would also be nice.

    Link: http://lkml.kernel.org/r/1472119394-11342-4-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • "mm, oom_reaper: do not attempt to reap a task twice" tried to give the
    OOM reaper one more chance to retry using the MMF_OOM_NOT_REAPABLE flag.
    But the usefulness of the flag is rather limited and has never actually
    been shown in practice. If the flag is set, it means that the holder of
    mm->mmap_sem cannot call up_write(), presumably because it is blocked at
    an unkillable wait waiting for another thread's memory allocation. But
    since one of the threads sharing that mm will queue that mm immediately
    via the task_will_free_mem() shortcut (otherwise, oom_badness() will
    select the same mm again because its oom_score_adj value is unchanged),
    retrying an MMF_OOM_NOT_REAPABLE mm is unlikely to help.

    Let's always set MMF_OOM_REAPED.

    Link: http://lkml.kernel.org/r/1472119394-11342-3-git-send-email-mhocko@kernel.org
    Signed-off-by: Tetsuo Handa
    Signed-off-by: Michal Hocko
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     

04 Oct, 2016

2 commits

  • Pull low-level x86 updates from Ingo Molnar:
    "In this cycle this topic tree has become one of those 'super topics'
    that accumulated a lot of changes:

    - Add CONFIG_VMAP_STACK=y support to the core kernel and enable it on
    x86 - preceded by an array of changes. v4.8 saw preparatory changes
    in this area already - this is the rest of the work. Includes the
    thread stack caching performance optimization. (Andy Lutomirski)

    - switch_to() cleanups and all around enhancements. (Brian Gerst)

    - A large number of dumpstack infrastructure enhancements and an
    unwinder abstraction. The secret long term plan is safe(r) live
    patching plus maybe another attempt at debuginfo based unwinding -
    but all these current bits are standalone enhancements in a frame
    pointer based debug environment as well. (Josh Poimboeuf)

    - More __ro_after_init and const annotations. (Kees Cook)

    - Enable KASLR for the vmemmap memory region. (Thomas Garnier)"

    [ The virtually mapped stack changes are pretty fundamental, and not
    x86-specific per se, even if they are only used on x86 right now. ]

    * 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (70 commits)
    x86/asm: Get rid of __read_cr4_safe()
    thread_info: Use unsigned long for flags
    x86/alternatives: Add stack frame dependency to alternative_call_2()
    x86/dumpstack: Fix show_stack() task pointer regression
    x86/dumpstack: Remove dump_trace() and related callbacks
    x86/dumpstack: Convert show_trace_log_lvl() to use the new unwinder
    oprofile/x86: Convert x86_backtrace() to use the new unwinder
    x86/stacktrace: Convert save_stack_trace_*() to use the new unwinder
    perf/x86: Convert perf_callchain_kernel() to use the new unwinder
    x86/unwind: Add new unwind interface and implementations
    x86/dumpstack: Remove NULL task pointer convention
    fork: Optimize task creation by caching two thread stacks per CPU if CONFIG_VMAP_STACK=y
    sched/core: Free the stack early if CONFIG_THREAD_INFO_IN_TASK
    lib/syscall: Pin the task stack in collect_syscall()
    x86/process: Pin the target stack in get_wchan()
    x86/dumpstack: Pin the target stack when dumping it
    kthread: Pin the stack via try_get_task_stack()/put_task_stack() in to_live_kthread() function
    sched/core: Add try_get_task_stack() and put_task_stack()
    x86/entry/64: Fix a minor comment rebase error
    iommu/amd: Don't put completion-wait semaphore on stack
    ...

    Linus Torvalds
     
  • Pull scheduler changes from Ingo Molnar:
    "The main changes are:

    - irqtime accounting cleanups and enhancements. (Frederic Weisbecker)

    - schedstat debugging enhancements, make it more broadly runtime
    available. (Josh Poimboeuf)

    - More work on asymmetric topology/capacity scheduling. (Morten
    Rasmussen)

    - sched/wait fixes and cleanups. (Oleg Nesterov)

    - PELT (per entity load tracking) improvements. (Peter Zijlstra)

    - Rewrite and enhance select_idle_siblings(). (Peter Zijlstra)

    - sched/numa enhancements/fixes (Rik van Riel)

    - sched/cputime scalability improvements (Stanislaw Gruszka)

    - Load calculation arithmetics fixes. (Dietmar Eggemann)

    - sched/deadline enhancements (Tommaso Cucinotta)

    - Fix utilization accounting when switching to the SCHED_NORMAL
    policy. (Vincent Guittot)

    - ... plus misc cleanups and enhancements"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits)
    sched/irqtime: Consolidate irqtime flushing code
    sched/irqtime: Consolidate accounting synchronization with u64_stats API
    u64_stats: Introduce IRQs disabled helpers
    sched/irqtime: Remove needless IRQs disablement on kcpustat update
    sched/irqtime: No need for preempt-safe accessors
    sched/fair: Fix min_vruntime tracking
    sched/debug: Add SCHED_WARN_ON()
    sched/core: Fix set_user_nice()
    sched/fair: Introduce set_curr_task() helper
    sched/core, ia64: Rename set_curr_task()
    sched/core: Fix incorrect utilization accounting when switching to fair class
    sched/core: Optimize SCHED_SMT
    sched/core: Rewrite and improve select_idle_siblings()
    sched/core: Replace sd_busy/nr_busy_cpus with sched_domain_shared
    sched/core: Introduce 'struct sched_domain_shared'
    sched/core: Restructure destroy_sched_domain()
    sched/core: Remove unused @cpu argument from destroy_sched_domain*()
    sched/wait: Introduce init_wait_entry()
    sched/wait: Avoid abort_exclusive_wait() in __wait_on_bit_lock()
    sched/wait: Avoid abort_exclusive_wait() in ___wait_event()
    ...

    Linus Torvalds
     

30 Sep, 2016

4 commits

  • Rename the ia64 only set_curr_task() function to free up the name.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • select_idle_siblings() is a known pain point for a number of
    workloads; it either does too much or not enough, and sometimes it just
    gets it plain wrong.

    This rewrite attempts to address a number of issues (but sadly not
    all).

    The current code does an unconditional sched_domain iteration with the
    intent of finding an idle core (on SMT hardware). The problems
    which this patch tries to address are:

    - it's pointless to look for idle cores if the machine is really busy;
    at that point you're just wasting cycles.

    - its behaviour is inconsistent between SMT and !SMT hardware in
    that !SMT hardware ends up doing a scan for any idle CPU in the LLC
    domain, while SMT hardware does a scan for idle cores and if that
    fails, falls back to a scan for idle threads on the 'target' core.

    The new code replaces the sched_domain scan with 3 explicit scans:

    1) search for an idle core in the LLC
    2) search for an idle CPU in the LLC
    3) search for an idle thread in the 'target' core

    where 1 and 3 are conditional on SMT support and 1 and 2 have runtime
    heuristics to skip the step.

    Step 1) is conditional on sd_llc_shared->has_idle_cores; when a cpu
    goes idle and sd_llc_shared->has_idle_cores is false, we scan all SMT
    siblings of the CPU going idle. Similarly, we clear
    sd_llc_shared->has_idle_cores when we fail to find an idle core.

    Step 2) tracks the average cost of the scan and compares this to the
    average idle time guestimate for the CPU doing the wakeup. There is a
    significant fudge factor involved to deal with the variability of the
    averages. Esp. hackbench was sensitive to this.

    Step 3) is unconditional; we assume (also per step 1) that scanning
    all SMT siblings in a core is 'cheap'.

    With this, SMT systems gain step 2, which cures a few benchmarks --
    notably one from Facebook.

    One 'feature' of the sched_domain iteration, which we preserve in the
    new code, is that it starts scanning from the 'target' CPU instead of
    scanning the cpumask in cpu id order. This reduces the chance that
    multiple CPUs in the LLC scanning for an idle CPU all gang up and find
    the same one. The downside is that tasks can end up hopping across the
    LLC for no apparent reason.
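
    A sketch of the three-step structure (helper names are illustrative; the
    runtime heuristics for steps 1 and 2 live inside the helpers):

        static int select_idle_sibling(struct task_struct *p, int target)
        {
                struct sched_domain *sd;
                int i;

                sd = rcu_dereference(per_cpu(sd_llc, target));
                if (!sd)
                        return target;

                i = select_idle_core(p, sd, target);    /* 1) idle core in the LLC */
                if ((unsigned int)i < nr_cpumask_bits)
                        return i;

                i = select_idle_cpu(p, sd, target);     /* 2) idle CPU in the LLC */
                if ((unsigned int)i < nr_cpumask_bits)
                        return i;

                i = select_idle_smt(p, sd, target);     /* 3) idle thread in the target core */
                if ((unsigned int)i < nr_cpumask_bits)
                        return i;

                return target;
        }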

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Move the nr_busy_cpus thing from its hacky sd->parent->groups->sgc
    location into the much more natural sched_domain_shared location.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Since struct sched_domain is strictly per cpu, introduce a structure
    that is shared between all 'identical' sched_domains.

    Limit to SD_SHARE_PKG_RESOURCES domains for now, as we'll only use it
    for shared cache state; if another use comes up later we can easily
    relax this.

    While the sched_groups are normally shared between CPUs, these are
    not natural to use when we need some shared state on a domain level --
    since that would require the domain to have a parent, which is not a
    given.
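
    Taken together with the related patches above (nr_busy_cpus,
    has_idle_cores), the shared structure ends up looking roughly like this
    (a sketch, not necessarily the exact final layout):

        struct sched_domain_shared {
                atomic_t        ref;            /* instances pointing at this  */
                atomic_t        nr_busy_cpus;   /* moved out of the sgc        */
                int             has_idle_cores; /* hint for select_idle_core() */
        };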

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

22 Sep, 2016

2 commits

  • On fully preemptible kernels _cond_resched() is pointless, so avoid
    emitting any code for it.
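
    A minimal sketch of how the helper can compile away on fully preemptible
    kernels (assuming CONFIG_PREEMPT gating):

        #ifndef CONFIG_PREEMPT
        extern int _cond_resched(void);
        #else
        /* Full preemption: explicit reschedule points are unnecessary. */
        static inline int _cond_resched(void) { return 0; }
        #endif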

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mikulas Patocka
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Oleg noted that by making do_exit() use __schedule() for the TASK_DEAD
    context switch, we can avoid the TASK_DEAD special case currently in
    __schedule() because that avoids the extra preempt_disable() from
    schedule().

    In order to facilitate this, create a do_task_dead() helper which we
    place in the scheduler code, such that it can access __schedule().

    Also add some __noreturn annotations to the functions, there's no
    coming back from do_exit().
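
    A hedged sketch of the helper described above (comments reflect the
    intent; the exact body in the tree may differ slightly):

        void __noreturn do_task_dead(void)
        {
                /* Causes the final put_task_struct() in finish_task_switch(). */
                __set_current_state(TASK_DEAD);

                /* Tell the freezer to ignore us. */
                current->flags |= PF_NOFREEZE;

                __schedule(false);
                BUG();

                /* Avoid "noreturn function does return" warnings. */
                for (;;)
                        cpu_relax();
        }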

    Suggested-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Cheng Chao
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: akpm@linux-foundation.org
    Cc: chris@chris-wilson.co.uk
    Cc: tj@kernel.org
    Link: http://lkml.kernel.org/r/20160913163729.GB5012@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

16 Sep, 2016

2 commits

  • We currently keep every task's stack around until the task_struct
    itself is freed. This means that we keep the stack allocation alive
    for longer than necessary and that, under load, we free stacks in
    big batches whenever RCU drops the last task reference. Neither of
    these is good for reuse of cache-hot memory, and freeing in batches
    prevents us from usefully caching small numbers of vmalloced stacks.

    On architectures that have thread_info on the stack, we can't easily
    change this, but on architectures that set THREAD_INFO_IN_TASK, we
    can free it as soon as the task is dead.

    Signed-off-by: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Jann Horn
    Cc: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/08ca06cde00ebed0046c5d26cbbf3fbb7ef5b812.1474003868.git.luto@kernel.org
    Signed-off-by: Ingo Molnar

    Andy Lutomirski
     
  • There are a few places in the kernel that access stack memory
    belonging to a different task. Before we can start freeing task
    stacks before the task_struct is freed, we need a way for those code
    paths to pin the stack.
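
    A minimal usage sketch of the pinning API described above:

        void *stack = try_get_task_stack(task);

        if (stack) {
                /* Safe to read the remote task's stack here: it cannot be
                 * freed until put_task_stack() drops our reference. */
                put_task_stack(task);
        }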

    Signed-off-by: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Jann Horn
    Cc: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/17a434f50ad3d77000104f21666575e10a9c1fbd.1474003868.git.luto@kernel.org
    Signed-off-by: Ingo Molnar

    Andy Lutomirski
     

15 Sep, 2016

1 commit

  • If an arch opts in by setting CONFIG_THREAD_INFO_IN_TASK,
    then thread_info is defined as a single 'u32 flags' and is the first
    entry of task_struct. thread_info::task is removed (it serves no
    purpose if thread_info is embedded in task_struct), and
    thread_info::cpu gets its own slot in task_struct.

    This is heavily based on a patch written by Linus.
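
    A sketch of the resulting layout when the arch opts in (based on the
    description above; the unrelated fields are elided):

        struct thread_info {
                u32 flags;                      /* low-level flags */
        };

        struct task_struct {
        #ifdef CONFIG_THREAD_INFO_IN_TASK
                /* Must be first, so the task and thread_info pointers can
                 * be used interchangeably. */
                struct thread_info      thread_info;
        #endif
                volatile long           state;
                /* ... */
        #ifdef CONFIG_THREAD_INFO_IN_TASK
                unsigned int            cpu;    /* was thread_info::cpu */
        #endif
                /* ... */
        };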

    Originally-from: Linus Torvalds
    Signed-off-by: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Jann Horn
    Cc: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/a0898196f0476195ca02713691a5037a14f2aac5.1473801993.git.luto@kernel.org
    Signed-off-by: Ingo Molnar

    Andy Lutomirski
     

14 Sep, 2016

1 commit

  • Testing indicates that it is possible to improve performance
    significantly without increasing energy consumption too much by
    teaching cpufreq governors to bump up the CPU performance level if
    the in_iowait flag is set for the task in enqueue_task_fair().

    For this purpose, define a new cpufreq_update_util() flag
    SCHED_CPUFREQ_IOWAIT and modify enqueue_task_fair() to pass that
    flag to cpufreq_update_util() in the in_iowait case. That generally
    requires cpufreq_update_util() to be called directly from there,
    because update_load_avg() may not be invoked in that case.
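
    A hedged sketch of the enqueue-path hook this describes (the exact
    helper used to reach cpufreq_update_util() here is an assumption):

        static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
        {
                /*
                 * A task waking from I/O wait may not go through
                 * update_load_avg(), so poke the cpufreq governor directly
                 * and tell it why.
                 */
                if (p->in_iowait)
                        cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);

                /* ... existing enqueue work ... */
        }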

    Signed-off-by: Rafael J. Wysocki
    Looks-good-to: Steve Muckle
    Acked-by: Peter Zijlstra (Intel)

    Rafael J. Wysocki
     

24 Aug, 2016

1 commit

  • If CONFIG_VMAP_STACK=y is selected, kernel stacks are allocated with
    __vmalloc_node_range().

    Grsecurity has had a similar feature (called GRKERNSEC_KSTACKOVERFLOW=y)
    for a long time.

    Signed-off-by: Andy Lutomirski
    Acked-by: Michal Hocko
    Cc: Alexander Potapenko
    Cc: Andrey Ryabinin
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: Dmitry Vyukov
    Cc: H. Peter Anvin
    Cc: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/14c07d4fd173a5b117f51e8b939f9f4323e39899.1470907718.git.luto@kernel.org
    [ Minor edits. ]
    Signed-off-by: Ingo Molnar

    Andy Lutomirski
     

18 Aug, 2016

1 commit

  • Add a topology flag to the sched_domain hierarchy indicating the lowest
    domain level where the full range of CPU capacities is represented by
    the domain members for asymmetric capacity topologies (e.g. ARM
    big.LITTLE).

    The flag is intended to indicate that extra care should be taken when
    placing tasks on CPUs and this level spans all the different types of
    CPUs found in the system (no need to look further up the domain
    hierarchy). This information is currently only available through
    iterating through the capacities of all the CPUs at parent levels in the
    sched_domain hierarchy.

    SD 2       [  0    1    2    3 ]    SD_ASYM_CPUCAPACITY

    SD 1       [  0    1 ] [ 2    3 ]   !SD_ASYM_CPUCAPACITY

    CPU:          0    1    2    3
    capacity:   756  756 1024 1024

    If the topology in the example above is duplicated to create an eight
    CPU example with third sched_domain level on top (SD 3), this level
    should not have the flag set (!SD_ASYM_CPUCAPACITY) as its two groups
    would both have all CPU capacities represented within them.

    Signed-off-by: Morten Rasmussen
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: dietmar.eggemann@arm.com
    Cc: freedom.tan@mediatek.com
    Cc: keita.kobayashi.ym@renesas.com
    Cc: mgalbraith@suse.de
    Cc: sgurrappadi@nvidia.com
    Cc: vincent.guittot@linaro.org
    Cc: yuyang.du@intel.com
    Link: http://lkml.kernel.org/r/1469453670-2660-6-git-send-email-morten.rasmussen@arm.com
    Signed-off-by: Ingo Molnar

    Morten Rasmussen
     

17 Aug, 2016

1 commit

  • It is useful to know the reason why cpufreq_update_util() has just
    been called and that can be passed as flags to cpufreq_update_util()
    and to the ->func() callback in struct update_util_data. However,
    doing that in addition to passing the util and max arguments they
    already take would be clumsy, so avoid it.

    Instead, use the observation that the schedutil governor is part
    of the scheduler proper, so it can access scheduler data directly.
    This allows the util and max arguments of cpufreq_update_util()
    and the ->func() callback in struct update_util_data to be replaced
    with a flags one, but schedutil has to be modified to follow.

    Thus make the schedutil governor obtain the CFS utilization
    information from the scheduler and use the "RT" and "DL" flags
    instead of the special utilization value of ULONG_MAX to track
    updates from the RT and DL sched classes. Make it non-modular
    too to avoid having to export scheduler variables to modules at
    large.

    Next, update all of the other users of cpufreq_update_util()
    and the ->func() callback in struct update_util_data accordingly.
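
    The reworked callback shape this implies, sketched with illustrative
    flag values:

        struct update_util_data {
                void (*func)(struct update_util_data *data, u64 time,
                             unsigned int flags);
        };

        /* Reasons for the update, passed instead of util/max: */
        #define SCHED_CPUFREQ_RT        (1U << 0)
        #define SCHED_CPUFREQ_DL        (1U << 1)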

    Suggested-by: Peter Zijlstra
    Signed-off-by: Rafael J. Wysocki
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Viresh Kumar

    Rafael J. Wysocki
     

10 Aug, 2016

2 commits

  • This message is currently really useless since it always prints a value
    that comes from the printk() we just did, e.g.:

    BUG: sleeping function called from invalid context at mm/slab.h:388
    in_atomic(): 0, irqs_disabled(): 0, pid: 31996, name: trinity-c1
    Preemption disabled at:[] down_trylock+0x13/0x80

    BUG: sleeping function called from invalid context at include/linux/freezer.h:56
    in_atomic(): 0, irqs_disabled(): 0, pid: 31996, name: trinity-c1
    Preemption disabled at:[] console_unlock+0x2f7/0x930

    Here, both down_trylock() and console_unlock() is somewhere in the
    printk() path.

    We should save the value before calling printk() and use the saved value
    instead. That immediately reveals the offending callsite:

    BUG: sleeping function called from invalid context at mm/slab.h:388
    in_atomic(): 0, irqs_disabled(): 0, pid: 14971, name: trinity-c2
    Preemption disabled at:[] rhashtable_walk_start+0x46/0x150

    Bug report:

    http://marc.info/?l=linux-netdev&m=146925979821849&w=2

    Signed-off-by: Vegard Nossum
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rusty Russel
    Cc: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Vegard Nossum
     
  • It seems that this one escaped Nico's renaming of cpu_power to
    cpu_capacity a while back.

    Signed-off-by: Morten Rasmussen
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: dietmar.eggemann@arm.com
    Cc: linux-kernel@vger.kernel.org
    Cc: mgalbraith@suse.de
    Cc: vincent.guittot@linaro.org
    Cc: yuyang.du@intel.com
    Link: http://lkml.kernel.org/r/1466615004-3503-2-git-send-email-morten.rasmussen@arm.com
    Signed-off-by: Ingo Molnar

    Morten Rasmussen