23 Feb, 2017

1 commit

  • These macros can be reused by governors which don't use the common
    governor code present in cpufreq_governor.c, so they should be moved
    to the relevant header.

    Now that they are being moved to the right header file, reuse them in
    the schedutil governor as well (which required renaming the show/store
    routines).

    Also create a gov_attr_wo() macro for write-only sysfs files; this will
    be used by the Interactive governor in a later patch.
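
    As a rough userspace sketch of the pattern these macros follow (the
    governor_attr struct, the rate_limit_us knob and the helpers below are
    made up for the example; this is not the kernel's actual definition),
    the name of the sysfs file is token-pasted onto its _show()/_store()
    routines:

    #include <stdio.h>

    struct governor_attr {
        const char *name;
        int (*show)(char *buf, size_t len);
        int (*store)(const char *buf, size_t len);
    };

    #define gov_attr_ro(_name) \
        static struct governor_attr _name = { #_name, _name##_show, NULL }
    #define gov_attr_rw(_name) \
        static struct governor_attr _name = { #_name, _name##_show, _name##_store }
    #define gov_attr_wo(_name) \
        static struct governor_attr _name = { #_name, NULL, _name##_store }

    static unsigned int rate_limit_us_val = 500;

    static int rate_limit_us_show(char *buf, size_t len)
    {
        return snprintf(buf, len, "%u\n", rate_limit_us_val);
    }

    static int rate_limit_us_store(const char *buf, size_t len)
    {
        (void)len;
        return sscanf(buf, "%u", &rate_limit_us_val) == 1 ? 0 : -1;
    }

    gov_attr_rw(rate_limit_us);

    int main(void)
    {
        char buf[32];

        rate_limit_us.store("250\n", 5);
        rate_limit_us.show(buf, sizeof(buf));
        printf("%s = %s", rate_limit_us.name, buf);
        return 0;
    }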

    Signed-off-by: Viresh Kumar

    Viresh Kumar
     

24 Nov, 2016

1 commit

  • Michael Kerrisk reported:

    > Regarding the previous paragraph... My tests indicate
    > that writing *any* value to the autogroup [nice priority level]
    > file causes the task group to get a lower priority.

    Because autogroup didn't call the then meaningless scale_load()...

    Autogroup nice level adjustment has been broken ever since load
    resolution was increased for 64-bit kernels. Use scale_load() to
    scale group weight.
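
    As a rough userspace illustration of why the scaling matters (assuming
    the usual extra 10 bits of load resolution on 64-bit kernels; the macros
    and numbers below are stand-ins, not the kernel's definitions), storing
    the raw nice-0 weight without scale_load() leaves the group roughly
    1024x lighter than intended:

    #include <stdio.h>

    #define SCHED_FIXEDPOINT_SHIFT  10  /* extra load resolution on 64-bit */
    #define scale_load(w)       ((unsigned long)(w) << SCHED_FIXEDPOINT_SHIFT)
    #define scale_load_down(w)  ((unsigned long)(w) >> SCHED_FIXEDPOINT_SHIFT)

    int main(void)
    {
        unsigned long nice0_weight = 1024;  /* nice 0 in the weight table */

        /* Task weights are already scaled up; an unscaled group weight is
         * therefore ~1024x too small relative to them. */
        printf("unscaled group weight: %lu\n", nice0_weight);
        printf("scaled group weight:   %lu\n", scale_load(nice0_weight));
        printf("round trip:            %lu\n",
               scale_load_down(scale_load(nice0_weight)));
        return 0;
    }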

    Michael Kerrisk tested this patch to fix the problem:

    > Applied and tested against 4.9-rc6 on an Intel u7 (4 cores).
    > Test setup:
    >
    > Terminal window 1: running 40 CPU burner jobs
    > Terminal window 2: running 40 CPU burner jobs
    > Terminal window 3: running 1 CPU burner job
    >
    > Demonstrated that:
    > * Writing "0" to the autogroup file for TW1 now causes no change
    > to the rate at which the processes on the terminals consume CPU.
    > * Writing -20 to the autogroup file for TW1 caused those processes
    > to get the lion's share of CPU while TW2 and TW3 get a tiny amount.
    > * Writing -20 to the autogroup files for TW1 and TW3 allowed the
    > process on TW3 to get as much CPU as it was getting when
    > the autogroup nice values for both terminals were 0.

    Reported-by: Michael Kerrisk
    Tested-by: Michael Kerrisk
    Signed-off-by: Mike Galbraith
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-man
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/1479897217.4306.6.camel@gmx.de
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     

22 Nov, 2016

2 commits

  • The exiting task must not keep using its autogroup, exactly because
    for_each_thread() in autogroup_move_group() can't see it and update its
    ->sched_task_group before _put() and a possible free().

    So the exiting task needs another sched_move_task() before exit_notify(),
    and we need to re-introduce the PF_EXITING (or similar) check removed by
    the previous change for another reason.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: hartsjc@redhat.com
    Cc: vbendel@redhat.com
    Cc: vlovejoy@redhat.com
    Link: http://lkml.kernel.org/r/20161114184612.GA15968@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • The PF_EXITING check in task_wants_autogroup() is no longer needed. Remove
    it, but see the next patch.

    However, the comment is correct in that autogroup_move_group() must always
    change task_group() for every thread, so the sysctl_sched_autogroup_enabled
    check is very wrong; we can race with cgroups, and even sys_setsid() is not
    safe because a task running with task_group() == ag->tg must participate in
    refcounting:

    #include <assert.h>
    #include <fcntl.h>
    #include <signal.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int main(void)
    {
        int sctl = open("/proc/sys/kernel/sched_autogroup_enabled", O_WRONLY);

        assert(sctl > 0);
        if (fork()) {
            wait(NULL); // destroy the child's ag/tg
            pause();
        }

        assert(pwrite(sctl, "1\n", 2, 0) == 2);
        assert(setsid() > 0);
        if (fork())
            pause();

        kill(getppid(), SIGKILL);
        sleep(1);

        // The child has gone, the grandchild runs with kref == 1
        assert(pwrite(sctl, "0\n", 2, 0) == 2);
        assert(setsid() > 0);

        // runs with the freed ag/tg
        for (;;)
            sleep(1);

        return 0;
    }

    crashes the kernel. It doesn't really need the sleep(1); it doesn't matter
    whether autogroup_move_group() actually frees the task_group or this
    happens later.

    Reported-by: Vern Lovejoy
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: hartsjc@redhat.com
    Cc: vbendel@redhat.com
    Link: http://lkml.kernel.org/r/20161114184609.GA15965@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     

03 Nov, 2016

2 commits

  • In sched_show_task() we print out a useless hex number, not even a
    symbol, and there's a big question mark over whether this even makes
    sense anyway. I suspect we should just remove it all.

    Signed-off-by: Linus Torvalds
    Acked-by: Andy Lutomirski
    Cc: Peter Zijlstra
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: bp@alien8.de
    Cc: brgerst@gmail.com
    Cc: jann@thejh.net
    Cc: keescook@chromium.org
    Cc: linux-api@vger.kernel.org
    Cc: tycho.andersen@canonical.com
    Link: http://lkml.kernel.org/r/CA+55aFzphURPFzAvU4z6Moy7ZmimcwPuUdYU8bj9z0J+S8X1rw@mail.gmail.com
    Signed-off-by: Ingo Molnar

    Linus Torvalds
     
  • When CONFIG_THREAD_INFO_IN_TASK=y, it is possible that an exited thread
    remains in the task list after its stack pointer was already set to NULL.

    Therefore, thread_saved_pc() and stack_not_used() in sched_show_task()
    will trigger a NULL pointer dereference if an attempt is made to dump
    such a thread's traces (e.g. SysRq-t, khungtaskd).

    Since show_stack() in sched_show_task() calls try_get_task_stack() and
    sched_show_task() is called from interrupt context, calling
    try_get_task_stack() from sched_show_task() will be safe as well.
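
    A minimal userspace sketch of the guard this relies on (the task struct,
    helpers and output below are illustrative, not the kernel's code): pin
    the stack before touching it and skip the dump if it is already gone:

    #include <stdio.h>

    struct task {
        void *stack;        /* NULL once an exited thread's stack is freed */
        int   stack_refcount;
    };

    static void *try_get_task_stack_sketch(struct task *t)
    {
        if (!t->stack)          /* already released: nothing to pin */
            return NULL;
        t->stack_refcount++;    /* pin the stack while we dump it */
        return t->stack;
    }

    static void put_task_stack_sketch(struct task *t)
    {
        t->stack_refcount--;
    }

    static void sched_show_task_sketch(struct task *t)
    {
        void *stack = try_get_task_stack_sketch(t);

        if (!stack) {
            printf("thread has exited, stack gone: skipping dump\n");
            return;
        }
        printf("dumping stack at %p\n", stack);
        put_task_stack_sketch(t);
    }

    int main(void)
    {
        static char stack_mem[8192];
        struct task live = { stack_mem, 1 };
        struct task dead = { NULL, 0 };

        sched_show_task_sketch(&live);
        sched_show_task_sketch(&dead);
        return 0;
    }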

    Signed-off-by: Tetsuo Handa
    Acked-by: Andy Lutomirski
    Acked-by: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: bp@alien8.de
    Cc: brgerst@gmail.com
    Cc: jann@thejh.net
    Cc: keescook@chromium.org
    Cc: linux-api@vger.kernel.org
    Cc: tycho.andersen@canonical.com
    Link: http://lkml.kernel.org/r/201611021950.FEJ34368.HFFJOOMLtQOVSF@I-love.SAKURA.ne.jp
    Signed-off-by: Ingo Molnar

    Tetsuo Handa
     

29 Oct, 2016

1 commit

  • …-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

    Pull objtool, irq and scheduler fixes from Ingo Molnar:
    "One more objtool fixlet for GCC6 code generation patterns, an irq
    DocBook fix and an unused variable warning fix in the scheduler"

    * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    objtool: Fix rare switch jump table pattern detection

    * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    doc: Add missing parameter for msi_setup

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/fair: Remove unused but set variable 'rq'

    Linus Torvalds
     

28 Oct, 2016

1 commit

  • The per-zone waitqueues exist because of a scalability issue with the
    page waitqueues on some NUMA machines, but it turns out that they hurt
    normal loads, and now with the vmalloced stacks they also end up
    breaking gfs2 that uses a bit_wait on a stack object:

    wait_on_bit(&gh->gh_iflags, HIF_WAIT, TASK_UNINTERRUPTIBLE)

    where 'gh' can be a reference to the local variable 'mount_gh' on the
    stack of fill_super().

    The reason the per-zone hash table breaks for this case is that there is
    no "zone" for virtual allocations, and trying to look up the physical
    page to get at it will fail (with a BUG_ON()).

    It turns out that I actually complained to the mm people about the
    per-zone hash table for another reason just a month ago: the zone lookup
    also hurts the regular use of "unlock_page()" a lot, because the zone
    lookup ends up forcing several unnecessary cache misses and generates
    horrible code.

    As part of that earlier discussion, we had a much better solution for
    the NUMA scalability issue - by just making the page lock have a
    separate contention bit, the waitqueue doesn't even have to be looked at
    for the normal case.

    Peter Zijlstra already has a patch for that, but let's see if anybody
    even notices. In the meantime, let's fix the actual gfs2 breakage by
    simplifying the bitlock waitqueues and removing the per-zone issue.
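
    A rough userspace sketch of the simpler scheme (table size, hash constant
    and helper names are made up for the example): hash the address of the
    word that holds the bit into one global table of waitqueues, which works
    for stack and vmalloc addresses alike because no page or zone lookup is
    involved:

    #include <stdint.h>
    #include <stdio.h>

    #define WAIT_TABLE_BITS 8
    #define WAIT_TABLE_SIZE (1u << WAIT_TABLE_BITS)

    struct wait_queue_head { int waiters; };
    static struct wait_queue_head bit_wait_table[WAIT_TABLE_SIZE];

    static struct wait_queue_head *bit_waitqueue_sketch(const void *word, int bit)
    {
        /* Fold the word's address and the bit number into one key ... */
        uint64_t val = (uint64_t)(uintptr_t)word << 3 | (unsigned int)bit;

        /* ... and hash it into the global table; no zone lookup needed. */
        return bit_wait_table +
               ((val * 0x9E3779B97F4A7C15ull) >> (64 - WAIT_TABLE_BITS));
    }

    int main(void)
    {
        unsigned long gh_iflags_on_stack = 0;

        printf("waitqueue slot: %td\n",
               bit_waitqueue_sketch(&gh_iflags_on_stack, 1) - bit_wait_table);
        return 0;
    }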

    Reported-by: Andreas Gruenbacher
    Tested-by: Bob Peterson
    Acked-by: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Andy Lutomirski
    Cc: Steven Whitehouse
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

27 Oct, 2016

1 commit

  • Since commit:

    8663e24d56dc ("sched/fair: Reorder cgroup creation code")

    ... the variable 'rq' in alloc_fair_sched_group() is set but no longer used.
    Remove it to fix the following GCC warning when building with 'W=1':

    kernel/sched/fair.c:8842:13: warning: variable ‘rq’ set but not used [-Wunused-but-set-variable]

    Signed-off-by: Tobias Klauser
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20161026113704.8981-1-tklauser@distanz.ch
    Signed-off-by: Ingo Molnar

    Tobias Klauser
     

19 Oct, 2016

2 commits

  • A scheduler performance regression has been reported by Joseph Salisbury,
    which he bisected back to:

    3d30544f0212 ("sched/fair: Apply more PELT fixes")

    The regression triggers when several levels of task groups are involved
    (read: SystemD) and cpu_possible_mask != cpu_present_mask.

    The root cause is that group entity's load (tg_child->se[i]->avg.load_avg)
    is initialized to scale_load_down(se->load.weight). During the creation of
    a child task group, its group entities on possible CPUs are attached to
    parent's cfs_rq (tg_parent) and their loads are added to the parent's load
    (tg_parent->load_avg) with update_tg_load_avg().

    But only the load on online CPUs will then be updated to reflect real load,
    whereas load on other CPUs will stay at the initial value.

    The result is that tg_parent->load_avg is higher than the real load, the
    weight of group entities (tg_parent->se[i]->load.weight) on online CPUs is
    smaller than it should be, and the task group gets less running time than
    it could expect.

    ( This situation can be detected with /proc/sched_debug. The ".tg_load_avg"
    of the task group will be much higher than the sum of ".tg_load_avg_contrib"
    of the task group's online cfs_rqs. )

    The load of group entities doesn't have to be initialized to anything other
    than 0, because their load will increase when an entity is attached.
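
    A back-of-the-envelope illustration of the effect (all numbers below are
    made up; only the shape of the calculation matters): with 8 possible but
    only 4 online CPUs, the stale initial contributions dominate the parent's
    load_avg:

    #include <stdio.h>

    int main(void)
    {
        const int possible_cpus = 8, online_cpus = 4;
        const long init_contrib = 1024; /* scale_load_down(se->load.weight) */
        const long real_contrib = 100;  /* what online CPUs settle down to */

        /* Before the fix: every possible CPU starts out contributing
         * init_contrib, but only the online ones are ever updated. */
        long inflated = online_cpus * real_contrib +
                        (possible_cpus - online_cpus) * init_contrib;

        /* After the fix: entities start at 0 and grow once attached. */
        long correct = online_cpus * real_contrib;

        printf("tg_parent->load_avg, stale init: %ld\n", inflated);
        printf("tg_parent->load_avg, fixed:      %ld\n", correct);
        return 0;
    }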

    Reported-by: Joseph Salisbury
    Tested-by: Dietmar Eggemann
    Signed-off-by: Vincent Guittot
    Acked-by: Peter Zijlstra
    Cc: # 4.8.x
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: joonwoop@codeaurora.org
    Fixes: 3d30544f0212 ("sched/fair: Apply more PELT fixes")
    Link: http://lkml.kernel.org/r/1476881123-10159-1-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar

    Vincent Guittot
     
  • Pull scheduler fix from Ingo Molnar:
    "Fix a crash that can trigger when racing with CPU hotplug: we didn't
    use sched-domains data structures carefully enough in select_idle_cpu()"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/fair: Fix sched domains NULL dereference in select_idle_sibling()

    Linus Torvalds
     

16 Oct, 2016

1 commit

  • Pull gcc plugins update from Kees Cook:
    "This adds a new gcc plugin named "latent_entropy". It is designed to
    extract as much possible uncertainty from a running system at boot
    time as possible, hoping to capitalize on any possible variation in
    CPU operation (due to runtime data differences, hardware differences,
    SMP ordering, thermal timing variation, cache behavior, etc).

    At the very least, this plugin is a much more comprehensive example
    for how to manipulate kernel code using the gcc plugin internals"

    * tag 'gcc-plugins-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    latent_entropy: Mark functions with __latent_entropy
    gcc-plugins: Add latent_entropy plugin

    Linus Torvalds
     

15 Oct, 2016

1 commit

  • Pull cgroup updates from Tejun Heo:

    - tracepoints for basic cgroup management operations added

    - kernfs and cgroup path formatting functions updated to behave in the
    style of strlcpy()

    - non-critical bug fixes

    * 'for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    blkcg: Unlock blkcg_pol_mutex only once when cpd == NULL
    cgroup: fix error handling regressions in proc_cgroup_show() and cgroup_release_agent()
    cpuset: fix error handling regression in proc_cpuset_show()
    cgroup: add tracepoints for basic operations
    cgroup: make cgroup_path() and friends behave in the style of strlcpy()
    kernfs: remove kernfs_path_len()
    kernfs: make kernfs_path*() behave in the style of strlcpy()
    kernfs: add dummy implementation of kernfs_path_from_node()

    Linus Torvalds
     

11 Oct, 2016

2 commits

  • Commit:

    10e2f1acd01 ("sched/core: Rewrite and improve select_idle_siblings()")

    ... improved select_idle_sibling(), but also triggered a regression (crash)
    during CPU-hotplug:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
    IP: [] select_idle_sibling+0x1c2/0x4f0
    Call Trace:

    select_task_rq_fair+0x749/0x930
    ? select_task_rq_fair+0xb4/0x930
    ? __lock_is_held+0x54/0x70
    try_to_wake_up+0x19a/0x5b0
    default_wake_function+0x12/0x20
    autoremove_wake_function+0x12/0x40
    __wake_up_common+0x55/0x90
    __wake_up+0x39/0x50
    wake_up_klogd_work_func+0x40/0x60
    irq_work_run_list+0x57/0x80
    irq_work_run+0x2c/0x30
    smp_irq_work_interrupt+0x2e/0x40
    irq_work_interrupt+0x96/0xa0

    ? _raw_spin_unlock_irqrestore+0x45/0x80
    try_to_wake_up+0x4a/0x5b0
    wake_up_state+0x10/0x20
    __kthread_unpark+0x67/0x70
    kthread_unpark+0x22/0x30
    cpuhp_online_idle+0x3e/0x70
    cpu_startup_entry+0x6a/0x450
    start_secondary+0x154/0x180

    This can be reproduced by running the ftrace test case of kselftest; the
    test case will hot-unplug a CPU and the CPU will attach to the NULL
    sched-domain during scheduler teardown.

    Step 2 of the select_idle_siblings() rewrite:

    | Step 2) tracks the average cost of the scan and compares this to the
    | average idle time guestimate for the CPU doing the wakeup.

    If the CPU doing the wakeup is the CPU being hot-unplugged, then the NULL
    sched domain will be dereferenced to acquire the average cost of the scan.

    This patch fixes it by failing the search for an idle CPU in the LLC if
    the sched domain is NULL.

    Tested-by: Catalin Marinas
    Signed-off-by: Wanpeng Li
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1475971443-3187-1-git-send-email-wanpeng.li@hotmail.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
  • The __latent_entropy gcc attribute can be used only on functions and
    variables. If it is on a function then the plugin will instrument it for
    gathering control-flow entropy. If the attribute is on a variable then
    the plugin will initialize it with random contents. The variable must
    be an integer, an integer array type or a structure with integer fields.

    These specific functions have been selected because they are init
    functions (to help gather boot-time entropy), are called at unpredictable
    times, or they have variable loops, each of which provides some level of
    latent entropy.

    Signed-off-by: Emese Revfy
    [kees: expanded commit message]
    Signed-off-by: Kees Cook

    Emese Revfy
     

08 Oct, 2016

1 commit

  • When doing an nmi backtrace of many cores, most of which are idle, the
    output is a little overwhelming and very uninformative. Suppress
    messages for cpus that are idling when they are interrupted and just
    emit one line, "NMI backtrace for N skipped: idling at pc 0xNNN".

    We do this by grouping all the cpuidle code together into a new
    .cpuidle.text section, and then checking the address of the interrupted
    PC to see if it lies within that section.

    This commit suitably tags x86 and tile idle routines, and only adds in
    the minimal framework for other architectures.
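
    A rough userspace sketch of that check (the fake array below stands in
    for the linker-provided bounds of .cpuidle.text; all names are
    illustrative): the backtrace is skipped when the interrupted PC falls
    inside the idle text region:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Stand-in for the region a linker script would place around .cpuidle.text. */
    static char fake_cpuidle_text[64];
    #define cpuidle_text_start ((uintptr_t)fake_cpuidle_text)
    #define cpuidle_text_end   ((uintptr_t)(fake_cpuidle_text + sizeof(fake_cpuidle_text)))

    static bool in_cpuidle_text(uintptr_t pc)
    {
        return pc >= cpuidle_text_start && pc < cpuidle_text_end;
    }

    int main(void)
    {
        uintptr_t idle_pc = (uintptr_t)&fake_cpuidle_text[10];
        uintptr_t busy_pc = (uintptr_t)&in_cpuidle_text;

        printf("idle pc skipped: %d\n", in_cpuidle_text(idle_pc));
        printf("busy pc skipped: %d\n", in_cpuidle_text(busy_pc));
        return 0;
    }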

    Link: http://lkml.kernel.org/r/1472487169-14923-5-git-send-email-cmetcalf@mellanox.com
    Signed-off-by: Chris Metcalf
    Acked-by: Peter Zijlstra (Intel)
    Tested-by: Peter Zijlstra (Intel)
    Tested-by: Daniel Thompson [arm]
    Tested-by: Petr Mladek
    Cc: Aaron Tomlin
    Cc: Peter Zijlstra (Intel)
    Cc: "Rafael J. Wysocki"
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Metcalf
     

04 Oct, 2016

3 commits

  • Pull low-level x86 updates from Ingo Molnar:
    "In this cycle this topic tree has become one of those 'super topics'
    that accumulated a lot of changes:

    - Add CONFIG_VMAP_STACK=y support to the core kernel and enable it on
    x86 - preceded by an array of changes. v4.8 saw preparatory changes
    in this area already - this is the rest of the work. Includes the
    thread stack caching performance optimization. (Andy Lutomirski)

    - switch_to() cleanups and all around enhancements. (Brian Gerst)

    - A large number of dumpstack infrastructure enhancements and an
    unwinder abstraction. The secret long term plan is safe(r) live
    patching plus maybe another attempt at debuginfo based unwinding -
    but all these current bits are standalone enhancements in a frame
    pointer based debug environment as well. (Josh Poimboeuf)

    - More __ro_after_init and const annotations. (Kees Cook)

    - Enable KASLR for the vmemmap memory region. (Thomas Garnier)"

    [ The virtually mapped stack changes are pretty fundamental, and not
    x86-specific per se, even if they are only used on x86 right now. ]

    * 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (70 commits)
    x86/asm: Get rid of __read_cr4_safe()
    thread_info: Use unsigned long for flags
    x86/alternatives: Add stack frame dependency to alternative_call_2()
    x86/dumpstack: Fix show_stack() task pointer regression
    x86/dumpstack: Remove dump_trace() and related callbacks
    x86/dumpstack: Convert show_trace_log_lvl() to use the new unwinder
    oprofile/x86: Convert x86_backtrace() to use the new unwinder
    x86/stacktrace: Convert save_stack_trace_*() to use the new unwinder
    perf/x86: Convert perf_callchain_kernel() to use the new unwinder
    x86/unwind: Add new unwind interface and implementations
    x86/dumpstack: Remove NULL task pointer convention
    fork: Optimize task creation by caching two thread stacks per CPU if CONFIG_VMAP_STACK=y
    sched/core: Free the stack early if CONFIG_THREAD_INFO_IN_TASK
    lib/syscall: Pin the task stack in collect_syscall()
    x86/process: Pin the target stack in get_wchan()
    x86/dumpstack: Pin the target stack when dumping it
    kthread: Pin the stack via try_get_task_stack()/put_task_stack() in to_live_kthread() function
    sched/core: Add try_get_task_stack() and put_task_stack()
    x86/entry/64: Fix a minor comment rebase error
    iommu/amd: Don't put completion-wait semaphore on stack
    ...

    Linus Torvalds
     
  • Pull scheduler changes from Ingo Molnar:
    "The main changes are:

    - irqtime accounting cleanups and enhancements. (Frederic Weisbecker)

    - schedstat debugging enhancements, make it more broadly runtime
    available. (Josh Poimboeuf)

    - More work on asymmetric topology/capacity scheduling. (Morten
    Rasmussen)

    - sched/wait fixes and cleanups. (Oleg Nesterov)

    - PELT (per entity load tracking) improvements. (Peter Zijlstra)

    - Rewrite and enhance select_idle_siblings(). (Peter Zijlstra)

    - sched/numa enhancements/fixes (Rik van Riel)

    - sched/cputime scalability improvements (Stanislaw Gruszka)

    - Load calculation arithmetics fixes. (Dietmar Eggemann)

    - sched/deadline enhancements (Tommaso Cucinotta)

    - Fix utilization accounting when switching to the SCHED_NORMAL
    policy. (Vincent Guittot)

    - ... plus misc cleanups and enhancements"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits)
    sched/irqtime: Consolidate irqtime flushing code
    sched/irqtime: Consolidate accounting synchronization with u64_stats API
    u64_stats: Introduce IRQs disabled helpers
    sched/irqtime: Remove needless IRQs disablement on kcpustat update
    sched/irqtime: No need for preempt-safe accessors
    sched/fair: Fix min_vruntime tracking
    sched/debug: Add SCHED_WARN_ON()
    sched/core: Fix set_user_nice()
    sched/fair: Introduce set_curr_task() helper
    sched/core, ia64: Rename set_curr_task()
    sched/core: Fix incorrect utilization accounting when switching to fair class
    sched/core: Optimize SCHED_SMT
    sched/core: Rewrite and improve select_idle_siblings()
    sched/core: Replace sd_busy/nr_busy_cpus with sched_domain_shared
    sched/core: Introduce 'struct sched_domain_shared'
    sched/core: Restructure destroy_sched_domain()
    sched/core: Remove unused @cpu argument from destroy_sched_domain*()
    sched/wait: Introduce init_wait_entry()
    sched/wait: Avoid abort_exclusive_wait() in __wait_on_bit_lock()
    sched/wait: Avoid abort_exclusive_wait() in ___wait_event()
    ...

    Linus Torvalds
     
  • Pull RCU updates from Ingo Molnar:
    "The main changes in this cycle were:

    - Expedited grace-period changes, most notably avoiding having user
    threads drive expedited grace periods, using a workqueue instead.

    - Miscellaneous fixes, including a performance fix for lists that was
    sent with the lists modifications.

    - CPU hotplug updates, most notably providing exact CPU-online
    tracking for RCU. This will in turn allow removal of the checks
    supporting RCU's prior heuristic that was based on the assumption
    that CPUs would take no longer than one jiffy to come online.

    - Torture-test updates.

    - Documentation updates"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits)
    list: Expand list_first_entry_or_null()
    torture: TOROUT_STRING(): Insert a space between flag and message
    rcuperf: Consistently insert space between flag and message
    rcutorture: Print out barrier error as document says
    torture: Add task state to writer-task stall printk()s
    torture: Convert torture_shutdown() to hrtimer
    rcutorture: Convert to hotplug state machine
    cpu/hotplug: Get rid of CPU_STARTING reference
    rcu: Provide exact CPU-online tracking for RCU
    rcu: Avoid redundant quiescent-state chasing
    rcu: Don't use modular infrastructure in non-modular code
    sched: Make wake_up_nohz_cpu() handle CPUs going offline
    rcu: Use rcu_gp_kthread_wake() to wake up grace period kthreads
    rcu: Use RCU's online-CPU state for expedited IPI retry
    rcu: Exclude RCU-offline CPUs from expedited grace periods
    rcu: Make expedited RCU CPU stall warnings respond to controls
    rcu: Stop disabling expedited RCU CPU stall warnings
    rcu: Drive expedited grace periods from workqueue
    rcu: Consolidate expedited grace period machinery
    documentation: Record reason for rcu_head two-byte alignment
    ...

    Linus Torvalds
     

02 Oct, 2016

2 commits

  • * pm-cpufreq: (24 commits)
    cpufreq: st: add missing \n to end of dev_err message
    cpufreq: kirkwood: add missing \n to end of dev_err messages
    cpufreq: CPPC: Avoid overflow when calculating desired_perf
    cpufreq: ti: Use generic platdev driver
    cpufreq: intel_pstate: Add io_boost trace
    cpufreq: intel_pstate: Use IOWAIT flag in Atom algorithm
    cpufreq: schedutil: Add iowait boosting
    cpufreq / sched: SCHED_CPUFREQ_IOWAIT flag to indicate iowait condition
    cpufreq: CPPC: Force reporting values in KHz to fix user space interface
    cpufreq: create link to policy only for registered CPUs
    intel_pstate: constify local structures
    cpufreq: dt: Support governor tunables per policy
    cpufreq: dt: Update kconfig description
    cpufreq: dt: Remove unused code
    MAINTAINERS: Add Documentation/cpu-freq/
    cpufreq: dt: Add support for r8a7792
    cpufreq / sched: ignore SMT when determining max cpu capacity
    cpufreq: Drop unnecessary check from cpufreq_policy_alloc()
    ARM: multi_v7_defconfig: Don't attempt to enable schedutil governor as module
    ARM: exynos_defconfig: Don't attempt to enable schedutil governor as module
    ...

    Rafael J. Wysocki
     
  • Rafael J. Wysocki
     

30 Sep, 2016

19 commits

  • The code performing irqtime nsecs stats flushing to kcpustat is roughly
    the same for hardirq and softirq. So let's consolidate that common code.

    Signed-off-by: Frederic Weisbecker
    Reviewed-by: Rik van Riel
    Cc: Eric Dumazet
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paolo Bonzini
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1474849761-12678-6-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • The irqtime accounting currently implements its own ad hoc version of
    the u64_stats API. Let's consolidate it with the appropriate library.

    Signed-off-by: Frederic Weisbecker
    Reviewed-by: Rik van Riel
    Cc: Eric Dumazet
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paolo Bonzini
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1474849761-12678-5-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • The callers of the functions performing irqtime kcpustat updates have
    IRQs disabled; there is no need to disable them again.

    Signed-off-by: Frederic Weisbecker
    Reviewed-by: Rik van Riel
    Cc: Eric Dumazet
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paolo Bonzini
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1474849761-12678-3-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • We can safely use the preempt-unsafe accessors for irqtime when we
    flush its counters to kcpustat as IRQs are disabled at this time.

    Signed-off-by: Frederic Weisbecker
    Reviewed-by: Rik van Riel
    Cc: Eric Dumazet
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paolo Bonzini
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1474849761-12678-2-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • While going through enqueue/dequeue to review the movement of
    set_curr_task() I noticed that the (2nd) update_min_vruntime() call in
    dequeue_entity() is suspect.

    It turns out it's actually wrong, because it will consider
    cfs_rq->curr, which could be the entity we just normalized. This mixes
    different vruntime forms and leads to failure.

    The purpose of the second update_min_vruntime() is to move
    min_vruntime forward if the entity we just removed is the one that was
    holding it back; _except_ for the DEQUEUE_SAVE case, because then we
    know it's a temporary removal and it will come back.

    However, since we do put_prev_task() _after_ dequeue(), cfs_rq->curr
    will still be set (and per the above, can be transformed into a
    different unit), so update_min_vruntime() should also consider
    curr->on_rq. This also fixes another corner case where the enqueue
    (which also does update_curr()->update_min_vruntime()) happens on the
    rq->lock break in schedule(), between dequeue and put_prev_task.
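
    A simplified userspace sketch of the corrected logic (not the kernel's
    code; the entity layout and numbers are illustrative): a curr that is no
    longer on the runqueue is ignored, and min_vruntime never moves backwards:

    #include <stdio.h>

    struct entity {
        unsigned long long vruntime;
        int on_rq;
    };

    static unsigned long long
    update_min_vruntime_sketch(unsigned long long min_vruntime,
                               const struct entity *curr,
                               const struct entity *leftmost)
    {
        unsigned long long vruntime = min_vruntime;
        int found = 0;

        if (curr && curr->on_rq) {  /* ignore a just-normalized, dequeued curr */
            vruntime = curr->vruntime;
            found = 1;
        }
        if (leftmost) {
            if (!found || leftmost->vruntime < vruntime)
                vruntime = leftmost->vruntime;
            found = 1;
        }
        /* min_vruntime must never move backwards */
        if (found && vruntime > min_vruntime)
            min_vruntime = vruntime;
        return min_vruntime;
    }

    int main(void)
    {
        struct entity curr = { 42, 0 };      /* normalized by dequeue_entity() */
        struct entity leftmost = { 1000, 1 };

        printf("min_vruntime -> %llu\n",
               update_min_vruntime_sketch(900, &curr, &leftmost));
        return 0;
    }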

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Fixes: 1e876231785d ("sched: Fix ->min_vruntime calculation in dequeue_entity()")
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Provide SCHED_WARN_ON() as a wrapper for WARN_ON_ONCE() to avoid
    CONFIG_SCHED_DEBUG wrappery.
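
    A minimal userspace sketch of the idea (the real macro wraps
    WARN_ON_ONCE(); everything below is a stand-in): one macro that only
    emits the warning when scheduler debugging is configured, so call sites
    need no #ifdef:

    #include <stdio.h>

    #define CONFIG_SCHED_DEBUG 1

    #ifdef CONFIG_SCHED_DEBUG
    # define SCHED_WARN_ON(x) \
        ((x) ? (fprintf(stderr, "sched warning: %s\n", #x), 1) : 0)
    #else
    # define SCHED_WARN_ON(x) ((void)(x), 0)
    #endif

    int main(void)
    {
        int queued = 1;

        SCHED_WARN_ON(queued);  /* prints only with CONFIG_SCHED_DEBUG */
        return 0;
    }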

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Almost all scheduler functions update state with the following
    pattern:

    if (queued)
        dequeue_task(rq, p, DEQUEUE_SAVE);
    if (running)
        put_prev_task(rq, p);

    /* update state */

    if (queued)
        enqueue_task(rq, p, ENQUEUE_RESTORE);
    if (running)
        set_curr_task(rq, p);

    set_user_nice() however misses the running part, cure this.

    This was found by asserting we never enqueue 'current'.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Now that the ia64 only set_curr_task() symbol is gone, provide a
    helper just like put_prev_task().

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Rename the ia64 only set_curr_task() function to free up the name.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • When a task switches to fair scheduling class, the period between now
    and the last update of its utilization is accounted as running time
    whatever happened during this period. This incorrect accounting applies
    to the task and also to the task group branch.

    When changing the property of a running task like its list of allowed
    CPUs or its scheduling class, we follow the sequence:

    - dequeue task
    - put task
    - change the property
    - set task as current task
    - enqueue task

    The end of the sequence doesn't follow the normal sequence (as per
    __schedule()) which is:

    - enqueue a task
    - then set the task as current task.

    This incorrect ordering is the root cause of the incorrect utilization
    accounting. Update the sequence to follow the right one:

    - dequeue task
    - put task
    - change the property
    - enqueue task
    - set task as current task

    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Morten.Rasmussen@arm.com
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: bsegall@google.com
    Cc: dietmar.eggemann@arm.com
    Cc: linaro-kernel@lists.linaro.org
    Cc: pjt@google.com
    Cc: yuyang.du@intel.com
    Link: http://lkml.kernel.org/r/1473666472-13749-8-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar

    Vincent Guittot
     
  • Avoid pointless SCHED_SMT code when running on !SMT hardware.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • select_idle_siblings() is a known pain point for a number of
    workloads; it either does too much or not enough and sometimes just
    does plain wrong.

    This rewrite attempts to address a number of issues (but sadly not
    all).

    The current code does an unconditional sched_domain iteration; with
    the intent of finding an idle core (on SMT hardware). The problems
    which this patch tries to address are:

    - it's pointless to look for idle cores if the machine is really busy;
    at that point you're just wasting cycles.

    - its behaviour is inconsistent between SMT and !SMT hardware, in
    that !SMT hardware ends up doing a scan for any idle CPU in the LLC
    domain, while SMT hardware does a scan for idle cores and, if that
    fails, falls back to a scan for idle threads on the 'target' core.

    The new code replaces the sched_domain scan with 3 explicit scans:

    1) search for an idle core in the LLC
    2) search for an idle CPU in the LLC
    3) search for an idle thread in the 'target' core

    where 1 and 3 are conditional on SMT support and 1 and 2 have runtime
    heuristics to skip the step.

    Step 1) is conditional on sd_llc_shared->has_idle_cores; when a cpu
    goes idle and sd_llc_shared->has_idle_cores is false, we scan all SMT
    siblings of the CPU going idle. Similarly, we clear
    sd_llc_shared->has_idle_cores when we fail to find an idle core.

    Step 2) tracks the average cost of the scan and compares this to the
    average idle time guestimate for the CPU doing the wakeup. There is a
    significant fudge factor involved to deal with the variability of the
    averages. Esp. hackbench was sensitive to this.

    Step 3) is unconditional; we assume (also per step 1) that scanning
    all SMT siblings in a core is 'cheap'.

    With this; SMT systems gain step 2, which cures a few benchmarks --
    notably one from Facebook.

    One 'feature' of the sched_domain iteration, which we preserve in the
    new code, is that it would start scanning from the 'target' CPU,
    instead of scanning the cpumask in cpu id order. This keeps multiple
    CPUs in the LLC that are scanning for an idle CPU from ganging up and
    finding the same CPU quite as often. The downside is that tasks can end
    up hopping across the LLC for no apparent reason.
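
    A rough userspace outline of that three-step structure (the scan helpers
    are stubs, every name is made up, and the cost/idle-time heuristics of
    step 2 are elided):

    #include <stdbool.h>
    #include <stdio.h>

    /* Stubs standing in for the real scans; each returns a CPU id or -1. */
    static int scan_for_idle_core(int target)        { (void)target; return -1; }
    static int scan_for_idle_cpu(int target)         { (void)target; return 5;  }
    static int scan_target_core_threads(int target)  { (void)target; return -1; }

    static bool smt_enabled = true;

    static int select_idle_sibling_sketch(int target)
    {
        int cpu;

        if (smt_enabled) {                     /* 1) idle core in the LLC */
            cpu = scan_for_idle_core(target);
            if (cpu >= 0)
                return cpu;
        }

        cpu = scan_for_idle_cpu(target);       /* 2) idle CPU in the LLC */
        if (cpu >= 0)
            return cpu;

        if (smt_enabled) {                     /* 3) idle thread in target core */
            cpu = scan_target_core_threads(target);
            if (cpu >= 0)
                return cpu;
        }

        return target;                         /* fall back to the target CPU */
    }

    int main(void)
    {
        printf("selected CPU: %d\n", select_idle_sibling_sketch(3));
        return 0;
    }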

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Move the nr_busy_cpus thing from its hacky sd->parent->groups->sgc
    location into the much more natural sched_domain_shared location.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Since struct sched_domain is strictly per CPU, introduce a structure
    that is shared between all 'identical' sched_domains.

    Limit to SD_SHARE_PKG_RESOURCES domains for now, as we'll only use it
    for shared cache state; if another use comes up later we can easily
    relax this.

    While the sched_groups are normally shared between CPUs, they are not
    natural to use when we need some shared state on a domain level --
    since that would require the domain to have a parent, which is not a
    given.
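
    A rough sketch of the shape this takes, with the fields mentioned here
    and in the neighbouring patches (the exact layout is an assumption, not
    a copy of the kernel's struct):

    #include <stdatomic.h>
    #include <stdio.h>

    struct sched_domain_shared {
        atomic_int ref;             /* shared by all 'identical' domains */
        atomic_int nr_busy_cpus;    /* moved here from sd->parent->groups->sgc */
        int        has_idle_cores;  /* consumed by select_idle_siblings() */
    };

    int main(void)
    {
        struct sched_domain_shared llc_shared;

        atomic_init(&llc_shared.ref, 1);
        atomic_init(&llc_shared.nr_busy_cpus, 0);
        llc_shared.has_idle_cores = 0;

        printf("ref=%d busy=%d idle_cores=%d\n",
               atomic_load(&llc_shared.ref),
               atomic_load(&llc_shared.nr_busy_cpus),
               llc_shared.has_idle_cores);
        return 0;
    }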

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • There is no point in doing a call_rcu() for each domain; only do a
    callback for the root sched domain and clean up the entire set in one
    go.

    Also make the entire call chain be called destroy_sched_domain*() to
    remove confusion with the free_sched_domains() call, which does an
    entirely different thing.

    Both cpu_attach_domain() callers of destroy_sched_domain() can live
    without the call_rcu() because at those points the sched_domain hasn't
    been published yet.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Small cleanup; nothing uses the @cpu argument so make it go away.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The partial initialization of wait_queue_t in prepare_to_wait_event() looks
    ugly. This was done to shrink .text, but we can simply add the new helper
    which does the full initialization and shrink the compiled code a bit more.

    And this way prepare_to_wait_event() can have more users. In particular we
    are ready to remove the signal_pending_state() checks from wait_bit_action_f
    helpers and change __wait_on_bit_lock() to use prepare_to_wait_event().

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Al Viro
    Cc: Bart Van Assche
    Cc: Johannes Weiner
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20160906140055.GA6167@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • __wait_on_bit_lock() doesn't need abort_exclusive_wait() either. Right
    now it can't use prepare_to_wait_event() (see the next change), but
    it can do the additional finish_wait() if action() fails.

    abort_exclusive_wait() no longer has callers, remove it.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Al Viro
    Cc: Bart Van Assche
    Cc: Johannes Weiner
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20160906140053.GA6164@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • ___wait_event() doesn't really need abort_exclusive_wait(), we can simply
    change prepare_to_wait_event() to remove the waiter from q->task_list if
    it was interrupted.

    This simplifies the code/logic, and this way prepare_to_wait_event() can
    have more users, see the next change.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Al Viro
    Cc: Bart Van Assche
    Cc: Johannes Weiner
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20160908164815.GA18801@redhat.com
    Signed-off-by: Ingo Molnar
    --
    include/linux/wait.h | 7 +------
    kernel/sched/wait.c | 35 +++++++++++++++++++++++++----------
    2 files changed, 26 insertions(+), 16 deletions(-)

    Oleg Nesterov