05 Jun, 2014

11 commits

  • It is better not to think about compute capacity as being equivalent
    to "CPU power". The upcoming "power aware" scheduler work may create
    confusion with the notion of energy consumption if "power" is used too
    liberally.

    Let's rename the following feature flags since they do relate to capacity:

    SD_SHARE_CPUPOWER -> SD_SHARE_CPUCAPACITY
    ARCH_POWER -> ARCH_CAPACITY
    NONTASK_POWER -> NONTASK_CAPACITY

    Signed-off-by: Nicolas Pitre
    Signed-off-by: Peter Zijlstra
    Cc: Vincent Guittot
    Cc: Daniel Lezcano
    Cc: Morten Rasmussen
    Cc: "Rafael J. Wysocki"
    Cc: linaro-kernel@lists.linaro.org
    Cc: Andy Fleming
    Cc: Anton Blanchard
    Cc: Benjamin Herrenschmidt
    Cc: Grant Likely
    Cc: Linus Torvalds
    Cc: Michael Ellerman
    Cc: Paul Gortmaker
    Cc: Paul Mackerras
    Cc: Preeti U Murthy
    Cc: Rob Herring
    Cc: Srivatsa S. Bhat
    Cc: Toshi Kani
    Cc: Vasant Hegde
    Cc: Vincent Guittot
    Cc: devicetree@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Link: http://lkml.kernel.org/n/tip-e93lpnxb87owfievqatey6b5@git.kernel.org
    Signed-off-by: Ingo Molnar

    Nicolas Pitre
     
  • It is better not to think about compute capacity as being equivalent
    to "CPU power". The upcoming "power aware" scheduler work may create
    confusion with the notion of energy consumption if "power" is used too
    liberally.

    This contains the architecture-visible changes. Incidentally, only
    ARM takes advantage of the available pow^H^H^Hcapacity scaling hooks
    and therefore those changes outside kernel/sched/ are confined to
    one ARM-specific file. The default arch_scale_smt_power() hook is
    not overridden by anyone.

    Replacements are as follows:

    arch_scale_freq_power --> arch_scale_freq_capacity
    arch_scale_smt_power --> arch_scale_smt_capacity
    SCHED_POWER_SCALE --> SCHED_CAPACITY_SCALE
    SCHED_POWER_SHIFT --> SCHED_CAPACITY_SHIFT

    The local usage of "power" in arch/arm/kernel/topology.c is also changed
    to "capacity" as appropriate.

    Signed-off-by: Nicolas Pitre
    Signed-off-by: Peter Zijlstra
    Cc: Vincent Guittot
    Cc: Daniel Lezcano
    Cc: Morten Rasmussen
    Cc: "Rafael J. Wysocki"
    Cc: linaro-kernel@lists.linaro.org
    Cc: Arnd Bergmann
    Cc: Dietmar Eggemann
    Cc: Grant Likely
    Cc: Linus Torvalds
    Cc: Mark Brown
    Cc: Rob Herring
    Cc: Russell King
    Cc: Sudeep KarkadaNagesha
    Cc: Vincent Guittot
    Cc: devicetree@vger.kernel.org
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: linux-kernel@vger.kernel.org
    Link: http://lkml.kernel.org/n/tip-48zba9qbznvglwelgq2cfygh@git.kernel.org
    Signed-off-by: Ingo Molnar

    Nicolas Pitre
     
  • It is better not to think about compute capacity as being equivalent
    to "CPU power". The upcoming "power aware" scheduler work may create
    confusion with the notion of energy consumption if "power" is used too
    liberally.

    This is the remaining "power" -> "capacity" rename for local symbols.
    Those symbols visible to the rest of the kernel are not included yet.

    Signed-off-by: Nicolas Pitre
    Signed-off-by: Peter Zijlstra
    Cc: Vincent Guittot
    Cc: Daniel Lezcano
    Cc: Morten Rasmussen
    Cc: "Rafael J. Wysocki"
    Cc: linaro-kernel@lists.linaro.org
    Cc: Linus Torvalds
    Cc: linux-kernel@vger.kernel.org
    Link: http://lkml.kernel.org/n/tip-yyyhohzhkwnaotr3lx8zd5aa@git.kernel.org
    Signed-off-by: Ingo Molnar

    Nicolas Pitre
     
  • It is better not to think about compute capacity as being equivalent
    to "CPU power". The upcoming "power aware" scheduler work may create
    confusion with the notion of energy consumption if "power" is used too
    liberally.

    Since struct sched_group_power is really about compute capacity of sched
    groups, let's rename it to struct sched_group_capacity. Similarly sgp
    becomes sgc. Related variables and functions dealing with groups are also
    adjusted accordingly.

    Signed-off-by: Nicolas Pitre
    Signed-off-by: Peter Zijlstra
    Cc: Vincent Guittot
    Cc: Daniel Lezcano
    Cc: Morten Rasmussen
    Cc: "Rafael J. Wysocki"
    Cc: linaro-kernel@lists.linaro.org
    Cc: Linus Torvalds
    Cc: linux-kernel@vger.kernel.org
    Link: http://lkml.kernel.org/n/tip-5yeix833vvgf2uyj5o36hpu9@git.kernel.org
    Signed-off-by: Ingo Molnar

    Nicolas Pitre
     
  • We have "power" (which should actually become "capacity") and
    "capacity", which is a scaled-down "capacity factor" in terms of
    unitary tasks. Let's use "capacity_factor" to make room for proper
    usage of "capacity" later.

    Signed-off-by: Nicolas Pitre
    Signed-off-by: Peter Zijlstra
    Cc: Vincent Guittot
    Cc: Daniel Lezcano
    Cc: Morten Rasmussen
    Cc: "Rafael J. Wysocki"
    Cc: linaro-kernel@lists.linaro.org
    Cc: Linus Torvalds
    Cc: linux-kernel@vger.kernel.org
    Link: http://lkml.kernel.org/n/tip-gk1co8sqdev3763opqm6ovml@git.kernel.org
    Signed-off-by: Ingo Molnar

    Nicolas Pitre
     
  • The capacity of a CPU/group should be some intrinsic value that
    doesn't change with task placement. It is like a container whose
    capacity is stable regardless of the amount of liquid in it (its
    "utilization")... unless the container itself is crushed, that is,
    but that's another story.

    Therefore let's rename "has_capacity" to "has_free_capacity" in order to
    better convey the wanted meaning.

    Signed-off-by: Nicolas Pitre
    Signed-off-by: Peter Zijlstra
    Cc: Vincent Guittot
    Cc: Daniel Lezcano
    Cc: Morten Rasmussen
    Cc: "Rafael J. Wysocki"
    Cc: linaro-kernel@lists.linaro.org
    Cc: Linus Torvalds
    Cc: linux-kernel@vger.kernel.org
    Link: http://lkml.kernel.org/n/tip-djzkk027jm0e8x8jxy70opzh@git.kernel.org
    Signed-off-by: Ingo Molnar

    Nicolas Pitre
     
  • It is better not to think about compute capacity as being equivalent
    to "CPU power". The upcoming "power aware" scheduler work may create
    confusion with the notion of energy consumption if "power" is used too
    liberally.

    To make things explicit and not create more confusion with the existing
    "capacity" member, let's rename things as follows:

    power -> compute_capacity
    capacity -> task_capacity

    Note: none of those fields are actually used outside update_numa_stats().

    Signed-off-by: Nicolas Pitre
    Signed-off-by: Peter Zijlstra
    Cc: Vincent Guittot
    Cc: Daniel Lezcano
    Cc: Morten Rasmussen
    Cc: "Rafael J. Wysocki"
    Cc: linaro-kernel@lists.linaro.org
    Cc: Linus Torvalds
    Cc: linux-kernel@vger.kernel.org
    Link: http://lkml.kernel.org/n/tip-2e2ndymj5gyshyjq8am79f20@git.kernel.org
    Signed-off-by: Ingo Molnar

    Nicolas Pitre
     
  • yield_to() is supposed to return -ESRCH if there is no task to
    yield to, but because the type is bool that is the same as returning
    true.

    The only place I see which cares is kvm_vcpu_on_spin().
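
    The pitfall in isolation, as a small user-space C demo (this is not
    the kernel code, only the type conversion): -ESRCH is non-zero, and
    converting any non-zero value to bool yields true.

        #include <errno.h>
        #include <stdbool.h>
        #include <stdio.h>

        static bool broken_yield_to(void)
        {
                return -ESRCH;  /* meant as an error, but bool collapses it to true */
        }

        int main(void)
        {
                /* prints 1: callers cannot tell this "error" from success */
                printf("%d\n", broken_yield_to());
                return 0;
        }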

    Signed-off-by: Dan Carpenter
    Reviewed-by: Raghavendra
    Signed-off-by: Peter Zijlstra
    Cc: Gleb Natapov
    Cc: Linus Torvalds
    Cc: Paolo Bonzini
    Cc: kvm@vger.kernel.org
    Link: http://lkml.kernel.org/r/20140523102042.GA7267@mwanda
    Signed-off-by: Ingo Molnar

    Dan Carpenter
     
  • To be future-proof and for better readability, the time comparisons
    are modified to use time_after() instead of plain, error-prone math.
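
    As a rough user-space model of what time_after() buys (the kernel
    macro lives in include/linux/jiffies.h; this simplified copy keeps
    only the signed-difference trick):

        #include <stdio.h>

        #define time_after(a, b)  ((long)((b) - (a)) < 0)

        int main(void)
        {
                unsigned long b = -10UL;  /* just before wraparound */
                unsigned long a = b + 20; /* wraps to a small value */

                printf("plain a > b     : %d\n", a > b);            /* 0: wrong */
                printf("time_after(a, b): %d\n", time_after(a, b)); /* 1: right */
                return 0;
        }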

    Signed-off-by: Manuel Schölling
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1400780723-24626-1-git-send-email-manuel.schoelling@gmx.de
    Signed-off-by: Ingo Molnar

    Manuel Schölling
     
  • The current nohz idle load balancer does load balancing for *all*
    idle cpus, even though the next load balance for a particular idle
    cpu could still be a while in the future. This introduces a much
    higher load balancing rate than necessary. The patch changes the
    behavior so that idle load balancing is done on behalf of an idle
    cpu only when it is due for load balancing.

    On SGI's systems with over 3000 cores, the cpu responsible for idle
    balancing got overwhelmed with idle balancing and introduced a lot
    of OS noise to workloads. This patch fixes the issue.

    Signed-off-by: Tim Chen
    Acked-by: Russ Anderson
    Reviewed-by: Rik van Riel
    Reviewed-by: Jason Low
    Signed-off-by: Peter Zijlstra
    Cc: Andrew Morton
    Cc: Len Brown
    Cc: Dimitri Sivanich
    Cc: Hedi Berriche
    Cc: Andi Kleen
    Cc: Michel Lespinasse
    Cc: Peter Hurley
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1400621967.2970.280.camel@schen9-DESK
    Signed-off-by: Ingo Molnar

    Tim Chen
     
  • sched_cfs_period_timer() reads cfs_b->period without locks before calling
    do_sched_cfs_period_timer(), and similarly unthrottle_offline_cfs_rqs()
    would read cfs_b->period without the right lock. Thus a simultaneous
    change of bandwidth could cause corruption on any platform where ktime_t
    or u64 writes/reads are not atomic.

    Extend cfs_b->lock from do_sched_cfs_period_timer() to include the read of
    cfs_b->period to solve that issue; unthrottle_offline_cfs_rqs() can just
    use 1 rather than the exact quota, much like distribute_cfs_runtime()
    does.

    There is also an unlocked read of cfs_b->runtime_expires, but a race
    there would only delay runtime expiry by a tick. Still, the comparison
    should just be != anyway, which clarifies even that problem.
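
    The locking pattern, sketched in user-space terms only (a pthread
    mutex stands in for cfs_b->lock, and a plain struct for the
    bandwidth state; this is an analogy, not the kernel code):

        #include <pthread.h>
        #include <stdint.h>

        struct bandwidth {
                pthread_mutex_t lock;
                uint64_t period;  /* written under 'lock' when bandwidth changes */
        };

        /* Before: a torn value is possible where u64 loads aren't atomic */
        static uint64_t read_period_racy(struct bandwidth *b)
        {
                return b->period;
        }

        /* After: the read is covered by the same lock the writers hold */
        static uint64_t read_period_locked(struct bandwidth *b)
        {
                pthread_mutex_lock(&b->lock);
                uint64_t p = b->period;
                pthread_mutex_unlock(&b->lock);
                return p;
        }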

    Signed-off-by: Ben Segall
    Tested-by: Roman Gushchin
    [peterz: Fix compile warn]
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20140519224945.20303.93530.stgit@sword-of-the-dawn.mtv.corp.google.com
    Cc: pjt@google.com
    Cc: Linus Torvalds
    Signed-off-by: Ingo Molnar

    Ben Segall
     

22 May, 2014

29 commits

  • Affine wakeups have the potential to interfere with NUMA placement.
    If a task wakes up too many other tasks, affine wakeups will get
    disabled.

    However, regardless of how many other tasks it wakes up, it gets
    re-enabled once a second, potentially interfering with NUMA
    placement of other tasks.

    By decaying wakee_wakes in half instead of zeroing it, we can avoid
    that problem for some workloads.
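
    A runnable toy showing the difference between zeroing and halving
    (the field name follows the changelog; the real code lives in the
    scheduler's wakeup path):

        #include <stdio.h>

        int main(void)
        {
                unsigned int wakee_wakes = 100;  /* a task that woke many others */

                /* old behavior: wakee_wakes = 0; history vanishes instantly */

                /* new behavior: decay in half once a second */
                for (int sec = 1; sec <= 5; sec++) {
                        wakee_wakes >>= 1;
                        printf("second %d: wakee_wakes = %u\n", sec, wakee_wakes);
                }
                /* 50, 25, 12, 6, 3: heavy wakers fade out gradually, so
                   affine wakeups are not abruptly re-enabled every second */
                return 0;
        }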

    Signed-off-by: Rik van Riel
    Signed-off-by: Peter Zijlstra
    Cc: chegu_vinod@hp.com
    Cc: umgwanakikbuti@gmail.com
    Link: http://lkml.kernel.org/r/20140516001332.67f91af2@annuminas.surriel.com
    Signed-off-by: Ingo Molnar

    Rik van Riel
     
  • Update the migrate_improves/degrades_locality() functions with
    knowledge of pseudo-interleaving.

    Do not consider moving tasks around within the set of a group's
    active nodes as improving or degrading locality. Instead, leave the
    load balancer free to balance the load between a numa_group's
    active nodes.

    Also, switch from the group/task_weight functions to the group/task_fault
    functions. The "weight" functions involve a division, but both calls use
    the same divisor, so there's no point in doing that from these functions.

    On a 4 node (x10 core) system, performance of SPECjbb2005 seems
    unaffected, though the number of migrations with 2 8-warehouse wide
    instances seems to have almost halved, due to the scheduler running
    each instance on a single node.

    Signed-off-by: Rik van Riel
    Signed-off-by: Peter Zijlstra
    Cc: mgorman@suse.de
    Cc: chegu_vinod@hp.com
    Link: http://lkml.kernel.org/r/20140515130306.61aae7db@cuia.bos.redhat.com
    Signed-off-by: Ingo Molnar

    Rik van Riel
     
  • Currently the NUMA balancing code only allows moving tasks between NUMA
    nodes when the load on both nodes is in balance. This breaks down when
    the load was imbalanced to begin with.

    Allow tasks to be moved between NUMA nodes if the imbalance is
    small, or if the new imbalance is smaller than the original one.
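
    A hedged sketch of that acceptance rule; the function name and the
    small-imbalance threshold parameter are illustrative, not the
    kernel's exact code:

        /* Accept a NUMA move if the resulting imbalance is small,
         * or at least no worse than the imbalance we started with. */
        static int numa_move_acceptable(long src_load, long dst_load,
                                        long task_load, long small_imbalance)
        {
                long old_imb = src_load - dst_load;
                long new_imb = (src_load - task_load) - (dst_load + task_load);

                if (old_imb < 0)
                        old_imb = -old_imb;
                if (new_imb < 0)
                        new_imb = -new_imb;

                return new_imb <= small_imbalance || new_imb <= old_imb;
        }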

    Suggested-by: Peter Zijlstra
    Signed-off-by: Rik van Riel
    Signed-off-by: Peter Zijlstra
    Cc: mgorman@suse.de
    Cc: chegu_vinod@hp.com
    Signed-off-by: Ingo Molnar
    Link: http://lkml.kernel.org/r/20140514132221.274b3463@annuminas.surriel.com

    Rik van Riel
     
  • … current upstream code

    Signed-off-by: xiaofeng.yan <xiaofeng.yan@huawei.com>
    Signed-off-by: Peter Zijlstra <peterz@infradead.org>
    Link: http://lkml.kernel.org/r/1399605687-18094-1-git-send-email-xiaofeng.yan@huawei.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    xiaofeng.yan
     
  • …o_rlimit() and rlimit_to_nice()

    Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
    Signed-off-by: Peter Zijlstra <peterz@infradead.org>
    Link: http://lkml.kernel.org/r/a568a1e3cc8e78648f41b5035fa5e381d36274da.1399532322.git.yangds.fnst@cn.fujitsu.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Dongsheng Yang
     
  • If the sched_clock time starts at a large value, the kernel will spin
    in sched_avg_update for a long time while rq->age_stamp catches up
    with rq->clock.

    The comment in kernel/sched/clock.c says that there is no strict promise
    that it starts at zero. So initialize rq->age_stamp when a cpu starts up
    to avoid this.

    I was seeing long delays on a simulator that didn't start the clock at
    zero. This might also be an issue on reboots on processors that don't
    re-initialize the timer to zero on reset, and when using kexec.
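
    A user-space model of the symptom (sched_avg_update() advances
    age_stamp one period at a time until it catches the clock; the
    numbers here are illustrative):

        #include <stdio.h>

        #define PERIOD 1000000ULL               /* ns per update period */

        int main(void)
        {
                unsigned long long clock = 1ULL << 40;  /* clock starts large */
                unsigned long long age_stamp = 0;       /* never initialized */
                unsigned long long spins = 0;

                while (clock - age_stamp > PERIOD) {    /* catch-up loop */
                        age_stamp += PERIOD;
                        spins++;
                }
                printf("catch-up iterations: %llu\n", spins);  /* about a million */

                /* the fix: on cpu-up, start from the current clock instead */
                age_stamp = clock;
                return 0;
        }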

    Signed-off-by: Corey Minyard
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1399574859-11714-1-git-send-email-minyard@acm.org
    Signed-off-by: Ingo Molnar

    Corey Minyard
     
  • Sometimes ->nr_running may cross 2 while no interrupt is sent to the
    rq's cpu, in which case we don't reenable the timer. This looks like
    a possible cause of rare unexpected effects when nohz is enabled.

    This patch replaces all direct modifications of nr_running with
    add_nr_running(), which takes care of the border crossing.
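
    A hedged sketch of the helper's shape (simplified; the real version
    also deals with nohz-full specifics):

        struct rq_model {
                unsigned int nr_running;
                int cpu;
        };

        /* Every nr_running increment funnels through here, so the
         * 1 -> 2 crossing can no longer be missed. */
        static void add_nr_running(struct rq_model *rq, unsigned int count)
        {
                unsigned int prev = rq->nr_running;

                rq->nr_running = prev + count;

                if (prev < 2 && rq->nr_running >= 2) {
                        /* border crossed: kick rq->cpu so the tick
                         * gets re-enabled (e.g. a reschedule IPI) */
                }
        }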

    Signed-off-by: Kirill Tkhai
    Acked-by: Frederic Weisbecker
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20140508225830.2469.97461.stgit@localhost
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
  • Currently, in idle_balance(), we update rq->next_balance when we
    pull tasks. However, it is also important to update this in the case
    where no tasks are pulled.

    When the CPU is "busy" (the CPU isn't idle), rq->next_balance gets
    computed using sd->busy_factor (so we increase the balance interval
    when the CPU is busy). However, when the CPU goes idle,
    rq->next_balance could still be set to a large value that was
    computed with the sd->busy_factor.

    Thus, we also need to update rq->next_balance in idle_balance() when
    no tasks were pulled, so that rq->next_balance gets updated without
    taking the busy_factor into account when the CPU is about to go
    idle.

    This patch makes rq->next_balance get updated independently of
    whether or not we pulled a task. Also, we add logic to ensure that
    we always traverse at least 1 of the sched domains to get a proper
    next_balance value for updating rq->next_balance.

    Additionally, since load_balance() modifies the sd->balance_interval, we
    need to re-obtain the sched domain's interval after the call to
    load_balance() in rebalance_domains() before we update rq->next_balance.

    This patch adds and uses 2 new helper functions, update_next_balance() and
    get_sd_balance_interval() to update next_balance and obtain the sched
    domain's balance_interval.
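
    A hedged sketch of what the two helpers are for; the signatures and
    types here are illustrative, not the kernel's:

        /* Keep the earliest next_balance seen while walking the domains */
        static void update_next_balance(unsigned long sd_next_balance,
                                        unsigned long *next_balance)
        {
                if ((long)(sd_next_balance - *next_balance) < 0)
                        *next_balance = sd_next_balance;
        }

        /* Re-read the interval after load_balance() may have changed it,
         * and only scale by busy_factor when the cpu isn't going idle */
        static unsigned long get_sd_balance_interval(unsigned long interval,
                                                     unsigned int busy_factor,
                                                     int cpu_busy)
        {
                return cpu_busy ? interval * busy_factor : interval;
        }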

    Signed-off-by: Jason Low
    Reviewed-by: Preeti U Murthy
    Signed-off-by: Peter Zijlstra
    Cc: daniel.lezcano@linaro.org
    Cc: alex.shi@linaro.org
    Cc: efault@gmx.de
    Cc: vincent.guittot@linaro.org
    Cc: morten.rasmussen@arm.com
    Cc: aswin@hp.com
    Link: http://lkml.kernel.org/r/1399596562.2200.7.camel@j-VirtualBox
    Signed-off-by: Ingo Molnar

    Jason Low
     
  • Suggested-by: Kees Cook
    Signed-off-by: Dongsheng Yang
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1399541715-19568-1-git-send-email-yangds.fnst@cn.fujitsu.com
    Signed-off-by: Ingo Molnar

    Dongsheng Yang
     
  • There is no need to zero struct sched_group member cpumask and struct
    sched_group_power member power since both structures are already allocated
    as zeroed memory in __sdt_alloc().

    This patch has been tested with
    BUG_ON(!cpumask_empty(sched_group_cpus(sg))); and BUG_ON(sg->sgp->power);
    in build_sched_groups() on ARM TC2 and Intel i5 M520 platforms,
    including CPU hotplug scenarios.

    Signed-off-by: Dietmar Eggemann
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1398865178-12577-1-git-send-email-dietmar.eggemann@arm.com
    Signed-off-by: Ingo Molnar

    Dietmar Eggemann
     
  • Jet Chen reported a kernel panic when booting qemu-system-x86_64
    with a kvm64 cpu. The panic occurred while building the
    sched_domains.

    In sched_init_numa, we create a new topology table in which both
    default levels and numa levels are copied. The last row of the table
    must have a NULL pointer in the mask field.

    The current implementation doesn't account for this last row when
    computing the table size. So add 1 row to the allocation size; it
    will be used as the last row of the table, and the kzalloc ensures
    that its mask field is NULL.

    Reported-by: Jet Chen
    Tested-by: Jet Chen
    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra
    Cc: fengguang.wu@intel.com
    Link: http://lkml.kernel.org/r/1399972261-25693-1-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar

    Vincent Guittot
     
  • On smaller systems, the top level sched domain will be an affine
    domain, and select_idle_sibling is invoked for every SD_WAKE_AFFINE
    wakeup. This seems to be working well.

    On larger systems, with the node distance between far away NUMA nodes
    being > RECLAIM_DISTANCE, select_idle_sibling is only called if the
    waker and the wakee are on nodes less than RECLAIM_DISTANCE apart.

    This patch leaves in place the policy of not pulling the task across
    nodes on such systems, while fixing the issue that select_idle_sibling
    is not called at all in certain circumstances.

    The code will look for an idle CPU in the same CPU package as the
    CPU where the task ran previously.

    Signed-off-by: Rik van Riel
    Signed-off-by: Peter Zijlstra
    Cc: morten.rasmussen@arm.com
    Cc: george.mccollister@gmail.com
    Cc: ktkhai@parallels.com
    Cc: Mel Gorman
    Cc: Mike Galbraith
    Link: http://lkml.kernel.org/r/20140514114037.2d93266f@annuminas.surriel.com
    Signed-off-by: Ingo Molnar

    Rik van Riel
     
  • Gotos are chained pointlessly here, and the 'out' label
    can be dispensed with.

    Signed-off-by: Michael Kerrisk
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/536CEC29.9090503@gmail.com
    Signed-off-by: Ingo Molnar

    Michael Kerrisk
     
  • The logic in this function is a little contorted; clean it up:

    * Rather than having chained gotos for the -EFBIG case, just
    return -EFBIG directly.

    * Now, the label 'out' is no longer needed, and 'ret' must be zero
    by the time we fall through to this point, so just return 0.

    Signed-off-by: Michael Kerrisk
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/536CEC24.9080201@gmail.com
    Signed-off-by: Ingo Molnar

    Michael Kerrisk
     
  • task_hot() checks exec_start on any runnable task, but if the task
    has been migrated since it last ran, then exec_start is a clock_task
    value from another cpu. If the old cpu's clock_task was sufficiently
    far ahead of this cpu's, the task will not be considered for another
    migration until it has run. Instead, reset exec_start whenever a
    task is migrated, since it is presumably no longer hot anyway.
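
    Both halves of the problem in a hedged sketch (names and the
    migration-cost constant are illustrative):

        /* task_hot()-style check: ran recently => cache still warm */
        static int task_hot(long long now, long long exec_start,
                            long long migration_cost)
        {
                return now - exec_start < migration_cost;
        }

        /* The fix, conceptually: on migration, drop the stale timestamp,
         *
         *     p->se.exec_start = 0;
         *
         * so the first check on the new cpu sees a huge delta and the
         * task is never spuriously kept "hot" by another cpu's clock. */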

    Signed-off-by: Ben Segall
    [ Made it compile. ]
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140515225920.7179.13924.stgit@sword-of-the-dawn.mtv.corp.google.com
    Signed-off-by: Ingo Molnar

    Ben Segall
     
  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • …l/linux-pm into sched/core

    Pull scheduling related CPU idle updates from Rafael J. Wysocki.

    Conflicts:
    kernel/sched/idle.c

    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     
  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • The only idle method for arm64 is WFI and it therefore
    unconditionally requires the reschedule interrupt when idle.

    Suggested-by: Catalin Marinas
    Signed-off-by: Peter Zijlstra
    Acked-by: Catalin Marinas
    Link: http://lkml.kernel.org/r/20140509170649.GG13658@twins.programming.kicks-ass.net
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The Meta idle function jumps into the interrupt handler, which
    efficiently blocks waiting for the next interrupt when it reads the
    interrupt status register (TXSTATI). No other (polling) idle
    functions can be used; therefore TIF_POLLING_NRFLAG is unnecessary,
    so let's remove it.

    Peter Zijlstra said:
    > Most archs have (x86) hlt or (arm) wfi like idle instructions, and if
    > that is your only possible idle function, you'll require the interrupt
    > to wake up and there's really no point to having the POLLING bit.

    Signed-off-by: James Hogan
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/536CEB7E.9080007@imgtec.com
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    James Hogan
     
  • Lai found that:

    WARNING: CPU: 1 PID: 13 at arch/x86/kernel/smp.c:124 native_smp_send_reschedule+0x2d/0x4b()
    ...
    migration_cpu_stop+0x1d/0x22

    was caused by set_cpus_allowed_ptr() assuming that cpu_active_mask is
    always a sub-set of cpu_online_mask.

    This isn't true since 5fbd036b552f ("sched: Cleanup cpu_active madness").

    So set active and online at the same time to avoid this particular
    problem.

    Fixes: 5fbd036b552f ("sched: Cleanup cpu_active madness")
    Signed-off-by: Lai Jiangshan
    Signed-off-by: Peter Zijlstra
    Cc: Andrew Morton
    Cc: Gautham R. Shenoy
    Cc: Linus Torvalds
    Cc: Michael wang
    Cc: Paul Gortmaker
    Cc: Rafael J. Wysocki
    Cc: Srivatsa S. Bhat
    Cc: Toshi Kani
    Link: http://lkml.kernel.org/r/53758B12.8060609@cn.fujitsu.com
    Signed-off-by: Ingo Molnar

    Lai Jiangshan
     
  • Tejun reported that his resume was failing due to order-3 allocations
    from sched_domain building.

    Replace the NR_CPUS arrays in there with a dynamically allocated
    array.
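
    The shape of the change, sketched (structure and field names are
    illustrative; the kernel allocates with kzalloc, calloc here):

        #include <stdlib.h>

        /* Before: an embedded NR_CPUS-sized array inflates the containing
         * structure until its allocation needs several contiguous pages. */
        struct before {
                int elements[4096 /* NR_CPUS */];
        };

        /* After: size the array by the cpus actually possible */
        struct after {
                int *elements;
        };

        static int after_init(struct after *a, unsigned int nr_cpu_ids)
        {
                a->elements = calloc(nr_cpu_ids, sizeof(*a->elements));
                return a->elements ? 0 : -1;
        }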

    Reported-by: Tejun Heo
    Signed-off-by: Peter Zijlstra
    Cc: Johannes Weiner
    Cc: Steven Rostedt
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/n/tip-7cysnkw1gik45r864t1nkudh@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Tejun reported that his resume was failing due to order-3 allocations
    from sched_domain building.

    Replace the NR_CPUS arrays in there with a dynamically allocated
    array.

    Reported-by: Tejun Heo
    Signed-off-by: Peter Zijlstra
    Acked-by: Juri Lelli
    Cc: Johannes Weiner
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/n/tip-kat4gl1m5a6dwy6nzuqox45e@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Michael Kerrisk noticed that creating SCHED_DEADLINE reservations
    with certain parameters (e.g., a runtime of something near 2^64 ns)
    can cause a system freeze for some amount of time.

    The problem is that in the interface we have

    u64 sched_runtime;

    while internally we need to have a signed runtime (to cope with
    budget overruns)

    s64 runtime;

    When we set up a new dl_entity we copy the first value into the
    second. The cast turns out negative when sched_runtime is too big,
    and this causes the scheduler to go crazy right from the start.

    Moreover, considering how we deal with deadlines wraparound

    (s64)(a - b) < 0

    we also have to restrict acceptable values for sched_{deadline,period}.

    This patch fixes the problem by checking that user parameters are
    always below 2^63 ns (still large enough for everyone).

    It also rewrites the other conditions that we check, since in
    __checkparam_dl we don't have to deal with deadline wraparounds,
    and what we have now erroneously fails when the difference between
    values is too big.
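
    The cast in isolation, as a runnable demo (assumes two's-complement
    behavior, as on the kernel's supported targets):

        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
                uint64_t sched_runtime = UINT64_MAX - 1;   /* ~2^64 ns from userspace */
                int64_t runtime = (int64_t)sched_runtime;  /* what dl_entity setup did */

                printf("runtime = %lld\n", (long long)runtime);  /* -2: negative budget */

                /* the fix: refuse anything at or above 2^63 ns up front */
                if (sched_runtime >= (1ULL << 63))
                        printf("parameter rejected\n");
                return 0;
        }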

    Reported-by: Michael Kerrisk
    Suggested-by: Peter Zijlstra
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra
    Cc:
    Cc: Dario Faggioli
    Cc: Dave Jones
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140513141131.20d944f81633ee937f256385@gmail.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
     
  • The way we read POSIX, one should only call sched_getparam() when
    sched_getscheduler() returns either SCHED_FIFO or SCHED_RR.

    Given that we currently return sched_param::sched_priority=0 for all
    others, extend the same behaviour to SCHED_DEADLINE.

    Requested-by: Michael Kerrisk
    Signed-off-by: Peter Zijlstra
    Acked-by: Michael Kerrisk
    Cc: Dario Faggioli
    Cc: linux-man
    Cc: "Michael Kerrisk (man-pages)"
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc:
    Link: http://lkml.kernel.org/r/20140512205034.GH13467@laptop.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The scheduler uses policy=-1 to preserve the current policy state to
    implement sys_sched_setparam(). This got exposed to userspace by
    accident through sys_sched_setattr(); cure this.

    Reported-by: Michael Kerrisk
    Signed-off-by: Peter Zijlstra
    Acked-by: Michael Kerrisk
    Cc:
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140509085311.GJ30445@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The documented[1] behavior of sched_attr() in the proposed man page
    text is:

    sched_attr::size must be set to the size of the structure, as in
    sizeof(struct sched_attr). If the provided structure is smaller
    than the kernel structure, any additional fields are assumed to be
    '0'. If the provided structure is larger than the kernel structure,
    the kernel verifies that all additional fields are '0'; if not, the
    syscall will fail with -E2BIG.

    As currently implemented, sched_copy_attr() returns -EFBIG for this
    case, but the logic in sys_sched_setattr() converts that error to
    -EFAULT. This patch fixes the behavior.

    [1] http://thread.gmane.org/gmane.linux.kernel/1615615/focus=1697760
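
    A hedged sketch of the oversized-structure rule (mirrors the
    documented behavior above, not the exact sched_copy_attr() code):

        #include <errno.h>
        #include <stddef.h>

        /* User struct larger than the kernel's: every extra byte must be
         * zero, otherwise the call fails with -E2BIG (not -EFAULT). */
        static int check_trailing_zeroes(const unsigned char *uattr,
                                         size_t usize, size_t ksize)
        {
                for (size_t i = ksize; i < usize; i++)
                        if (uattr[i] != 0)
                                return -E2BIG;
                return 0;
        }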

    Signed-off-by: Michael Kerrisk
    Signed-off-by: Peter Zijlstra
    Cc:
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/536CEC17.9070903@gmail.com
    Signed-off-by: Ingo Molnar

    Michael Kerrisk
     
  • Linus Torvalds
     
  • Pull two powerpc fixes from Ben Herrenschmidt:
    "Here are a couple of fixes for 3.15. One from Anton fixes a nasty
    regression I introduced when trying to fix a loss of irq_work whose
    consequences is that we can completely lose timer interrupts on a
    CPU... not pretty.

    The other one is a change to our PCIe reset hook to use a firmware
    call instead of direct config space accesses to trigger a fundamental
    reset on the root port. This is necessary so that the FW gets a
    chance to disable the link down error monitoring, which would
    otherwise trip and cause subsequent fatal EEH error"

    * 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
    powerpc: irq work racing with timer interrupt can result in timer interrupt hang
    powerpc/powernv: Reset root port in firmware

    Linus Torvalds