05 Jun, 2014
11 commits
-
It is better not to think about compute capacity as being equivalent
to "CPU power". The upcoming "power aware" scheduler work may create
confusion with the notion of energy consumption if "power" is used too
liberally.

Let's rename the following feature flags since they do relate to capacity:
SD_SHARE_CPUPOWER -> SD_SHARE_CPUCAPACITY
ARCH_POWER -> ARCH_CAPACITY
NONTASK_POWER -> NONTASK_CAPACITY

Signed-off-by: Nicolas Pitre
Signed-off-by: Peter Zijlstra
Cc: Vincent Guittot
Cc: Daniel Lezcano
Cc: Morten Rasmussen
Cc: "Rafael J. Wysocki"
Cc: linaro-kernel@lists.linaro.org
Cc: Andy Fleming
Cc: Anton Blanchard
Cc: Benjamin Herrenschmidt
Cc: Grant Likely
Cc: Linus Torvalds
Cc: Michael Ellerman
Cc: Paul Gortmaker
Cc: Paul Mackerras
Cc: Preeti U Murthy
Cc: Rob Herring
Cc: Srivatsa S. Bhat
Cc: Toshi Kani
Cc: Vasant Hegde
Cc: Vincent Guittot
Cc: devicetree@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/n/tip-e93lpnxb87owfievqatey6b5@git.kernel.org
Signed-off-by: Ingo Molnar -
It is better not to think about compute capacity as being equivalent
to "CPU power". The upcoming "power aware" scheduler work may create
confusion with the notion of energy consumption if "power" is used too
liberally.

This contains the architecture-visible changes. Incidentally, only ARM
takes advantage of the available pow^H^H^Hcapacity scaling hooks and
therefore those changes outside kernel/sched/ are confined to one ARM
specific file. The default arch_scale_smt_power() hook is not overridden
by anyone.

Replacements are as follows:
arch_scale_freq_power --> arch_scale_freq_capacity
arch_scale_smt_power --> arch_scale_smt_capacity
SCHED_POWER_SCALE --> SCHED_CAPACITY_SCALE
SCHED_POWER_SHIFT --> SCHED_CAPACITY_SHIFT

The local usage of "power" in arch/arm/kernel/topology.c is also changed
to "capacity" as appropriate.

Signed-off-by: Nicolas Pitre
Signed-off-by: Peter Zijlstra
Cc: Vincent Guittot
Cc: Daniel Lezcano
Cc: Morten Rasmussen
Cc: "Rafael J. Wysocki"
Cc: linaro-kernel@lists.linaro.org
Cc: Arnd Bergmann
Cc: Dietmar Eggemann
Cc: Grant Likely
Cc: Linus Torvalds
Cc: Mark Brown
Cc: Rob Herring
Cc: Russell King
Cc: Sudeep KarkadaNagesha
Cc: Vincent Guittot
Cc: devicetree@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/n/tip-48zba9qbznvglwelgq2cfygh@git.kernel.org
Signed-off-by: Ingo Molnar -
It is better not to think about compute capacity as being equivalent
to "CPU power". The upcoming "power aware" scheduler work may create
confusion with the notion of energy consumption if "power" is used too
liberally.

This is the remaining "power" -> "capacity" rename for local symbols.
Those symbols visible to the rest of the kernel are not included yet.

Signed-off-by: Nicolas Pitre
Signed-off-by: Peter Zijlstra
Cc: Vincent Guittot
Cc: Daniel Lezcano
Cc: Morten Rasmussen
Cc: "Rafael J. Wysocki"
Cc: linaro-kernel@lists.linaro.org
Cc: Linus Torvalds
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/n/tip-yyyhohzhkwnaotr3lx8zd5aa@git.kernel.org
Signed-off-by: Ingo Molnar -
It is better not to think about compute capacity as being equivalent
to "CPU power". The upcoming "power aware" scheduler work may create
confusion with the notion of energy consumption if "power" is used too
liberally.

Since struct sched_group_power is really about compute capacity of sched
groups, let's rename it to struct sched_group_capacity. Similarly sgp
becomes sgc. Related variables and functions dealing with groups are also
adjusted accordingly.

Signed-off-by: Nicolas Pitre
Signed-off-by: Peter Zijlstra
Cc: Vincent Guittot
Cc: Daniel Lezcano
Cc: Morten Rasmussen
Cc: "Rafael J. Wysocki"
Cc: linaro-kernel@lists.linaro.org
Cc: Linus Torvalds
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/n/tip-5yeix833vvgf2uyj5o36hpu9@git.kernel.org
Signed-off-by: Ingo Molnar -
We have "power" (which should actually become "capacity") and "capacity"
which is a scaled down "capacity factor" in terms of unitary tasks.
Let's use "capacity_factor" to make room for proper usage of "capacity"
later.

Signed-off-by: Nicolas Pitre
Signed-off-by: Peter Zijlstra
Cc: Vincent Guittot
Cc: Daniel Lezcano
Cc: Morten Rasmussen
Cc: "Rafael J. Wysocki"
Cc: linaro-kernel@lists.linaro.org
Cc: Linus Torvalds
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/n/tip-gk1co8sqdev3763opqm6ovml@git.kernel.org
Signed-off-by: Ingo Molnar -
The capacity of a CPU/group should be some intrinsic value that doesn't
change with task placement. It is like a container whose capacity is
stable regardless of the amount of liquid in it (its "utilization")...
unless the container itself is crushed, that is, but that's another story.

Therefore let's rename "has_capacity" to "has_free_capacity" in order to
better convey the intended meaning.

Signed-off-by: Nicolas Pitre
Signed-off-by: Peter Zijlstra
Cc: Vincent Guittot
Cc: Daniel Lezcano
Cc: Morten Rasmussen
Cc: "Rafael J. Wysocki"
Cc: linaro-kernel@lists.linaro.org
Cc: Linus Torvalds
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/n/tip-djzkk027jm0e8x8jxy70opzh@git.kernel.org
Signed-off-by: Ingo Molnar -
It is better not to think about compute capacity as being equivalent
to "CPU power". The upcoming "power aware" scheduler work may create
confusion with the notion of energy consumption if "power" is used too
liberally.

To make things explicit and not create more confusion with the existing
"capacity" member, let's rename things as follows:

power -> compute_capacity
capacity -> task_capacity

Note: none of those fields are actually used outside update_numa_stats().
Signed-off-by: Nicolas Pitre
Signed-off-by: Peter Zijlstra
Cc: Vincent Guittot
Cc: Daniel Lezcano
Cc: Morten Rasmussen
Cc: "Rafael J. Wysocki"
Cc: linaro-kernel@lists.linaro.org
Cc: Linus Torvalds
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/n/tip-2e2ndymj5gyshyjq8am79f20@git.kernel.org
Signed-off-by: Ingo Molnar -
yield_to() is supposed to return -ESRCH if there is no task to
yield to, but because the return type is bool that is the same as
returning true.

The only place I see which cares is kvm_vcpu_on_spin().
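The failure mode is easy to reproduce outside the kernel. A minimal sketch (both helper names are made up for illustration) of why a bool return type swallows -ESRCH:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Sketch of the bug class: a function declared bool cannot return a
 * negative error code, because any non-zero value converts to true. */
static bool broken_yield_to(int have_target)
{
    if (!have_target)
        return -ESRCH;  /* silently becomes true: looks like success */
    return true;
}

/* With an int return type the error code survives to the caller. */
static int fixed_yield_to(int have_target)
{
    if (!have_target)
        return -ESRCH;
    return 1;
}
```

A caller testing the bool version for failure can never see the error.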
Signed-off-by: Dan Carpenter
Reviewed-by: Raghavendra
Signed-off-by: Peter Zijlstra
Cc: Gleb Natapov
Cc: Linus Torvalds
Cc: Paolo Bonzini
Cc: kvm@vger.kernel.org
Link: http://lkml.kernel.org/r/20140523102042.GA7267@mwanda
Signed-off-by: Ingo Molnar -
To be future-proof and for better readability, the time comparisons are modified
to use time_after() instead of plain, error-prone math.

Signed-off-by: Manuel Schölling
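For illustration, here is a simplified userspace version of the time_after() idea (a sketch mirroring the kernel's jiffies helpers, not the kernel code itself): the signed cast keeps the comparison correct across counter wraparound, where plain comparison goes wrong.

```c
#include <assert.h>
#include <limits.h>

/* Simplified time_after(): true if a is after b, even if the
 * unsigned counter has wrapped around in between. */
#define time_after(a, b) ((long)((b) - (a)) < 0)
```

With a deadline armed just before wraparound and a current time just after it, `now > timeout` reports "not yet expired" while time_after() gets it right.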
Signed-off-by: Peter Zijlstra
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/1400780723-24626-1-git-send-email-manuel.schoelling@gmx.de
Signed-off-by: Ingo Molnar -
The current nohz idle load balancer does load balancing for *all* idle cpus,
even though the time due to load balance for a particular
idle cpu could be still a while in the future. This introduces a much
higher load balancing rate than what is necessary. The patch
changes the behavior by doing idle load balancing on
behalf of an idle cpu only when it is due for load balancing.

On SGI's systems with over 3000 cores, the cpu responsible for idle balancing
got overwhelmed with idle balancing, introducing a lot of OS noise
to workloads. This patch fixes the issue.

Signed-off-by: Tim Chen
Acked-by: Russ Anderson
Reviewed-by: Rik van Riel
Reviewed-by: Jason Low
Signed-off-by: Peter Zijlstra
Cc: Andrew Morton
Cc: Len Brown
Cc: Dimitri Sivanich
Cc: Hedi Berriche
Cc: Andi Kleen
Cc: Michel Lespinasse
Cc: Peter Hurley
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/1400621967.2970.280.camel@schen9-DESK
Signed-off-by: Ingo Molnar -
sched_cfs_period_timer() reads cfs_b->period without locks before calling
do_sched_cfs_period_timer(), and similarly unthrottle_offline_cfs_rqs()
would read cfs_b->period without the right lock. Thus a simultaneous
change of bandwidth could cause corruption on any platform where ktime_t
or u64 writes/reads are not atomic.

Extend cfs_b->lock from do_sched_cfs_period_timer() to include the read of
cfs_b->period to solve that issue; unthrottle_offline_cfs_rqs() can just
use 1 rather than the exact quota, much like distribute_cfs_runtime()
does.

There is also an unlocked read of cfs_b->runtime_expires, but a race
there would only delay runtime expiry by a tick. Still, the comparison
should just be != anyway, which clarifies even that problem.

Signed-off-by: Ben Segall
Tested-by: Roman Gushchin
[peterz: Fix compile warn]
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/20140519224945.20303.93530.stgit@sword-of-the-dawn.mtv.corp.google.com
Cc: pjt@google.com
Cc: Linus Torvalds
Signed-off-by: Ingo Molnar
22 May, 2014
29 commits
-
Affine wakeups have the potential to interfere with NUMA placement.
If a task wakes up too many other tasks, affine wakeups will get
disabled.

However, regardless of how many other tasks it wakes up, it gets
re-enabled once a second, potentially interfering with NUMA
placement of other tasks.

By decaying wakee_wakes in half instead of zeroing it, we can avoid
that problem for some workloads.

Signed-off-by: Rik van Riel
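The decay-instead-of-zero idea fits in a couple of lines (the field and function names here are illustrative, not the scheduler's own):

```c
#include <assert.h>

/* Halving the wakeup counter once a second preserves a rough memory
 * of a heavy waker, whereas zeroing would forget it entirely. */
static unsigned int decay_wakee_wakes(unsigned int wakee_wakes)
{
    return wakee_wakes >> 1;  /* decay in half instead of resetting to 0 */
}
```

A task that performed many wakeups stays above the affine-wakeup threshold for several periods instead of being re-enabled immediately.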
Signed-off-by: Peter Zijlstra
Cc: chegu_vinod@hp.com
Cc: umgwanakikbuti@gmail.com
Link: http://lkml.kernel.org/r/20140516001332.67f91af2@annuminas.surriel.com
Signed-off-by: Ingo Molnar -
Update the migrate_improves/degrades_locality() functions with
knowledge of pseudo-interleaving.

Do not consider moving tasks around within the set of group's active
nodes as improving or degrading locality. Instead, leave the load
balancer free to balance the load between a numa_group's active nodes.

Also, switch from the group/task_weight functions to the group/task_fault
functions. The "weight" functions involve a division, but both calls use
the same divisor, so there's no point in doing that from these functions.

On a 4 node (x10 core) system, performance of SPECjbb2005 seems
unaffected, though the number of migrations with 2 8-warehouse wide
instances seems to have almost halved, due to the scheduler running
each instance on a single node.

Signed-off-by: Rik van Riel
Signed-off-by: Peter Zijlstra
Cc: mgorman@suse.de
Cc: chegu_vinod@hp.com
Link: http://lkml.kernel.org/r/20140515130306.61aae7db@cuia.bos.redhat.com
Signed-off-by: Ingo Molnar -
Currently the NUMA balancing code only allows moving tasks between NUMA
nodes when the load on both nodes is in balance. This breaks down when
the load was imbalanced to begin with.

Allow tasks to be moved between NUMA nodes if the imbalance is small,
or if the new imbalance is smaller than the original one.

Suggested-by: Peter Zijlstra
Signed-off-by: Rik van Riel
Signed-off-by: Peter Zijlstra
Cc: mgorman@suse.de
Cc: chegu_vinod@hp.com
Signed-off-by: Ingo Molnar
Link: http://lkml.kernel.org/r/20140514132221.274b3463@annuminas.surriel.com -
… current upstream code
Signed-off-by: xiaofeng.yan <xiaofeng.yan@huawei.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1399605687-18094-1-git-send-email-xiaofeng.yan@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org> -
…o_rlimit() and rlimit_to_nice()
Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/a568a1e3cc8e78648f41b5035fa5e381d36274da.1399532322.git.yangds.fnst@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org> -
If the sched_clock time starts at a large value, the kernel will spin
in sched_avg_update for a long time while rq->age_stamp catches up
with rq->clock.

The comment in kernel/sched/clock.c says that there is no strict promise
that it starts at zero. So initialize rq->age_stamp when a cpu starts up
to avoid this.

I was seeing long delays on a simulator that didn't start the clock at
zero. This might also be an issue on reboots on processors that don't
re-initialize the timer to zero on reset, and when using kexec.

Signed-off-by: Corey Minyard
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/1399574859-11714-1-git-send-email-minyard@acm.org
Signed-off-by: Ingo Molnar -
Sometimes ->nr_running may cross 2 but the interrupt is not being
sent to the rq's cpu. In this case we don't re-enable the timer.
Looks like this may be the reason for rare unexpected effects,
if nohz is enabled.

The patch replaces all places that directly change nr_running
and makes add_nr_running() care about crossing the border.

Signed-off-by: Kirill Tkhai
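The shape of the fix can be sketched as follows (a toy struct, not the kernel's rq; the point is that centralizing the increment makes the threshold crossing observable):

```c
#include <assert.h>

/* Toy runqueue: routing every nr_running increment through one helper
 * lets it reliably notice the 1 -> 2 crossing that requires kicking
 * the nohz tick back on. */
struct toy_rq {
    unsigned int nr_running;
    int tick_rearmed;  /* stand-in for the real re-enable side effect */
};

static void toy_add_nr_running(struct toy_rq *rq, unsigned int count)
{
    unsigned int prev = rq->nr_running;

    rq->nr_running = prev + count;
    if (prev < 2 && rq->nr_running >= 2)
        rq->tick_rearmed = 1;  /* crossing the border: re-enable timer */
}
```

Direct assignments to nr_running scattered around the code can each miss this check; one helper cannot.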
Acked-by: Frederic Weisbecker
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/20140508225830.2469.97461.stgit@localhost
Signed-off-by: Ingo Molnar -
Currently, in idle_balance(), we update rq->next_balance when we pull_tasks.
However, it is also important to update this in the !pulled_tasks case too.

When the CPU is "busy" (the CPU isn't idle), rq->next_balance gets computed
using sd->busy_factor (so we increase the balance interval when the CPU is
busy). However, when the CPU goes idle, rq->next_balance could still be set
to a large value that was computed with the sd->busy_factor.

Thus, we need to also update rq->next_balance in idle_balance() in the cases
where !pulled_tasks too, so that rq->next_balance gets updated without taking
the busy_factor into account when the CPU is about to go idle.

This patch makes rq->next_balance get updated independently of whether or
not we pulled_task. Also, we add logic to ensure that we always traverse
at least 1 of the sched domains to get a proper next_balance value for
updating rq->next_balance.

Additionally, since load_balance() modifies the sd->balance_interval, we
need to re-obtain the sched domain's interval after the call to
load_balance() in rebalance_domains() before we update rq->next_balance.

This patch adds and uses 2 new helper functions, update_next_balance() and
get_sd_balance_interval(), to update next_balance and obtain the sched
domain's balance_interval.

Signed-off-by: Jason Low
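A plausible sketch of what an update_next_balance()-style helper does (the body is inferred from the description above, not copied from the patch): keep the earliest balance deadline seen while walking the sched domains, using wraparound-safe comparison.

```c
#include <assert.h>

/* Keep *next_balance at the earliest pending deadline; the signed
 * subtraction makes the comparison safe across jiffies wraparound. */
static void update_next_balance(unsigned long sd_next_balance,
                                unsigned long *next_balance)
{
    if ((long)(sd_next_balance - *next_balance) < 0)
        *next_balance = sd_next_balance;
}
```

Calling this for every traversed domain yields a next_balance that reflects the soonest-due domain rather than a stale busy_factor-scaled value.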
Reviewed-by: Preeti U Murthy
Signed-off-by: Peter Zijlstra
Cc: daniel.lezcano@linaro.org
Cc: alex.shi@linaro.org
Cc: efault@gmx.de
Cc: vincent.guittot@linaro.org
Cc: morten.rasmussen@arm.com
Cc: aswin@hp.com
Link: http://lkml.kernel.org/r/1399596562.2200.7.camel@j-VirtualBox
Signed-off-by: Ingo Molnar -
Suggested-by: Kees Cook
Signed-off-by: Dongsheng Yang
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/1399541715-19568-1-git-send-email-yangds.fnst@cn.fujitsu.com
Signed-off-by: Ingo Molnar -
There is no need to zero struct sched_group member cpumask and struct
sched_group_power member power since both structures are already allocated
as zeroed memory in __sdt_alloc().

This patch has been tested with
BUG_ON(!cpumask_empty(sched_group_cpus(sg))); and BUG_ON(sg->sgp->power);
in build_sched_groups() on ARM TC2 and INTEL i5 M520 platform including
CPU hotplug scenarios.

Signed-off-by: Dietmar Eggemann
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/1398865178-12577-1-git-send-email-dietmar.eggemann@arm.com
Signed-off-by: Ingo Molnar -
Jet Chen has reported a kernel panic when booting qemu-system-x86_64 with
the kvm64 cpu. The panic occurred while building the sched_domain.

In sched_init_numa, we create a new topology table in which both default
levels and numa levels are copied. The last row of the table must have a null
pointer in the mask field.

The current implementation doesn't add this last row in the computation of the
table size. So we add 1 row in the allocation size that will be used as the
last row of the table. The kzalloc will ensure that the mask field is NULL.

Reported-by: Jet Chen
Tested-by: Jet Chen
Signed-off-by: Vincent Guittot
Signed-off-by: Peter Zijlstra
Cc: fengguang.wu@intel.com
Link: http://lkml.kernel.org/r/1399972261-25693-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar -
On smaller systems, the top level sched domain will be an affine
domain, and select_idle_sibling is invoked for every SD_WAKE_AFFINE
wakeup. This seems to be working well.

On larger systems, with the node distance between far away NUMA nodes
being > RECLAIM_DISTANCE, select_idle_sibling is only called if the
waker and the wakee are on nodes less than RECLAIM_DISTANCE apart.

This patch leaves in place the policy of not pulling the task across
nodes on such systems, while fixing the issue that select_idle_sibling
is not called at all in certain circumstances.

The code will look for an idle CPU in the same CPU package as the
CPU where the task ran previously.

Signed-off-by: Rik van Riel
Signed-off-by: Peter Zijlstra
Cc: morten.rasmussen@arm.com
Cc: george.mccollister@gmail.com
Cc: ktkhai@parallels.com
Cc: Mel Gorman
Cc: Mike Galbraith
Link: http://lkml.kernel.org/r/20140514114037.2d93266f@annuminas.surriel.com
Signed-off-by: Ingo Molnar -
Gotos are chained pointlessly here, and the 'out' label
can be dispensed with.

Signed-off-by: Michael Kerrisk
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/536CEC29.9090503@gmail.com
Signed-off-by: Ingo Molnar -
The logic in this function is a little contorted; clean it up:

* Rather than having chained gotos for the -EFBIG case, just
  return -EFBIG directly.

* Now the label 'out' is no longer needed, and 'ret' must be zero
  by the time we fall through to this point, so just return 0.

Signed-off-by: Michael Kerrisk
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/536CEC24.9080201@gmail.com
Signed-off-by: Ingo Molnar -
task_hot checks exec_start on any runnable task, but if it has been
migrated since it last ran, then exec_start is a clock_task from
another cpu. If the old cpu's clock_task was sufficiently far ahead of
this cpu's then the task will not be considered for another migration
until it has run. Instead, reset exec_start whenever a task is migrated,
since it is presumably no longer hot anyway.

Signed-off-by: Ben Segall
[ Made it compile. ]
Signed-off-by: Peter Zijlstra
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/20140515225920.7179.13924.stgit@sword-of-the-dawn.mtv.corp.google.com
Signed-off-by: Ingo Molnar -
Signed-off-by: Ingo Molnar
-
…l/linux-pm into sched/core
Pull scheduling related CPU idle updates from Rafael J. Wysocki.
Conflicts:
kernel/sched/idle.c

Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Signed-off-by: Ingo Molnar
-
The only idle method for arm64 is WFI and it therefore
unconditionally requires the reschedule interrupt when idle.

Suggested-by: Catalin Marinas
Signed-off-by: Peter Zijlstra
Acked-by: Catalin Marinas
Link: http://lkml.kernel.org/r/20140509170649.GG13658@twins.programming.kicks-ass.net
Signed-off-by: Thomas Gleixner
Signed-off-by: Ingo Molnar -
The Meta idle function jumps into the interrupt handler which
efficiently blocks waiting for the next interrupt when it reads the
interrupt status register (TXSTATI). No other (polling) idle functions
can be used, therefore TIF_POLLING_NRFLAG is unnecessary, so let's remove
it.

Peter Zijlstra said:

> Most archs have (x86) hlt or (arm) wfi like idle instructions, and if
> that is your only possible idle function, you'll require the interrupt
> to wake up and there's really no point to having the POLLING bit.

Signed-off-by: James Hogan
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/536CEB7E.9080007@imgtec.com
Signed-off-by: Thomas Gleixner
Signed-off-by: Ingo Molnar -
Lai found that:
WARNING: CPU: 1 PID: 13 at arch/x86/kernel/smp.c:124 native_smp_send_reschedule+0x2d/0x4b()
...
migration_cpu_stop+0x1d/0x22

was caused by set_cpus_allowed_ptr() assuming that cpu_active_mask is
always a sub-set of cpu_online_mask.

This isn't true since 5fbd036b552f ("sched: Cleanup cpu_active madness").
So set active and online at the same time to avoid this particular
problem.

Fixes: 5fbd036b552f ("sched: Cleanup cpu_active madness")
Signed-off-by: Lai Jiangshan
Signed-off-by: Peter Zijlstra
Cc: Andrew Morton
Cc: Gautham R. Shenoy
Cc: Linus Torvalds
Cc: Michael wang
Cc: Paul Gortmaker
Cc: Rafael J. Wysocki
Cc: Srivatsa S. Bhat
Cc: Toshi Kani
Link: http://lkml.kernel.org/r/53758B12.8060609@cn.fujitsu.com
Signed-off-by: Ingo Molnar -
Tejun reported that his resume was failing due to order-3 allocations
from sched_domain building.

Replace the NR_CPUS arrays in there with a dynamically allocated
array.

Reported-by: Tejun Heo
Signed-off-by: Peter Zijlstra
Cc: Johannes Weiner
Cc: Steven Rostedt
Cc: Linus Torvalds
Link: http://lkml.kernel.org/n/tip-7cysnkw1gik45r864t1nkudh@git.kernel.org
Signed-off-by: Ingo Molnar -
Tejun reported that his resume was failing due to order-3 allocations
from sched_domain building.

Replace the NR_CPUS arrays in there with a dynamically allocated
array.

Reported-by: Tejun Heo
Signed-off-by: Peter Zijlstra
Acked-by: Juri Lelli
Cc: Johannes Weiner
Cc: Linus Torvalds
Link: http://lkml.kernel.org/n/tip-kat4gl1m5a6dwy6nzuqox45e@git.kernel.org
Signed-off-by: Ingo Molnar -
Michael Kerrisk noticed that creating SCHED_DEADLINE reservations
with certain parameters (e.g, a runtime of something near 2^64 ns)
can cause a system freeze for some amount of time.

The problem is that in the interface we have

u64 sched_runtime;

while internally we need to have a signed runtime (to cope with
budget overruns):

s64 runtime;

At the time we set up a new dl_entity we copy the first value into
the second. The cast produces negative values when sched_runtime
is too big, and this causes the scheduler to go crazy right from
the start.

Moreover, considering how we deal with deadline wraparound:

(s64)(a - b) < 0

we also have to restrict acceptable values for sched_{deadline,period}.
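The overflow is straightforward to demonstrate in userspace (a sketch with made-up helper names; the kernel's fields and checks differ in detail):

```c
#include <assert.h>
#include <stdint.h>

/* The unchecked copy: a u64 runtime near 2^64 lands in the signed
 * internal field as a negative number. */
static int64_t copy_runtime_unchecked(uint64_t sched_runtime)
{
    return (int64_t)sched_runtime;
}

/* The fix's idea: reject user-supplied values at or above 2^63 ns
 * before they ever reach the signed internal representation. */
static int dl_param_in_range(uint64_t value)
{
    return value < (1ULL << 63);
}
```

Rejecting out-of-range values up front also keeps the (s64)(a - b) wraparound comparisons meaningful, since both operands stay well below 2^63.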
This patch fixes things by checking that user parameters are always
below 2^63 ns (still large enough for everyone).

It also rewrites other conditions that we check, since in
__checkparam_dl we don't have to deal with deadline wraparounds
and what we have now erroneously fails when the difference between
values is too big.

Reported-by: Michael Kerrisk
Suggested-by: Peter Zijlstra
Signed-off-by: Juri Lelli
Signed-off-by: Peter Zijlstra
Cc:
Cc: Dario Faggioli
Cc: Dave Jones
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/20140513141131.20d944f81633ee937f256385@gmail.com
Signed-off-by: Ingo Molnar -
The way we read POSIX, one should only call sched_getparam() when
sched_getscheduler() returns either SCHED_FIFO or SCHED_RR.

Given that we currently return sched_param::sched_priority=0 for all
others, extend the same behaviour to SCHED_DEADLINE.

Requested-by: Michael Kerrisk
Signed-off-by: Peter Zijlstra
Acked-by: Michael Kerrisk
Cc: Dario Faggioli
Cc: linux-man
Cc: "Michael Kerrisk (man-pages)"
Cc: Juri Lelli
Cc: Linus Torvalds
Cc:
Link: http://lkml.kernel.org/r/20140512205034.GH13467@laptop.programming.kicks-ass.net
Signed-off-by: Ingo Molnar -
The scheduler uses policy=-1 to preserve the current policy state to
implement sys_sched_setparam(), this got exposed to userspace by
accident through sys_sched_setattr(), cure this.

Reported-by: Michael Kerrisk
Signed-off-by: Peter Zijlstra
Acked-by: Michael Kerrisk
Cc:
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/20140509085311.GJ30445@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar -
The documented[1] behavior of sched_attr() in the proposed man page text is:

sched_attr::size must be set to the size of the structure, as in
sizeof(struct sched_attr). If the provided structure is smaller
than the kernel structure, any additional fields are assumed
to be '0'. If the provided structure is larger than the kernel
structure, the kernel verifies that all additional fields are '0';
if not, the syscall will fail with -E2BIG.

As currently implemented, sched_copy_attr() returns -EFBIG
for this case, but the logic in sys_sched_setattr() converts that
error to -EFAULT. This patch fixes the behavior.

[1] http://thread.gmane.org/gmane.linux.kernel/1615615/focus=1697760
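The documented contract boils down to a trailing-zero check, sketched here in userspace (an illustrative error constant stands in for -E2BIG; this is not the kernel's sched_copy_attr()):

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_E2BIG 7  /* illustrative stand-in for the errno value */

/* If userspace hands in a struct larger than the kernel knows about,
 * every byte beyond the known size must be zero; otherwise -E2BIG. */
static int check_extra_bytes(const unsigned char *ubuf,
                             size_t usize, size_t ksize)
{
    size_t i;

    for (i = ksize; i < usize; i++)
        if (ubuf[i] != 0)
            return -SKETCH_E2BIG;
    return 0;  /* extra fields are all '0': accept */
}
```

This lets an old kernel safely accept structs from newer userspace as long as the fields it doesn't understand are unset.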
Signed-off-by: Michael Kerrisk
Signed-off-by: Peter Zijlstra
Cc:
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/536CEC17.9070903@gmail.com
Signed-off-by: Ingo Molnar -
Pull two powerpc fixes from Ben Herrenschmidt:
"Here are a couple of fixes for 3.15. One from Anton fixes a nasty
regression I introduced when trying to fix a loss of irq_work whose
consequences is that we can completely lose timer interrupts on a
CPU... not pretty.

The other one is a change to our PCIe reset hook to use a firmware
call instead of direct config space accesses to trigger a fundamental
reset on the root port. This is necessary so that the FW gets a
chance to disable the link down error monitoring, which would
otherwise trip and cause subsequent fatal EEH error"

* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
powerpc: irq work racing with timer interrupt can result in timer interrupt hang
powerpc/powernv: Reset root port in firmware