05 Jun, 2014

11 commits

  • It is better not to think about compute capacity as being equivalent
    to "CPU power". The upcoming "power aware" scheduler work may create
    confusion with the notion of energy consumption if "power" is used too
    liberally.

    Let's rename the following feature flags since they do relate to capacity:

    SD_SHARE_CPUPOWER -> SD_SHARE_CPUCAPACITY
    ARCH_POWER -> ARCH_CAPACITY
    NONTASK_POWER -> NONTASK_CAPACITY

    Signed-off-by: Nicolas Pitre
    Signed-off-by: Peter Zijlstra
    Cc: Vincent Guittot
    Cc: Daniel Lezcano
    Cc: Morten Rasmussen
    Cc: "Rafael J. Wysocki"
    Cc: linaro-kernel@lists.linaro.org
    Cc: Andy Fleming
    Cc: Anton Blanchard
    Cc: Benjamin Herrenschmidt
    Cc: Grant Likely
    Cc: Linus Torvalds
    Cc: Michael Ellerman
    Cc: Paul Gortmaker
    Cc: Paul Mackerras
    Cc: Preeti U Murthy
    Cc: Rob Herring
    Cc: Srivatsa S. Bhat
    Cc: Toshi Kani
    Cc: Vasant Hegde
    Cc: Vincent Guittot
    Cc: devicetree@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Link: http://lkml.kernel.org/n/tip-e93lpnxb87owfievqatey6b5@git.kernel.org
    Signed-off-by: Ingo Molnar

    Nicolas Pitre
     
  • It is better not to think about compute capacity as being equivalent
    to "CPU power". The upcoming "power aware" scheduler work may create
    confusion with the notion of energy consumption if "power" is used too
    liberally.

    This contains the architecture-visible changes. Incidentally, only
    ARM takes advantage of the available pow^H^H^Hcapacity scaling hooks
    and therefore those changes outside kernel/sched/ are confined to
    one ARM-specific file. The default arch_scale_smt_power() hook is
    not overridden by anyone.

    Replacements are as follows:

    arch_scale_freq_power --> arch_scale_freq_capacity
    arch_scale_smt_power --> arch_scale_smt_capacity
    SCHED_POWER_SCALE --> SCHED_CAPACITY_SCALE
    SCHED_POWER_SHIFT --> SCHED_CAPACITY_SHIFT

    The local usage of "power" in arch/arm/kernel/topology.c is also changed
    to "capacity" as appropriate.

    Signed-off-by: Nicolas Pitre
    Signed-off-by: Peter Zijlstra
    Cc: Vincent Guittot
    Cc: Daniel Lezcano
    Cc: Morten Rasmussen
    Cc: "Rafael J. Wysocki"
    Cc: linaro-kernel@lists.linaro.org
    Cc: Arnd Bergmann
    Cc: Dietmar Eggemann
    Cc: Grant Likely
    Cc: Linus Torvalds
    Cc: Mark Brown
    Cc: Rob Herring
    Cc: Russell King
    Cc: Sudeep KarkadaNagesha
    Cc: Vincent Guittot
    Cc: devicetree@vger.kernel.org
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: linux-kernel@vger.kernel.org
    Link: http://lkml.kernel.org/n/tip-48zba9qbznvglwelgq2cfygh@git.kernel.org
    Signed-off-by: Ingo Molnar

    Nicolas Pitre
     
  • It is better not to think about compute capacity as being equivalent
    to "CPU power". The upcoming "power aware" scheduler work may create
    confusion with the notion of energy consumption if "power" is used too
    liberally.

    This is the remaining "power" -> "capacity" rename for local symbols.
    Those symbols visible to the rest of the kernel are not included yet.

    Signed-off-by: Nicolas Pitre
    Signed-off-by: Peter Zijlstra
    Cc: Vincent Guittot
    Cc: Daniel Lezcano
    Cc: Morten Rasmussen
    Cc: "Rafael J. Wysocki"
    Cc: linaro-kernel@lists.linaro.org
    Cc: Linus Torvalds
    Cc: linux-kernel@vger.kernel.org
    Link: http://lkml.kernel.org/n/tip-yyyhohzhkwnaotr3lx8zd5aa@git.kernel.org
    Signed-off-by: Ingo Molnar

    Nicolas Pitre
     
  • It is better not to think about compute capacity as being equivalent
    to "CPU power". The upcoming "power aware" scheduler work may create
    confusion with the notion of energy consumption if "power" is used too
    liberally.

    Since struct sched_group_power is really about compute capacity of sched
    groups, let's rename it to struct sched_group_capacity. Similarly sgp
    becomes sgc. Related variables and functions dealing with groups are also
    adjusted accordingly.

    Signed-off-by: Nicolas Pitre
    Signed-off-by: Peter Zijlstra
    Cc: Vincent Guittot
    Cc: Daniel Lezcano
    Cc: Morten Rasmussen
    Cc: "Rafael J. Wysocki"
    Cc: linaro-kernel@lists.linaro.org
    Cc: Linus Torvalds
    Cc: linux-kernel@vger.kernel.org
    Link: http://lkml.kernel.org/n/tip-5yeix833vvgf2uyj5o36hpu9@git.kernel.org
    Signed-off-by: Ingo Molnar

    Nicolas Pitre
     
  • We have "power" (which should actually become "capacity") and
    "capacity", which is a scaled-down "capacity factor" in terms of
    unitary tasks. Let's use "capacity_factor" to make room for proper
    usage of "capacity" later.

    Signed-off-by: Nicolas Pitre
    Signed-off-by: Peter Zijlstra
    Cc: Vincent Guittot
    Cc: Daniel Lezcano
    Cc: Morten Rasmussen
    Cc: "Rafael J. Wysocki"
    Cc: linaro-kernel@lists.linaro.org
    Cc: Linus Torvalds
    Cc: linux-kernel@vger.kernel.org
    Link: http://lkml.kernel.org/n/tip-gk1co8sqdev3763opqm6ovml@git.kernel.org
    Signed-off-by: Ingo Molnar

    Nicolas Pitre
     
  • The capacity of a CPU/group should be some intrinsic value that
    doesn't change with task placement. It is like a container whose
    capacity is stable regardless of the amount of liquid in it (its
    "utilization")... unless the container itself is crushed, that is,
    but that's another story.

    Therefore let's rename "has_capacity" to "has_free_capacity" in order to
    better convey the wanted meaning.

    Signed-off-by: Nicolas Pitre
    Signed-off-by: Peter Zijlstra
    Cc: Vincent Guittot
    Cc: Daniel Lezcano
    Cc: Morten Rasmussen
    Cc: "Rafael J. Wysocki"
    Cc: linaro-kernel@lists.linaro.org
    Cc: Linus Torvalds
    Cc: linux-kernel@vger.kernel.org
    Link: http://lkml.kernel.org/n/tip-djzkk027jm0e8x8jxy70opzh@git.kernel.org
    Signed-off-by: Ingo Molnar

    Nicolas Pitre
     
  • It is better not to think about compute capacity as being equivalent
    to "CPU power". The upcoming "power aware" scheduler work may create
    confusion with the notion of energy consumption if "power" is used too
    liberally.

    To make things explicit and not create more confusion with the existing
    "capacity" member, let's rename things as follows:

    power -> compute_capacity
    capacity -> task_capacity

    Note: none of those fields are actually used outside update_numa_stats().

    Signed-off-by: Nicolas Pitre
    Signed-off-by: Peter Zijlstra
    Cc: Vincent Guittot
    Cc: Daniel Lezcano
    Cc: Morten Rasmussen
    Cc: "Rafael J. Wysocki"
    Cc: linaro-kernel@lists.linaro.org
    Cc: Linus Torvalds
    Cc: linux-kernel@vger.kernel.org
    Link: http://lkml.kernel.org/n/tip-2e2ndymj5gyshyjq8am79f20@git.kernel.org
    Signed-off-by: Ingo Molnar

    Nicolas Pitre
     
  • yield_to() is supposed to return -ESRCH if there is no task to
    yield to, but because the type is bool that is the same as returning
    true.

    The only place I see which cares is kvm_vcpu_on_spin().
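
    The pitfall in isolation, as a small user-space C demo (this is not
    the kernel code, only the type conversion): -ESRCH is non-zero, and
    converting any non-zero value to bool yields true.

        #include <errno.h>
        #include <stdbool.h>
        #include <stdio.h>

        static bool broken_yield_to(void)
        {
                return -ESRCH;  /* meant as an error, but bool collapses it to true */
        }

        int main(void)
        {
                /* prints 1: callers cannot tell this "error" from success */
                printf("%d\n", broken_yield_to());
                return 0;
        }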

    Signed-off-by: Dan Carpenter
    Reviewed-by: Raghavendra
    Signed-off-by: Peter Zijlstra
    Cc: Gleb Natapov
    Cc: Linus Torvalds
    Cc: Paolo Bonzini
    Cc: kvm@vger.kernel.org
    Link: http://lkml.kernel.org/r/20140523102042.GA7267@mwanda
    Signed-off-by: Ingo Molnar

    Dan Carpenter
     
  • To be future-proof and for better readability, the time comparisons
    are modified to use time_after() instead of plain, error-prone math.
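
    As a rough user-space model of what time_after() buys (the kernel
    macro lives in include/linux/jiffies.h; this simplified copy keeps
    only the signed-difference trick):

        #include <stdio.h>

        #define time_after(a, b)  ((long)((b) - (a)) < 0)

        int main(void)
        {
                unsigned long b = -10UL;  /* just before wraparound */
                unsigned long a = b + 20; /* wraps to a small value */

                printf("plain a > b     : %d\n", a > b);            /* 0: wrong */
                printf("time_after(a, b): %d\n", time_after(a, b)); /* 1: right */
                return 0;
        }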

    Signed-off-by: Manuel Schölling
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1400780723-24626-1-git-send-email-manuel.schoelling@gmx.de
    Signed-off-by: Ingo Molnar

    Manuel Schölling
     
  • The current nohz idle load balancer does load balancing for *all*
    idle cpus, even though the next load balance for a particular idle
    cpu could still be a while in the future. This introduces a much
    higher load balancing rate than necessary. The patch changes the
    behavior so that idle load balancing is done on behalf of an idle
    cpu only when it is due for load balancing.

    On SGI's systems with over 3000 cores, the cpu responsible for idle
    balancing got overwhelmed with idle balancing and introduced a lot
    of OS noise to workloads. This patch fixes the issue.

    Signed-off-by: Tim Chen
    Acked-by: Russ Anderson
    Reviewed-by: Rik van Riel
    Reviewed-by: Jason Low
    Signed-off-by: Peter Zijlstra
    Cc: Andrew Morton
    Cc: Len Brown
    Cc: Dimitri Sivanich
    Cc: Hedi Berriche
    Cc: Andi Kleen
    Cc: Michel Lespinasse
    Cc: Peter Hurley
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1400621967.2970.280.camel@schen9-DESK
    Signed-off-by: Ingo Molnar

    Tim Chen
     
  • sched_cfs_period_timer() reads cfs_b->period without locks before calling
    do_sched_cfs_period_timer(), and similarly unthrottle_offline_cfs_rqs()
    would read cfs_b->period without the right lock. Thus a simultaneous
    change of bandwidth could cause corruption on any platform where ktime_t
    or u64 writes/reads are not atomic.

    Extend cfs_b->lock from do_sched_cfs_period_timer() to include the read of
    cfs_b->period to solve that issue; unthrottle_offline_cfs_rqs() can just
    use 1 rather than the exact quota, much like distribute_cfs_runtime()
    does.

    There is also an unlocked read of cfs_b->runtime_expires, but a race
    there would only delay runtime expiry by a tick. Still, the comparison
    should just be != anyway, which clarifies even that problem.
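
    The locking pattern, sketched in user-space terms only (a pthread
    mutex stands in for cfs_b->lock, and a plain struct for the
    bandwidth state; this is an analogy, not the kernel code):

        #include <pthread.h>
        #include <stdint.h>

        struct bandwidth {
                pthread_mutex_t lock;
                uint64_t period;  /* written under 'lock' when bandwidth changes */
        };

        /* Before: a torn value is possible where u64 loads aren't atomic */
        static uint64_t read_period_racy(struct bandwidth *b)
        {
                return b->period;
        }

        /* After: the read is covered by the same lock the writers hold */
        static uint64_t read_period_locked(struct bandwidth *b)
        {
                pthread_mutex_lock(&b->lock);
                uint64_t p = b->period;
                pthread_mutex_unlock(&b->lock);
                return p;
        }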

    Signed-off-by: Ben Segall
    Tested-by: Roman Gushchin
    [peterz: Fix compile warn]
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20140519224945.20303.93530.stgit@sword-of-the-dawn.mtv.corp.google.com
    Cc: pjt@google.com
    Cc: Linus Torvalds
    Signed-off-by: Ingo Molnar

    Ben Segall
     

22 May, 2014

29 commits

  • Affine wakeups have the potential to interfere with NUMA placement.
    If a task wakes up too many other tasks, affine wakeups will get
    disabled.

    However, regardless of how many other tasks it wakes up, it gets
    re-enabled once a second, potentially interfering with NUMA
    placement of other tasks.

    By decaying wakee_wakes in half instead of zeroing it, we can avoid
    that problem for some workloads.
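
    A runnable toy showing the difference between zeroing and halving
    (the field name follows the changelog; the real code lives in the
    scheduler's wakeup path):

        #include <stdio.h>

        int main(void)
        {
                unsigned int wakee_wakes = 100;  /* a task that woke many others */

                /* old behavior: wakee_wakes = 0; history vanishes instantly */

                /* new behavior: decay in half once a second */
                for (int sec = 1; sec <= 5; sec++) {
                        wakee_wakes >>= 1;
                        printf("second %d: wakee_wakes = %u\n", sec, wakee_wakes);
                }
                /* 50, 25, 12, 6, 3: heavy wakers fade out gradually, so
                   affine wakeups are not abruptly re-enabled every second */
                return 0;
        }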

    Signed-off-by: Rik van Riel
    Signed-off-by: Peter Zijlstra
    Cc: chegu_vinod@hp.com
    Cc: umgwanakikbuti@gmail.com
    Link: http://lkml.kernel.org/r/20140516001332.67f91af2@annuminas.surriel.com
    Signed-off-by: Ingo Molnar

    Rik van Riel
     
  • Update the migrate_improves/degrades_locality() functions with
    knowledge of pseudo-interleaving.

    Do not consider moving tasks around within the set of a group's
    active nodes as improving or degrading locality. Instead, leave the
    load balancer free to balance the load between a numa_group's
    active nodes.

    Also, switch from the group/task_weight functions to the group/task_fault
    functions. The "weight" functions involve a division, but both calls use
    the same divisor, so there's no point in doing that from these functions.

    On a 4 node (x10 core) system, performance of SPECjbb2005 seems
    unaffected, though the number of migrations with 2 8-warehouse wide
    instances seems to have almost halved, due to the scheduler running
    each instance on a single node.

    Signed-off-by: Rik van Riel
    Signed-off-by: Peter Zijlstra
    Cc: mgorman@suse.de
    Cc: chegu_vinod@hp.com
    Link: http://lkml.kernel.org/r/20140515130306.61aae7db@cuia.bos.redhat.com
    Signed-off-by: Ingo Molnar

    Rik van Riel
     
  • Currently the NUMA balancing code only allows moving tasks between NUMA
    nodes when the load on both nodes is in balance. This breaks down when
    the load was imbalanced to begin with.

    Allow tasks to be moved between NUMA nodes if the imbalance is
    small, or if the new imbalance is smaller than the original one.
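
    A hedged sketch of that acceptance rule; the function name and the
    small-imbalance threshold parameter are illustrative, not the
    kernel's exact code:

        /* Accept a NUMA move if the resulting imbalance is small,
         * or at least no worse than the imbalance we started with. */
        static int numa_move_acceptable(long src_load, long dst_load,
                                        long task_load, long small_imbalance)
        {
                long old_imb = src_load - dst_load;
                long new_imb = (src_load - task_load) - (dst_load + task_load);

                if (old_imb < 0)
                        old_imb = -old_imb;
                if (new_imb < 0)
                        new_imb = -new_imb;

                return new_imb <= small_imbalance || new_imb <= old_imb;
        }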

    Suggested-by: Peter Zijlstra
    Signed-off-by: Rik van Riel
    Signed-off-by: Peter Zijlstra
    Cc: mgorman@suse.de
    Cc: chegu_vinod@hp.com
    Signed-off-by: Ingo Molnar
    Link: http://lkml.kernel.org/r/20140514132221.274b3463@annuminas.surriel.com

    Rik van Riel
     
  • … current upstream code

    Signed-off-by: xiaofeng.yan <xiaofeng.yan@huawei.com>
    Signed-off-by: Peter Zijlstra <peterz@infradead.org>
    Link: http://lkml.kernel.org/r/1399605687-18094-1-git-send-email-xiaofeng.yan@huawei.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    xiaofeng.yan
     
  • …o_rlimit() and rlimit_to_nice()

    Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
    Signed-off-by: Peter Zijlstra <peterz@infradead.org>
    Link: http://lkml.kernel.org/r/a568a1e3cc8e78648f41b5035fa5e381d36274da.1399532322.git.yangds.fnst@cn.fujitsu.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Dongsheng Yang
     
  • If the sched_clock time starts at a large value, the kernel will spin
    in sched_avg_update for a long time while rq->age_stamp catches up
    with rq->clock.

    The comment in kernel/sched/clock.c says that there is no strict promise
    that it starts at zero. So initialize rq->age_stamp when a cpu starts up
    to avoid this.

    I was seeing long delays on a simulator that didn't start the clock at
    zero. This might also be an issue on reboots on processors that don't
    re-initialize the timer to zero on reset, and when using kexec.
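
    A user-space model of the symptom (sched_avg_update() advances
    age_stamp one period at a time until it catches the clock; the
    numbers here are illustrative):

        #include <stdio.h>

        #define PERIOD 1000000ULL               /* ns per update period */

        int main(void)
        {
                unsigned long long clock = 1ULL << 40;  /* clock starts large */
                unsigned long long age_stamp = 0;       /* never initialized */
                unsigned long long spins = 0;

                while (clock - age_stamp > PERIOD) {    /* catch-up loop */
                        age_stamp += PERIOD;
                        spins++;
                }
                printf("catch-up iterations: %llu\n", spins);  /* about a million */

                /* the fix: on cpu-up, start from the current clock instead */
                age_stamp = clock;
                return 0;
        }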

    Signed-off-by: Corey Minyard
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1399574859-11714-1-git-send-email-minyard@acm.org
    Signed-off-by: Ingo Molnar

    Corey Minyard
     
  • Sometimes ->nr_running may cross 2 while no interrupt is sent to the
    rq's cpu, in which case we don't reenable the timer. This looks like
    a possible cause of rare unexpected effects when nohz is enabled.

    This patch replaces all direct modifications of nr_running with
    add_nr_running(), which takes care of the border crossing.
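
    A hedged sketch of the helper's shape (simplified; the real version
    also deals with nohz-full specifics):

        struct rq_model {
                unsigned int nr_running;
                int cpu;
        };

        /* Every nr_running increment funnels through here, so the
         * 1 -> 2 crossing can no longer be missed. */
        static void add_nr_running(struct rq_model *rq, unsigned int count)
        {
                unsigned int prev = rq->nr_running;

                rq->nr_running = prev + count;

                if (prev < 2 && rq->nr_running >= 2) {
                        /* border crossed: kick rq->cpu so the tick
                         * gets re-enabled (e.g. a reschedule IPI) */
                }
        }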

    Signed-off-by: Kirill Tkhai
    Acked-by: Frederic Weisbecker
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20140508225830.2469.97461.stgit@localhost
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
  • Currently, in idle_balance(), we update rq->next_balance when we
    pull tasks. However, it is also important to update this in the case
    where no tasks are pulled.

    When the CPU is "busy" (the CPU isn't idle), rq->next_balance gets
    computed using sd->busy_factor (so we increase the balance interval
    when the CPU is busy). However, when the CPU goes idle,
    rq->next_balance could still be set to a large value that was
    computed with the sd->busy_factor.

    Thus, we also need to update rq->next_balance in idle_balance() when
    no tasks were pulled, so that rq->next_balance gets updated without
    taking the busy_factor into account when the CPU is about to go
    idle.

    This patch makes rq->next_balance get updated independently of
    whether or not we pulled a task. Also, we add logic to ensure that
    we always traverse at least 1 of the sched domains to get a proper
    next_balance value for updating rq->next_balance.

    Additionally, since load_balance() modifies the sd->balance_interval, we
    need to re-obtain the sched domain's interval after the call to
    load_balance() in rebalance_domains() before we update rq->next_balance.

    This patch adds and uses 2 new helper functions, update_next_balance() and
    get_sd_balance_interval() to update next_balance and obtain the sched
    domain's balance_interval.
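
    A hedged sketch of what the two helpers are for; the signatures and
    types here are illustrative, not the kernel's:

        /* Keep the earliest next_balance seen while walking the domains */
        static void update_next_balance(unsigned long sd_next_balance,
                                        unsigned long *next_balance)
        {
                if ((long)(sd_next_balance - *next_balance) < 0)
                        *next_balance = sd_next_balance;
        }

        /* Re-read the interval after load_balance() may have changed it,
         * and only scale by busy_factor when the cpu isn't going idle */
        static unsigned long get_sd_balance_interval(unsigned long interval,
                                                     unsigned int busy_factor,
                                                     int cpu_busy)
        {
                return cpu_busy ? interval * busy_factor : interval;
        }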

    Signed-off-by: Jason Low
    Reviewed-by: Preeti U Murthy
    Signed-off-by: Peter Zijlstra
    Cc: daniel.lezcano@linaro.org
    Cc: alex.shi@linaro.org
    Cc: efault@gmx.de
    Cc: vincent.guittot@linaro.org
    Cc: morten.rasmussen@arm.com
    Cc: aswin@hp.com
    Link: http://lkml.kernel.org/r/1399596562.2200.7.camel@j-VirtualBox
    Signed-off-by: Ingo Molnar

    Jason Low
     
  • Suggested-by: Kees Cook
    Signed-off-by: Dongsheng Yang
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1399541715-19568-1-git-send-email-yangds.fnst@cn.fujitsu.com
    Signed-off-by: Ingo Molnar

    Dongsheng Yang
     
  • There is no need to zero struct sched_group member cpumask and struct
    sched_group_power member power since both structures are already allocated
    as zeroed memory in __sdt_alloc().

    This patch has been tested with
    BUG_ON(!cpumask_empty(sched_group_cpus(sg))); and BUG_ON(sg->sgp->power);
    in build_sched_groups() on ARM TC2 and Intel i5 M520 platforms,
    including CPU hotplug scenarios.

    Signed-off-by: Dietmar Eggemann
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1398865178-12577-1-git-send-email-dietmar.eggemann@arm.com
    Signed-off-by: Ingo Molnar

    Dietmar Eggemann
     
  • Jet Chen reported a kernel panic when booting qemu-system-x86_64
    with a kvm64 cpu. The panic occurred while building the
    sched_domains.

    In sched_init_numa, we create a new topology table in which both
    default levels and numa levels are copied. The last row of the table
    must have a NULL pointer in the mask field.

    The current implementation doesn't account for this last row when
    computing the table size. So add 1 row to the allocation size; it
    will be used as the last row of the table, and the kzalloc ensures
    that its mask field is NULL.

    Reported-by: Jet Chen
    Tested-by: Jet Chen
    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra
    Cc: fengguang.wu@intel.com
    Link: http://lkml.kernel.org/r/1399972261-25693-1-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar

    Vincent Guittot
     
  • On smaller systems, the top level sched domain will be an affine
    domain, and select_idle_sibling is invoked for every SD_WAKE_AFFINE
    wakeup. This seems to be working well.

    On larger systems, with the node distance between far away NUMA nodes
    being > RECLAIM_DISTANCE, select_idle_sibling is only called if the
    waker and the wakee are on nodes less than RECLAIM_DISTANCE apart.

    This patch leaves in place the policy of not pulling the task across
    nodes on such systems, while fixing the issue that select_idle_sibling
    is not called at all in certain circumstances.

    The code will look for an idle CPU in the same CPU package as the
    CPU where the task ran previously.

    Signed-off-by: Rik van Riel
    Signed-off-by: Peter Zijlstra
    Cc: morten.rasmussen@arm.com
    Cc: george.mccollister@gmail.com
    Cc: ktkhai@parallels.com
    Cc: Mel Gorman
    Cc: Mike Galbraith
    Link: http://lkml.kernel.org/r/20140514114037.2d93266f@annuminas.surriel.com
    Signed-off-by: Ingo Molnar

    Rik van Riel
     
  • Gotos are chained pointlessly here, and the 'out' label
    can be dispensed with.

    Signed-off-by: Michael Kerrisk
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/536CEC29.9090503@gmail.com
    Signed-off-by: Ingo Molnar

    Michael Kerrisk
     
  • The logic in this function is a little contorted; clean it up:

    * Rather than having chained gotos for the -EFBIG case, just
    return -EFBIG directly.

    * Now, the label 'out' is no longer needed, and 'ret' must be zero
    by the time we fall through to this point, so just return 0.

    Signed-off-by: Michael Kerrisk
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/536CEC24.9080201@gmail.com
    Signed-off-by: Ingo Molnar

    Michael Kerrisk
     
  • task_hot() checks exec_start on any runnable task, but if the task
    has been migrated since it last ran, then exec_start is a clock_task
    value from another cpu. If the old cpu's clock_task was sufficiently
    far ahead of this cpu's, the task will not be considered for another
    migration until it has run. Instead, reset exec_start whenever a
    task is migrated, since it is presumably no longer hot anyway.
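
    Both halves of the problem in a hedged sketch (names and the
    migration-cost constant are illustrative):

        /* task_hot()-style check: ran recently => cache still warm */
        static int task_hot(long long now, long long exec_start,
                            long long migration_cost)
        {
                return now - exec_start < migration_cost;
        }

        /* The fix, conceptually: on migration, drop the stale timestamp,
         *
         *     p->se.exec_start = 0;
         *
         * so the first check on the new cpu sees a huge delta and the
         * task is never spuriously kept "hot" by another cpu's clock. */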

    Signed-off-by: Ben Segall
    [ Made it compile. ]
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140515225920.7179.13924.stgit@sword-of-the-dawn.mtv.corp.google.com
    Signed-off-by: Ingo Molnar

    Ben Segall
     
  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • …l/linux-pm into sched/core

    Pull scheduling related CPU idle updates from Rafael J. Wysocki.

    Conflicts:
    kernel/sched/idle.c

    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     
  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • The only idle method for arm64 is WFI and it therefore
    unconditionally requires the reschedule interrupt when idle.

    Suggested-by: Catalin Marinas
    Signed-off-by: Peter Zijlstra
    Acked-by: Catalin Marinas
    Link: http://lkml.kernel.org/r/20140509170649.GG13658@twins.programming.kicks-ass.net
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The Meta idle function jumps into the interrupt handler, which
    efficiently blocks waiting for the next interrupt when it reads the
    interrupt status register (TXSTATI). No other (polling) idle
    functions can be used; therefore TIF_POLLING_NRFLAG is unnecessary,
    so let's remove it.

    Peter Zijlstra said:
    > Most archs have (x86) hlt or (arm) wfi like idle instructions, and if
    > that is your only possible idle function, you'll require the interrupt
    > to wake up and there's really no point to having the POLLING bit.

    Signed-off-by: James Hogan
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/536CEB7E.9080007@imgtec.com
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    James Hogan
     
  • Lai found that:

    WARNING: CPU: 1 PID: 13 at arch/x86/kernel/smp.c:124 native_smp_send_reschedule+0x2d/0x4b()
    ...
    migration_cpu_stop+0x1d/0x22

    was caused by set_cpus_allowed_ptr() assuming that cpu_active_mask is
    always a sub-set of cpu_online_mask.

    This isn't true since 5fbd036b552f ("sched: Cleanup cpu_active madness").

    So set active and online at the same time to avoid this particular
    problem.

    Fixes: 5fbd036b552f ("sched: Cleanup cpu_active madness")
    Signed-off-by: Lai Jiangshan
    Signed-off-by: Peter Zijlstra
    Cc: Andrew Morton
    Cc: Gautham R. Shenoy
    Cc: Linus Torvalds
    Cc: Michael wang
    Cc: Paul Gortmaker
    Cc: Rafael J. Wysocki
    Cc: Srivatsa S. Bhat
    Cc: Toshi Kani
    Link: http://lkml.kernel.org/r/53758B12.8060609@cn.fujitsu.com
    Signed-off-by: Ingo Molnar

    Lai Jiangshan
     
  • Tejun reported that his resume was failing due to order-3 allocations
    from sched_domain building.

    Replace the NR_CPUS arrays in there with a dynamically allocated
    array.
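
    The shape of the change, sketched (structure and field names are
    illustrative; the kernel allocates with kzalloc, calloc here):

        #include <stdlib.h>

        /* Before: an embedded NR_CPUS-sized array inflates the containing
         * structure until its allocation needs several contiguous pages. */
        struct before {
                int elements[4096 /* NR_CPUS */];
        };

        /* After: size the array by the cpus actually possible */
        struct after {
                int *elements;
        };

        static int after_init(struct after *a, unsigned int nr_cpu_ids)
        {
                a->elements = calloc(nr_cpu_ids, sizeof(*a->elements));
                return a->elements ? 0 : -1;
        }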

    Reported-by: Tejun Heo
    Signed-off-by: Peter Zijlstra
    Cc: Johannes Weiner
    Cc: Steven Rostedt
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/n/tip-7cysnkw1gik45r864t1nkudh@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Tejun reported that his resume was failing due to order-3 allocations
    from sched_domain building.

    Replace the NR_CPUS arrays in there with a dynamically allocated
    array.

    Reported-by: Tejun Heo
    Signed-off-by: Peter Zijlstra
    Acked-by: Juri Lelli
    Cc: Johannes Weiner
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/n/tip-kat4gl1m5a6dwy6nzuqox45e@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Michael Kerrisk noticed that creating SCHED_DEADLINE reservations
    with certain parameters (e.g., a runtime of something near 2^64 ns)
    can cause a system freeze for some amount of time.

    The problem is that in the interface we have

    u64 sched_runtime;

    while internally we need to have a signed runtime (to cope with
    budget overruns)

    s64 runtime;

    When we set up a new dl_entity we copy the first value into the
    second. The cast turns out negative when sched_runtime is too big,
    and this causes the scheduler to go crazy right from the start.

    Moreover, considering how we deal with deadlines wraparound

    (s64)(a - b) < 0

    we also have to restrict acceptable values for sched_{deadline,period}.

    This patch fixes the problem by checking that user parameters are
    always below 2^63 ns (still large enough for everyone).

    It also rewrites the other conditions that we check, since in
    __checkparam_dl we don't have to deal with deadline wraparounds,
    and what we have now erroneously fails when the difference between
    values is too big.
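
    The cast in isolation, as a runnable demo (assumes two's-complement
    behavior, as on the kernel's supported targets):

        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
                uint64_t sched_runtime = UINT64_MAX - 1;   /* ~2^64 ns from userspace */
                int64_t runtime = (int64_t)sched_runtime;  /* what dl_entity setup did */

                printf("runtime = %lld\n", (long long)runtime);  /* -2: negative budget */

                /* the fix: refuse anything at or above 2^63 ns up front */
                if (sched_runtime >= (1ULL << 63))
                        printf("parameter rejected\n");
                return 0;
        }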

    Reported-by: Michael Kerrisk
    Suggested-by: Peter Zijlstra
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra
    Cc:
    Cc: Dario Faggioli
    Cc: Dave Jones
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140513141131.20d944f81633ee937f256385@gmail.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
     
  • The way we read POSIX, one should only call sched_getparam() when
    sched_getscheduler() returns either SCHED_FIFO or SCHED_RR.

    Given that we currently return sched_param::sched_priority=0 for all
    others, extend the same behaviour to SCHED_DEADLINE.

    Requested-by: Michael Kerrisk
    Signed-off-by: Peter Zijlstra
    Acked-by: Michael Kerrisk
    Cc: Dario Faggioli
    Cc: linux-man
    Cc: "Michael Kerrisk (man-pages)"
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc:
    Link: http://lkml.kernel.org/r/20140512205034.GH13467@laptop.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The scheduler uses policy=-1 to preserve the current policy state to
    implement sys_sched_setparam(). This got exposed to userspace by
    accident through sys_sched_setattr(); cure this.

    Reported-by: Michael Kerrisk
    Signed-off-by: Peter Zijlstra
    Acked-by: Michael Kerrisk
    Cc:
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140509085311.GJ30445@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The documented[1] behavior of sched_attr() in the proposed man page
    text is:

    sched_attr::size must be set to the size of the structure, as in
    sizeof(struct sched_attr). If the provided structure is smaller
    than the kernel structure, any additional fields are assumed to be
    '0'. If the provided structure is larger than the kernel structure,
    the kernel verifies that all additional fields are '0'; if not, the
    syscall will fail with -E2BIG.

    As currently implemented, sched_copy_attr() returns -EFBIG for this
    case, but the logic in sys_sched_setattr() converts that error to
    -EFAULT. This patch fixes the behavior.

    [1] http://thread.gmane.org/gmane.linux.kernel/1615615/focus=1697760
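
    A hedged sketch of the oversized-structure rule (mirrors the
    documented behavior above, not the exact sched_copy_attr() code):

        #include <errno.h>
        #include <stddef.h>

        /* User struct larger than the kernel's: every extra byte must be
         * zero, otherwise the call fails with -E2BIG (not -EFAULT). */
        static int check_trailing_zeroes(const unsigned char *uattr,
                                         size_t usize, size_t ksize)
        {
                for (size_t i = ksize; i < usize; i++)
                        if (uattr[i] != 0)
                                return -E2BIG;
                return 0;
        }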

    Signed-off-by: Michael Kerrisk
    Signed-off-by: Peter Zijlstra
    Cc:
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/536CEC17.9070903@gmail.com
    Signed-off-by: Ingo Molnar

    Michael Kerrisk
     
  • Linus Torvalds
     
  • Pull two powerpc fixes from Ben Herrenschmidt:
    "Here are a couple of fixes for 3.15. One from Anton fixes a nasty
    regression I introduced when trying to fix a loss of irq_work whose
    consequences is that we can completely lose timer interrupts on a
    CPU... not pretty.

    The other one is a change to our PCIe reset hook to use a firmware
    call instead of direct config space accesses to trigger a fundamental
    reset on the root port. This is necessary so that the FW gets a
    chance to disable the link down error monitoring, which would
    otherwise trip and cause subsequent fatal EEH error"

    * 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
    powerpc: irq work racing with timer interrupt can result in timer interrupt hang
    powerpc/powernv: Reset root port in firmware

    Linus Torvalds