02 Oct, 2012

1 commit

  • Pull scheduler changes from Ingo Molnar:
    "Continued quest to clean up and enhance the cputime code by Frederic
    Weisbecker, in preparation for future tickless kernel features.

    Other than that, smallish changes."

    Fix up trivial conflicts due to additions next to each other in arch/{x86/}Kconfig

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
    cputime: Make finegrained irqtime accounting generally available
    cputime: Gather time/stats accounting config options into a single menu
    ia64: Reuse system and user vtime accounting functions on task switch
    ia64: Consolidate user vtime accounting
    vtime: Consolidate system/idle context detection
    cputime: Use a proper subsystem naming for vtime related APIs
    sched: cpu_power: enable ARCH_POWER
    sched/nohz: Clean up select_nohz_load_balancer()
    sched: Fix load avg vs. cpu-hotplug
    sched: Remove __ARCH_WANT_INTERRUPTS_ON_CTXSW
    sched: Fix nohz_idle_balance()
    sched: Remove useless code in yield_to()
    sched: Add time unit suffix to sched sysctl knobs
    sched/debug: Limit sd->*_idx range on sysctl
    sched: Remove AFFINE_WAKEUPS feature flag
    s390: Remove leftover account_tick_vtime() header
    cputime: Consolidate vtime handling on context switch
    sched: Move cputime code to its own file
    cputime: Generalize CONFIG_VIRT_CPU_ACCOUNTING
    tile: Remove SD_PREFER_LOCAL leftover
    ...

    Linus Torvalds
     

17 Sep, 2012

1 commit

  • This reverts commit 970e178985cadbca660feb02f4d2ee3a09f7fdda.

    Nikolay Ulyanitsky reported that the 3.6-rc5 kernel has a 15-20%
    performance drop on PostgreSQL 9.2 on his machine (running "pgbench").

    Borislav Petkov was able to reproduce this, and bisected it to this
    commit 970e178985ca ("sched: Improve scalability via 'CPU buddies' ...")
    apparently because the new single-idle-buddy model simply doesn't find
    idle CPUs to reschedule on aggressively enough.

    Mike Galbraith suspects that it is likely due to the user-mode spinlocks
    in PostgreSQL not reacting well to preemption, but we don't really know
    the details - I'll just revert the commit for now.

    There are hopefully other approaches to improve scheduler scalability
    without it causing these kinds of downsides.

    Reported-by: Nikolay Ulyanitsky
    Bisected-by: Borislav Petkov
    Acked-by: Mike Galbraith
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

13 Sep, 2012

2 commits

  • There is no load_balancer to be selected any more; the function just
    sets the state of the nohz tick to stopped.

    So rename the function, pass the 'cpu' as a parameter and then
    remove the now-useless call from tick_nohz_restart_sched_tick().

    [ s/set_nohz_tick_stopped/nohz_balance_enter_idle/g
    s/clear_nohz_tick_stopped/nohz_balance_exit_idle/g ]
    Signed-off-by: Alex Shi
    Acked-by: Suresh Siddha
    Cc: Venkatesh Pallipadi
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1347261059-24747-1-git-send-email-alex.shi@intel.com
    Signed-off-by: Ingo Molnar

    Alex Shi
     
  • On tickless systems, one CPU runs load balance for all idle CPUs.

    The cpu_load of this CPU is updated before starting the load balance
    of each of the other idle CPUs. We should instead update the cpu_load
    of the balance_cpu.

    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra
    Cc: Venkatesh Pallipadi
    Cc: Suresh Siddha
    Link: http://lkml.kernel.org/r/1347509486-8688-1-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar

    Vincent Guittot
     

04 Sep, 2012

3 commits

  • Merge in the current fixes branch; we are going to apply dependent patches.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Fix two kernel-doc warnings in kernel/sched/fair.c:

    Warning(kernel/sched/fair.c:3660): Excess function parameter 'cpus' description in 'update_sg_lb_stats'
    Warning(kernel/sched/fair.c:3806): Excess function parameter 'cpus' description in 'update_sd_lb_stats'

    Signed-off-by: Randy Dunlap
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/50303714.3090204@xenotime.net
    Signed-off-by: Ingo Molnar

    Randy Dunlap
     
  • migrate_tasks() uses _pick_next_task_rt() to get tasks from the
    real-time runqueues to be migrated. When rt_rq is throttled
    _pick_next_task_rt() won't return anything, in which case
    migrate_tasks() can't move all threads over and gets stuck in an
    infinite loop.

    Instead unthrottle rt runqueues before migrating tasks.

    Additionally: move unthrottle_offline_cfs_rqs() to rq_offline_fair()

    Signed-off-by: Peter Boonstoppel
    Signed-off-by: Peter Zijlstra
    Cc: Paul Turner
    Link: http://lkml.kernel.org/r/5FBF8E85CA34454794F0F7ECBA79798F379D3648B7@HQMAIL04.nvidia.com
    Signed-off-by: Ingo Molnar

    Peter Boonstoppel
     

14 Aug, 2012

4 commits

  • Since the power-saving code has been removed from the scheduler, the
    implementation in this function is dead code and even pollutes other
    logic: for example, 'want_sd' never gets a chance to be set to '0',
    which removes the effect of SD_WAKE_AFFINE here.

    So clean up the obsolete code, including SD_PREFER_LOCAL.

    Signed-off-by: Alex Shi
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/5028F431.6000306@intel.com
    Signed-off-by: Thomas Gleixner

    Alex Shi
     
  • As we already have dst_rq in lb_env, using or changing "this_rq" does
    not make sense.

    This patch replaces "this_rq" with dst_rq in load_balance(), so we no
    longer need to change "this_rq" while processing LBF_SOME_PINNED.

    Signed-off-by: Michael Wang
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/501F8357.3070102@linux.vnet.ibm.com
    Signed-off-by: Thomas Gleixner

    Michael Wang
     
  • It should be sched_nr_latency so fix it before it annoys me more.

    Signed-off-by: Borislav Petkov
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1344435364-18632-1-git-send-email-bp@amd64.org
    Signed-off-by: Thomas Gleixner

    Borislav Petkov
     
  • Peter Portante reported that for large cgroup hierarchies (and/or on
    large CPU counts) we get immense lock contention on rq->lock and stuff
    stops working properly.

    His workload was a ton of processes, each in their own cgroup,
    everybody idling except for a sporadic wakeup once every so often.

    It was found that:

    schedule()
      idle_balance()
        load_balance()
          local_irq_save()
          double_rq_lock()
          update_h_load()
            walk_tg_tree(tg_load_down)
              tg_load_down()
    Results in an entire cgroup hierarchy walk under rq->lock for every
    new-idle balance and since new-idle balance isn't throttled this
    results in a lot of work while holding the rq->lock.

    This patch does two things: it removes the work from under rq->lock,
    based on the good principle of race-and-pray which is widely employed
    in the load-balancer as a whole; and it throttles the update_h_load()
    calculation to at most once per jiffy (a simplified sketch of the
    throttle follows this entry).

    I considered excluding update_h_load() for new-idle balance
    altogether, but purely relying on regular balance passes to update
    this data might not work out under some rare circumstances where the
    new-idle busiest isn't the regular busiest for a while (unlikely, but
    a nightmare to debug if someone hits it and suffers).

    Cc: pjt@google.com
    Cc: Larry Woodman
    Cc: Mike Galbraith
    Reported-by: Peter Portante
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-aaarrzfpnaam7pqrekofu8a6@git.kernel.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
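
    A minimal sketch of the once-per-jiffy throttle described above, using
    hypothetical names rather than the kernel's actual data structures:

        struct toy_rq {
                unsigned long h_load_stamp;   /* jiffy of the last recomputation */
                unsigned long h_load;         /* cached hierarchy load */
        };

        static unsigned long walk_hierarchy(void)
        {
                return 42;                    /* stand-in for the expensive tree walk */
        }

        static unsigned long get_h_load(struct toy_rq *rq, unsigned long now_jiffies)
        {
                /* recompute at most once per jiffy; callers in between reuse
                 * the slightly stale cached value ("race and pray") */
                if (rq->h_load_stamp != now_jiffies) {
                        rq->h_load_stamp = now_jiffies;
                        rq->h_load = walk_hierarchy();
                }
                return rq->h_load;
        }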
     

24 Jul, 2012

4 commits

  • Current load balance scheme requires only one cpu in a
    sched_group (balance_cpu) to look at other peer sched_groups for
    imbalance and pull tasks towards itself from a busy cpu. Tasks
    thus pulled by balance_cpu could later get picked up by cpus
    that are in the same sched_group as that of balance_cpu.

    This scheme however fails to pull tasks that are not allowed to
    run on balance_cpu (but are allowed to run on other cpus in its
    sched_group). That can affect fairness and in some worst case
    scenarios cause starvation.

    Consider a two core (2 threads/core) system running tasks as
    below:

       Core0               Core1
      /     \             /     \
    C0       C1         C2       C3
    |        |          |        |
    v        v          v        v
    F0       T1         F1     [idle]
             T2

    F0 = SCHED_FIFO task (pinned to C0)
    F1 = SCHED_FIFO task (pinned to C2)
    T1 = SCHED_OTHER task (pinned to C1)
    T2 = SCHED_OTHER task (pinned to C1 and C2)

    F1 could become a cpu hog, which will starve T2 unless C1 pulls
    it. Between C0 and C1 however, C0 is required to look for
    imbalance between cores, which will fail to pull T2 towards
    Core0. T2 will starve eternally in this case. The same scenario
    can arise in presence of non-rt tasks as well (say we replace F1
    with high irq load).

    We tackle this problem by having balance_cpu move pinned tasks
    to one of its sibling cpus (where they can run). We first check
    whether the load balance goal can be met by ignoring pinned tasks;
    failing that, we retry move_tasks() with a new env->dst_cpu (a toy
    model of this retry follows this entry).

    This patch modifies load balance semantics on who can move load
    towards a given cpu in a given sched_domain.

    Before this patch, a given_cpu or a ilb_cpu acting on behalf of
    an idle given_cpu is responsible for moving load to given_cpu.

    With this patch applied, balance_cpu can in addition decide on
    moving some load to a given_cpu.

    There is a remote possibility that excess load could get moved
    as a result of this (balance_cpu and given_cpu/ilb_cpu deciding
    *independently* and at *same* time to move some load to a
    given_cpu). However we should see less of such conflicting
    decisions in practice and moreover subsequent load balance
    cycles should correct the excess load moved to given_cpu.

    Signed-off-by: Srivatsa Vaddagiri
    Signed-off-by: Prashanth Nageshappa
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/4FE06CDB.2060605@linux.vnet.ibm.com
    [ minor edits ]
    Signed-off-by: Ingo Molnar

    Srivatsa Vaddagiri
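
    A toy model of the retry-with-a-sibling idea described above (all
    names and types are hypothetical, not the kernel's load_balance()):

        #include <stdbool.h>
        #include <stdio.h>

        struct toy_task {
                int  allowed_cpu;       /* single allowed CPU, for simplicity */
                bool moved;
        };

        /* move every not-yet-moved task that is allowed to run on dst_cpu */
        static unsigned int try_move(struct toy_task *t, int n, int dst_cpu)
        {
                unsigned int moved = 0;

                for (int i = 0; i < n; i++) {
                        if (!t[i].moved && t[i].allowed_cpu == dst_cpu) {
                                t[i].moved = true;
                                moved++;
                        }
                }
                return moved;
        }

        int main(void)
        {
                /* T2 from the example above, simplified to a single allowed CPU (C1) */
                struct toy_task src_rq[] = { { .allowed_cpu = 1, .moved = false } };
                int balance_cpu = 0, sibling = 1;

                /* first pass targets balance_cpu; if nothing could be moved,
                 * retry with a sibling as the new destination */
                if (!try_move(src_rq, 1, balance_cpu))
                        try_move(src_rq, 1, sibling);

                printf("pinned task moved: %s\n", src_rq[0].moved ? "yes" : "no");
                return 0;
        }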
     
  • While load balancing, if all tasks on the source runqueue are pinned,
    we retry after excluding the corresponding source cpu. However, loop counters
    env.loop and env.loop_break are not reset before retrying, which can lead
    to failure in moving the tasks. In this patch we reset env.loop and
    env.loop_break to their initial values before we retry.

    Signed-off-by: Prashanth Nageshappa
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/4FE06EEF.2090709@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Prashanth Nageshappa
     
  • Members of 'struct lb_env' are not ordered in a way that avoids
    compiler-added padding on 64-bit architectures. In this patch we
    reorder those struct members and help reduce the size of the
    structure from 96 bytes to 80 bytes on 64-bit architectures (a toy
    illustration of the effect follows this entry).

    Suggested-by: Srivatsa Vaddagiri
    Signed-off-by: Prashanth Nageshappa
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/4FE06DDE.7000403@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Prashanth Nageshappa
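
    A self-contained illustration of the effect on a made-up struct (this
    is not the actual lb_env layout): grouping same-sized members removes
    the padding holes the compiler would otherwise insert.

        #include <stdio.h>

        struct before {                 /* pointers interleaved with ints */
                void *a;                /* 8 bytes */
                int   b;                /* 4 bytes + 4 bytes padding */
                void *c;                /* 8 bytes */
                int   d;                /* 4 bytes + 4 bytes padding */
                void *e;                /* 8 bytes */
        };                              /* typically 40 bytes on 64-bit */

        struct after {                  /* pointers first, ints packed together */
                void *a;
                void *c;
                void *e;
                int   b;
                int   d;
        };                              /* typically 32 bytes on 64-bit */

        int main(void)
        {
                printf("before: %zu bytes, after: %zu bytes\n",
                       sizeof(struct before), sizeof(struct after));
                return 0;
        }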
     
  • Traversing an entire package is not only expensive, it also leads to
    tasks bouncing all over a partially idle and possibly quite large
    package. Fix that up by assigning a 'buddy' CPU to try to motivate.
    Each buddy may try to motivate that one other CPU; if it's busy,
    tough, it may then try its SMT sibling, but that's all this
    optimization is allowed to cost.

    Sibling cache buddies are cross-wired to prevent bouncing.

    4 socket 40 core + SMT Westmere box, single 30 sec tbench runs, higher is better:

    clients      1      2      4      8     16     32     64    128
    ...............................................................
    pre         30     41    118    645   3769   6214  12233  14312
    post       299    603   1211   2418   4697   6847  11606  14557

    A nice increase in performance.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1339471112.7352.32.camel@marge.simpson.net
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     

09 Jun, 2012

2 commits

  • Pull scheduler fixes from Ingo Molnar.

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched: Fix the relax_domain_level boot parameter
    sched: Validate assumptions in sched_init_numa()
    sched: Always initialize cpu-power
    sched: Fix domain iteration
    sched/rt: Fix lockdep annotation within find_lock_lowest_rq()
    sched/numa: Load balance between remote nodes
    sched/x86: Calculate booted cores after construction of sibling_mask

    Linus Torvalds
     
  • Fix lots of new kernel-doc warnings in kernel/sched/fair.c:

    Warning(kernel/sched/fair.c:3625): No description found for parameter 'env'
    Warning(kernel/sched/fair.c:3625): Excess function parameter 'sd' description in 'update_sg_lb_stats'
    Warning(kernel/sched/fair.c:3735): No description found for parameter 'env'
    Warning(kernel/sched/fair.c:3735): Excess function parameter 'sd' description in 'update_sd_pick_busiest'
    Warning(kernel/sched/fair.c:3735): Excess function parameter 'this_cpu' description in 'update_sd_pick_busiest'
    .. more warnings

    Signed-off-by: Randy Dunlap
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

06 Jun, 2012

2 commits

  • Often when we run into misshapen topologies the balance iteration
    fails to update the cpu power properly and we'll end up in /0 traps.

    Always initialize the cpu-power to a semi-sane value so that we can
    at least boot the machine, even if the load-balancer might not
    function correctly (a tiny sketch of the idea follows this entry).

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-3lbhyj25sr169ha7z3qht5na@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
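
    A tiny sketch of the defensive idea with hypothetical names (the
    kernel's analogous constant in this era is SCHED_POWER_SCALE): give
    every group a sane non-zero power up front so later divisions by it
    cannot trap, even if the topology iteration goes wrong.

        struct toy_group {
                unsigned long power;
        };

        #define TOY_POWER_SCALE 1024UL          /* nominal power of one cpu */

        static void init_group_power(struct toy_group *g)
        {
                g->power = TOY_POWER_SCALE;     /* semi-sane default, never zero */
        }

        static unsigned long load_per_power(unsigned long load, const struct toy_group *g)
        {
                /* guard the division anyway; zero power would otherwise trap */
                return load * TOY_POWER_SCALE / (g->power ? g->power : 1);
        }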
     
  • Weird topologies can lead to asymmetric domain setups. This needs
    further consideration since these setups are typically non-minimal
    too.

    For now, make it work by adding an extra mask selecting which CPUs
    are allowed to iterate up.

    The topology that triggered it is the one from David Rientjes:

    10 20 20 30
    20 10 20 20
    20 20 10 20
    30 20 20 10

    resulting in boxes that wouldn't even boot.

    Reported-by: David Rientjes
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-3p86l9cuaqnxz7uxsojmz5rm@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

30 May, 2012

3 commits

  • Since nr_cpus_allowed is used outside of sched/rt.c and wants to be
    used outside of there more, move it to a more natural site.

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-kr61f02y9brwzkh6x53pdptm@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • We could re-read rq->rt_avg after we validated it was smaller than
    total, invalidating the check and resulting in an unintended negative
    (a read-once sketch follows this entry).

    Signed-off-by: Peter Zijlstra
    Cc: David Rientjes
    Link: http://lkml.kernel.org/r/1337688268.9698.29.camel@twins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
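
    A sketch of the read-once pattern that avoids the hazard above; the
    names are hypothetical and the atomic counter merely stands in for
    rq->rt_avg.

        #include <stdatomic.h>

        static unsigned long scale_available(atomic_ulong *shared_avg,
                                             unsigned long total)
        {
                unsigned long avg = atomic_load(shared_avg);  /* read exactly once */

                if (avg > total)        /* validate the snapshot ... */
                        avg = total;

                return total - avg;     /* ... and use that same snapshot, so the
                                         * result can never wrap "negative" */
        }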
     
  • SD_OVERLAP exists to allow overlapping groups, overlapping groups
    appear in NUMA topologies that aren't fully connected.

    The typical result of not fully connected NUMA is that each cpu (or
    rather node) will have different spans for a particular distance.
    However due to how sched domains are traversed -- only the first cpu
    in the mask goes one level up -- the next level only cares about the
    spans of the cpus that went up.

    Due to this, two things were observed to be broken:

    - build_overlap_sched_groups() -- since it's possible the cpu we're
    building the groups for exists in multiple (or all) groups, the
    selection criteria of the first group didn't ensure there was a cpu
    for which it was true that cpumask_first(span) == cpu. Thus
    load-balancing would terminate.

    - update_group_power() -- assumed that the cpu span of the first
    group of the domain was covered by all groups of the child domain.
    The above explains why this isn't true, so deal with it.

    Signed-off-by: Peter Zijlstra
    Cc: David Rientjes
    Link: http://lkml.kernel.org/r/1337788843.9783.14.camel@laptop
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

17 May, 2012

1 commit

  • It's been broken forever (i.e. it's not scheduling in a power
    aware fashion), as reported by Suresh and others sending
    patches, and nobody cares enough to fix it properly ...
    so remove it to make space free for something better.

    There are various problems with the code as it stands today, first
    and foremost the user interface, which is bound to topology
    levels and has multiple values per level. This results in a
    state explosion which the administrator or distro needs to
    master, and almost nobody does.

    Furthermore, large configuration state spaces aren't good: it
    means the thing doesn't just work right, because it's either
    under so many impossible-to-meet constraints, or, even if
    there's an achievable state, workloads have to be aware of
    it precisely and can never meet it for dynamic workloads.

    So pushing this kind of decision to user-space was a bad idea
    even with a single knob - it's exponentially worse with knobs
    on every node of the topology.

    There is a proposal to replace the user interface with a single
    3 state knob:

    sched_balance_policy := { performance, power, auto }

    where 'auto' would be the preferred default which looks at things
    like Battery/AC mode and possible cpufreq state or whatever the hw
    exposes to show us power use expectations - but there's been no
    progress on it in the past many months.

    Aside from that, the actual implementation of the various knobs
    is known to be broken. There have been sporadic attempts at
    fixing things but these always stop short of reaching a mergeable
    state.

    Therefore this wholesale removal, in the hope of spurring
    people who care to come forward once again and work on a
    coherent replacement.

    Signed-off-by: Peter Zijlstra
    Cc: Suresh Siddha
    Cc: Arjan van de Ven
    Cc: Vincent Guittot
    Cc: Vaidyanathan Srinivasan
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/1326104915.2442.53.camel@twins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

14 May, 2012

3 commits

  • Group imbalance is meant to deal with situations where affinity masks
    and sched domains don't align well, such as 3 cpus from one group and
    6 from another. In this case the domain based balancer will want to
    put an equal amount of tasks on each side even though they don't have
    equal cpus.

    Currently group_imb is set whenever two cpus of a group have a weight
    difference of at least one avg task and the heaviest cpu has at least
    two tasks. A group with imbalance set will always be picked as busiest
    and a balance pass will be forced.

    The problem is that even when there are no affinity masks this stuff
    can trigger and cause weird balancing decisions. E.g. the observed
    behaviour was that of 6 cpus, 5 had 2 tasks and 1 had 3 tasks; due to
    the difference of 1 avg load (they all had the same weight) and
    nr_running being >1, the group_imbalance logic triggered and did the
    weird thing of pulling more load instead of trying to move the 1
    excess task to the other domain of 6 cpus, which had 5 cpus with 2
    tasks and 1 cpu with 1 task.

    Curb the group_imbalance stuff by making the nr_running condition
    weaker: also track min_nr_running and use the difference in
    nr_running over the set instead of the absolute max nr_running (a toy
    version of the check follows this entry).

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-9s7dedozxo8kjsb9kqlrukkf@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
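
    A toy version of the weakened check described above; the field names
    are hypothetical, not the kernel's actual statistics structure.

        struct toy_group_stats {
                unsigned long max_cpu_load;
                unsigned long min_cpu_load;
                unsigned long avg_load_per_task;
                unsigned int  max_nr_running;
                unsigned int  min_nr_running;
        };

        static int group_imbalanced(const struct toy_group_stats *s)
        {
                /* old condition: the heaviest cpu merely had to run more than
                 * one task; new condition: the spread in nr_running across the
                 * group must exceed one as well */
                return (s->max_cpu_load - s->min_cpu_load) >= s->avg_load_per_task &&
                       (s->max_nr_running - s->min_nr_running) > 1;
        }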
     
  • While investigating why the load-balancer was doing funny things I
    found that the rq->cpu_load[] tables were completely screwy... a bit
    more digging revealed that the updates that got through were missing
    ticks followed by a catchup of 2 ticks.

    The catchup assumes the cpu was idle during that time (since only nohz
    can cause missed ticks and the machine is idle etc..) this means that
    esp. the higher indices were significantly lower than they ought to
    be.

    The reason for this is that it's not correct to compare against jiffies
    on every jiffy on any other cpu than the cpu that updates jiffies.

    This patch kludges around it by only doing the catch-up stuff from
    nohz_idle_balance() and doing the regular stuff unconditionally from
    the tick.

    Signed-off-by: Peter Zijlstra
    Cc: pjt@google.com
    Cc: Venkatesh Pallipadi
    Link: http://lkml.kernel.org/n/tip-tp4kj18xdd5aj4vvj0qg55s2@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Patches c22402a2f ("sched/fair: Let minimally loaded cpu balance the
    group") and 0ce90475 ("sched/fair: Add some serialization to the
    sched_domain load-balance walk") are horribly broken so revert them.

    The problem is that while it sounds good to have the minimally loaded
    cpu do the pulling of more load, the way we walk the domains there is
    absolutely no guarantee this cpu will actually get to the domain. In
    fact it's very likely it won't. Therefore the higher up the tree we
    get, the less likely it is we'll balance at all.

    The first-of-the-mask approach always walks up; while sucky in that it
    accumulates load on the first cpu and needs extra passes to spread it
    out, it at least guarantees a cpu gets up that far and load-balancing
    happens at all.

    Since it's now always the first cpu, and idle cpus should always be
    able to balance so they get a task as fast as possible, we can also do
    away with the added serialization.

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-rpuhs5s56aiv1aw7khv9zkw6@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

09 May, 2012

4 commits

  • More function argument passing reduction.

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-v66ivjfqdiqdso01lqgqx6qf@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Since the sched_domain walk is completely unserialized (!SD_SERIALIZE)
    it is possible that multiple cpus in the group get elected to do the
    next level. Avoid this by adding some serialization.

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-vqh9ai6s0ewmeakjz80w4qz6@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Currently we let the leftmost (or first idle) cpu ascend the
    sched_domain tree and perform load-balancing. The result is that the
    busiest cpu in the group might be performing this function and pull
    more load to itself. The next load balance pass will then try to
    equalize this again.

    Change this to pick the least loaded cpu to perform higher domain
    balancing.

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-v8zlrmgmkne3bkcy9dej1fvm@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Since there's a PID space limit of 30 bits (see
    futex.h:FUTEX_TID_MASK), and allocating that many tasks (assuming a
    lower bound of 2 pages per task) would still take 8T of memory, it
    seems reasonable to say that unsigned int is sufficient for
    rq->nr_running (the arithmetic is spelled out after this entry).

    When we do get anywhere near that amount of tasks I suspect other
    things would go funny, load-balancer load computations would really
    need to be hoisted to 128bit etc.

    So save a few bytes and convert rq->nr_running and friends to
    unsigned int.

    Suggested-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-y3tvyszjdmbibade5bw8zl81@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
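
    A back-of-the-envelope check of the bound quoted above, assuming
    4 KiB pages:

        #include <stdio.h>

        int main(void)
        {
                unsigned long long tasks = 1ULL << 30;        /* FUTEX_TID_MASK space */
                unsigned long long bytes = tasks * 2 * 4096;  /* two 4 KiB pages each */

                /* 2^30 * 2 * 2^12 = 2^43 bytes */
                printf("%llu TiB\n", bytes >> 40);            /* prints: 8 TiB */
                return 0;
        }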
     

26 Apr, 2012

1 commit

  • Commits 367456c756a6 ("sched: Ditch per cgroup task lists for
    load-balancing") and 5d6523ebd ("sched: Fix load-balance wreckage")
    left some more wreckage.

    By setting loop_max unconditionally to ->nr_running, load-balancing
    could take a lot of time on very long runqueues (hackbench!). So keep
    the sysctl as the max limit on the number of tasks we'll iterate (a
    short sketch follows this entry).

    Furthermore, the min load filter for migration completely fails with
    cgroups since inequality in per-cpu state can easily lead to such
    small loads :/

    Furthermore the change to add new tasks to the tail of the queue
    instead of the head seems to have some effect.. not quite sure I
    understand why.

    Combined, these fixes solve the huge hackbench regression reported by
    Tim when hackbench is run in a cgroup.

    Reported-by: Tim Chen
    Acked-by: Tim Chen
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/1335365763.28150.267.camel@twins
    [ got rid of the CONFIG_PREEMPT tuning and made small readability edits ]
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
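
    A short sketch of the capping idea with hypothetical names; in the
    kernel the tunable in question is the sched_nr_migrate sysctl.

        struct toy_env {
                unsigned int loop_max;
        };

        static unsigned int min_uint(unsigned int a, unsigned int b)
        {
                return a < b ? a : b;
        }

        static void setup_loop_max(struct toy_env *env,
                                   unsigned int sysctl_nr_migrate,
                                   unsigned int src_nr_running)
        {
                /* keep the sysctl as an upper bound instead of always walking
                 * the whole (possibly very long) source runqueue */
                env->loop_max = min_uint(sysctl_nr_migrate, src_nr_running);
        }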
     

23 Mar, 2012

1 commit

  • kernel/sched/fair.c:420: warning: 'account_cfs_rq_runtime' declared inline after being called
    kernel/sched/fair.c:420: warning: previous declaration of 'account_cfs_rq_runtime' was here
    kernel/sched/fair.c:1165: warning: 'return_cfs_rq_runtime' declared inline after being called
    kernel/sched/fair.c:1165: warning: previous declaration of 'return_cfs_rq_runtime' was here

    (A minimal reproduction of this warning pattern follows this entry.)

    Reported-by: Andrew Morton
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20120321200717.49BB4A024E@akpm.mtv.corp.google.com
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
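
    The pair of warnings quoted above can be reproduced outside the kernel
    with a few lines of C (hypothetical names); GCC of that era warns when
    a function has already been called and only a later declaration marks
    it inline:

        static void account(void);              /* forward declaration, not inline */

        static int update(void)
        {
                account();                      /* first call site */
                return 0;
        }

        /* adding 'inline' only here triggers:
         *   warning: 'account' declared inline after being called
         *   warning: previous declaration of 'account' was here
         * the fix is to keep the declaration and the definition in agreement
         * (mark both inline, or neither) */
        static inline void account(void)
        {
        }

        int main(void)
        {
                return update();
        }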
     

21 Mar, 2012

1 commit

  • Pull scheduler changes for v3.4 from Ingo Molnar

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
    printk: Make it compile with !CONFIG_PRINTK
    sched/x86: Fix overflow in cyc2ns_offset
    sched: Fix nohz load accounting -- again!
    sched: Update yield() docs
    printk/sched: Introduce special printk_sched() for those awkward moments
    sched/nohz: Correctly initialize 'next_balance' in 'nohz' idle balancer
    sched: Cleanup cpu_active madness
    sched: Fix load-balance wreckage
    sched: Clean up parameter passing of proc_sched_autogroup_set_nice()
    sched: Ditch per cgroup task lists for load-balancing
    sched: Rename load-balancing fields
    sched: Move load-balancing arguments into helper struct
    sched/rt: Do not submit new work when PI-blocked
    sched/rt: Prevent idle task boosting
    sched/wait: Add __wake_up_all_locked() API
    sched/rt: Document scheduler related skip-resched-check sites
    sched/rt: Use schedule_preempt_disabled()
    sched/rt: Add schedule_preempt_disabled()
    sched/rt: Do not throttle when PI boosting
    sched/rt: Keep period timer ticking when rt throttling is active
    ...

    Linus Torvalds
     

13 Mar, 2012

2 commits

  • The 'next_balance' field of 'nohz' idle balancer must be initialized
    to jiffies. Since jiffies is initialized to negative 300 seconds the
    'nohz' idle balancer does not run for the first 300s (5mins) after
    bootup. If no new processes are spawned or no idle cycles happen, the
    load on the cpus will remain unbalanced for that duration.

    Signed-off-by: Diwakar Tundlam
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1DD7BFEDD3147247B1355BEFEFE4665237994F30EF@HQMAIL04.nvidia.com
    Signed-off-by: Ingo Molnar

    Diwakar Tundlam
     
  • Commit 367456c ("sched: Ditch per cgroup task lists for
    load-balancing") completely wrecked load-balancing due to
    a few silly mistakes.

    Correct those and remove more pointless code.

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-zk04ihygwxn7qqrlpaf73b0r@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

01 Mar, 2012

2 commits

  • Per cgroup load-balance has numerous problems, chief amongst them that
    there is no real sane order in them. So stop pretending it makes sense
    and enqueue all tasks on a single list.

    This also allows us to more easily fix the fwd progress issue
    uncovered by the lock-break stuff. Rotate the list on failure to
    migrate and limit the total iterations to nr_running (which with
    releasing the lock isn't strictly accurate but close enough).

    Also add a filter that skips very light tasks on the first attempt
    around the list; this attempts to avoid shooting whole cgroups around
    without affecting overall balance.

    Signed-off-by: Peter Zijlstra
    Cc: pjt@google.com
    Link: http://lkml.kernel.org/n/tip-tx8yqydc7eimgq7i4rkc3a4g@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • s/env->this_/env->dst_/g
    s/env->busiest_/env->src_/g
    s/pull_task/move_task/g

    Makes everything clearer.

    Signed-off-by: Peter Zijlstra
    Cc: pjt@google.com
    Link: http://lkml.kernel.org/n/tip-0yvgms8t8x962drpvl0fu0kk@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra