01 Mar, 2010

1 commit

  • Make rcu_dereference() of runqueue data structures be
    rcu_dereference_sched().

    Located-by: Ingo Molnar
    Signed-off-by: Paul E. McKenney
    Cc: laijs@cn.fujitsu.com
    Cc: dipankar@in.ibm.com
    Cc: mathieu.desnoyers@polymtl.ca
    Cc: josh@joshtriplett.org
    Cc: dvhltc@us.ibm.com
    Cc: niv@us.ibm.com
    Cc: peterz@infradead.org
    Cc: rostedt@goodmis.org
    Cc: Valdis.Kletnieks@vt.edu
    Cc: dhowells@redhat.com
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul E. McKenney
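
    As a hedged, kernel-style illustration of the pattern this commit
    switches to (the structure and pointer below are made-up names, not the
    scheduler's runqueue code): rcu_dereference_sched() documents that the
    read side is protected by having preemption disabled, rather than by
    rcu_read_lock().

      /* Illustrative sketch only: 'struct foo' and 'foo_ptr' are hypothetical. */
      #include <linux/rcupdate.h>
      #include <linux/preempt.h>

      struct foo {
              int val;
      };

      static struct foo __rcu *foo_ptr;

      static int read_foo_val(void)
      {
              struct foo *f;
              int val;

              preempt_disable();                    /* sched-RCU read-side section        */
              f = rcu_dereference_sched(foo_ptr);   /* instead of plain rcu_dereference() */
              val = f ? f->val : -1;
              preempt_enable();

              return val;
      }

    The accesses the commit touches are already protected this way; the
    change is only in which annotation documents that protection.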
     

26 Feb, 2010

1 commit

  • On platforms such as a dual-socket quad-core system, the scheduler load
    balancer does not detect load imbalances in certain scenarios. This
    leads to situations where one socket is completely busy (all 4 cores
    running 4 tasks) while the other socket is left completely idle. That
    hurts performance, as those 4 tasks share the memory controller,
    last-level cache bandwidth, etc., and we also don't take as much
    advantage of turbo mode as we would like.

    Some of the comparisons in the scheduler load-balancing code compare the
    "weighted cpu load scaled wrt the sched_group's cpu_power" against the
    "weighted average load per task that is not scaled wrt the sched_group's
    cpu_power" (an illustrative calculation follows this entry). While this
    has probably been broken for a long time (for multi-socket NUMA nodes
    etc.), the problem got aggravated by this recent change:

    |
    | commit f93e65c186ab3c05ce2068733ca10e34fd00125e
    | Author: Peter Zijlstra
    | Date: Tue Sep 1 10:34:32 2009 +0200
    |
    | sched: Restore __cpu_power to a straight sum of power
    |

    Also with this change, the sched group cpu power alone no longer reflects
    the group capacity that is needed to implement MC, MT performance
    (default) and power-savings (user-selectable) policies.

    We need to use the computed group capacity (sgs.group_capacity, that is
    computed using the SD_PREFER_SIBLING logic in update_sd_lb_stats()) to
    find out if the group with the max load is above its capacity and how
    much load to move etc.

    Reported-by: Ma Ling
    Initial-Analysis-by: Zhang, Yanmin
    Signed-off-by: Suresh Siddha
    [ -v2: build fix ]
    Signed-off-by: Peter Zijlstra
    Cc: # [2.6.32.x, 2.6.33.x]
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Suresh Siddha
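
    To see the units problem concretely, here is a small stand-alone
    calculation (plain C, not kernel code; SCHED_LOAD_SCALE and the example
    numbers are illustrative assumptions): comparing a cpu_power-scaled load
    against an unscaled per-task load mixes two different scales.

      #include <stdio.h>

      #define SCHED_LOAD_SCALE 1024UL         /* nominal capacity of one cpu */

      int main(void)
      {
              /* An example busy quad-core group: 4 tasks of weight 1024 each. */
              unsigned long group_load  = 4 * 1024;              /* raw weighted load */
              unsigned long group_power = 4 * SCHED_LOAD_SCALE;  /* summed cpu_power  */
              unsigned long nr_running  = 4;

              /* Group load scaled by the group's cpu_power. */
              unsigned long avg_load = group_load * SCHED_LOAD_SCALE / group_power;

              /* Per-task load, NOT scaled by cpu_power. */
              unsigned long load_per_task = group_load / nr_running;

              /* Scaling the per-task load the same way makes the two comparable. */
              unsigned long scaled_per_task =
                      load_per_task * SCHED_LOAD_SCALE / group_power;

              printf("avg_load=%lu  load_per_task=%lu  scaled_per_task=%lu\n",
                     avg_load, load_per_task, scaled_per_task);
              return 0;
      }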
     

16 Feb, 2010

1 commit


08 Feb, 2010

1 commit


23 Jan, 2010

1 commit

  • The ability to enqueue a task at the head of a SCHED_FIFO priority
    list is required to fix some violations of the POSIX scheduling policy.

    Extend the related functions with a "head" argument (a sketch of the
    pattern follows this entry).

    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra
    Tested-by: Carsten Emde
    Tested-by: Mathias Weber
    LKML-Reference:

    Thomas Gleixner
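
    A kernel-style sketch of what a "head" argument buys (a hedged
    illustration using the generic list helpers, not the actual sched_rt
    changes; the structures and function name here are made up):

      #include <linux/list.h>
      #include <linux/types.h>

      /* Hypothetical per-priority FIFO queue and queued task. */
      struct prio_queue {
              struct list_head tasks;
      };

      struct queued_task {
              struct list_head node;
      };

      /* 'head' selects head vs. tail insertion into the priority list,
       * so a task can be requeued ahead of its peers when POSIX demands it. */
      static void enqueue_fifo(struct prio_queue *q, struct queued_task *t,
                               bool head)
      {
              if (head)
                      list_add(&t->node, &q->tasks);       /* front of the list */
              else
                      list_add_tail(&t->node, &q->tasks);  /* normal FIFO tail  */
      }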
     

21 Jan, 2010

11 commits

  • We want to update the sched_group_powers when balance_cpu == this_cpu.

    Currently the group powers are updated only if the balance_cpu is the
    first CPU in the local group. But balance_cpu = this_cpu could also be
    the first idle cpu in the group. Hence fix the place where the group
    powers are updated.

    Signed-off-by: Gautham R Shenoy
    Signed-off-by: Joel Schopp
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Gautham R Shenoy
     
  • Since all load_balance() callers will have !NULL balance parameters we
    can now assume so and remove a few checks.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The two functions: load_balance{,_newidle}() are very similar, with the
    following differences:

    - rq->lock usage
    - sd->balance_interval updates
    - *balance check

    So replace the load_balance_newidle() call with load_balance(.idle =
    CPU_NEWLY_IDLE), explicitly unlock the rq->lock before calling it
    (double_lock_balance() would have done that anyway), and ignore the
    other differences for now.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • load_balance() and load_balance_newidle() look remarkably similar; one
    key point where they differ is the condition for when to do active
    balancing.

    So split out that logic into a separate function.

    One side effect is that previously load_balance_newidle() used to fail
    and return -1 under these conditions, whereas now it doesn't. I've not
    yet fully figured out the whole -1 return case for either
    load_balance{,_newidle}().

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Since load-balancing can hold rq->locks for quite a long while, allow
    breaking out early when there is lock contention.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Move code around to get rid of fwd declarations.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Again, since we only iterate the fair class, remove the abstraction.

    Since this is the last user of the rq_iterator, remove all that too.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Since we only ever iterate the fair class, do away with this abstraction.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Take out the sched_class methods for load-balancing.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Straightforward code movement.

    Since none of the load-balance abstractions are used anymore, do away
    with them and simplify the code somewhat. In preparation, move the code
    around.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • SD_PREFER_SIBLING is set at the CPU domain level if power saving isn't
    enabled, leading to many cache misses on large machines as we traverse
    the hierarchy looking for an idle shared cache to wake to. Change the
    enabler of select_idle_sibling() to SD_SHARE_PKG_RESOURCES, and enable
    it at the sibling domain level (a sketch of the gating this implies
    follows this entry).

    Reported-by: Lin Ming
    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
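
    Roughly, the gating looks like the sketch below (a hedged
    simplification, not the actual select_idle_sibling() code; the helper
    name is made up): walk up the domain hierarchy only while the levels
    actually share package resources, i.e. a cache.

      #include <linux/sched.h>

      /* Illustrative only: returns the topmost domain that still shares
       * a cache with 'sd', stopping as soon as SD_SHARE_PKG_RESOURCES
       * is no longer set.                                                */
      static struct sched_domain *
      highest_cache_sharing_domain(struct sched_domain *sd)
      {
              struct sched_domain *found = NULL;

              for (; sd; sd = sd->parent) {
                      if (!(sd->flags & SD_SHARE_PKG_RESOURCES))
                              break;  /* no shared cache at or above this level */
                      found = sd;
              }
              return found;
      }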
     

17 Jan, 2010

1 commit

  • kernel/sched: don't expose local functions

    The get_rr_interval_* functions are all class methods of
    struct sched_class. They are not exported so make them
    static.

    Signed-off-by: H Hartley Sweeten
    Cc: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    H Hartley Sweeten
     

17 Dec, 2009

2 commits

  • In order to remove the cfs_rq dependency from set_task_cpu() we
    need to ensure the task is cfs_rq invariant for all callsites.

    The simple approach is to subtract cfs_rq->min_vruntime from
    se->vruntime on dequeue, and to add cfs_rq->min_vruntime back on
    enqueue (see the toy illustration after this entry).

    However, this has the downside of breaking FAIR_SLEEPERS, since
    we lose the old vruntime as we only maintain the relative
    position.

    To solve this, we observe that we only migrate runnable tasks;
    we do this using deactivate_task(.sleep=0) and
    activate_task(.wakeup=0), therefore we can restrict the
    min_vruntime invariance to that state.

    The only other case is wakeup balancing, since we want to
    maintain the old vruntime we cannot make it relative on dequeue,
    but since we don't migrate inactive tasks, we can do so right
    before we activate it again.

    This is where we need the new pre-wakeup hook, we need to call
    this while still holding the old rq->lock. We could fold it into
    ->select_task_rq(), but since that has multiple callsites and
    would obfuscate the locking requirements, that seems like a
    fudge.

    This leaves the fork() case, simply make sure that ->task_fork()
    leaves the ->vruntime in a relative state.

    This covers all cases where set_task_cpu() gets called, and
    ensures it sees a relative vruntime.

    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
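
    The core of the min_vruntime trick can be shown with a toy, user-space
    illustration (the structures below are minimal stand-ins, not the
    kernel's; only the arithmetic is the point):

      #include <stdio.h>

      struct cfs_rq       { unsigned long long min_vruntime; };
      struct sched_entity { unsigned long long vruntime; };

      /* Dequeue for migration: store vruntime relative to the old queue. */
      static void make_vruntime_relative(struct sched_entity *se, struct cfs_rq *old)
      {
              se->vruntime -= old->min_vruntime;
      }

      /* Enqueue on the new cpu: rebase onto the new queue's min_vruntime. */
      static void make_vruntime_absolute(struct sched_entity *se, struct cfs_rq *new_rq)
      {
              se->vruntime += new_rq->min_vruntime;
      }

      int main(void)
      {
              struct cfs_rq old    = { .min_vruntime = 1000000 };
              struct cfs_rq new_rq = { .min_vruntime = 5000 };
              struct sched_entity se = { .vruntime = 1000250 };  /* 250 ahead of old min */

              make_vruntime_relative(&se, &old);     /* now 250, queue-relative   */
              make_vruntime_absolute(&se, &new_rq);  /* now 5250 on the new queue */

              printf("vruntime after migration: %llu\n", se.vruntime);
              return 0;
      }

    The task keeps its 250 units of lag relative to whichever queue it is
    on, which is exactly the relative-vruntime invariance set_task_cpu()
    needs to see.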
     
  • We should skip !SD_LOAD_BALANCE domains.

    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    LKML-Reference:
    CC: stable@kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

15 Dec, 2009

1 commit


09 Dec, 2009

9 commits

  • The normalized values are also recalculated in case the scaling factor
    changes.

    This patch updates the internally used scheduler tuning values that are
    normalized to one cpu in case a user sets new values via sysfs.

    Together with patch 2 of this series, this allows user-configured
    values to scale (or not) with cpu add/remove events taking place later.

    Signed-off-by: Christian Ehrhardt
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    [ v2: fix warning ]
    Signed-off-by: Ingo Molnar

    Christian Ehrhardt
     
  • As scaling now takes place on all kinds of cpu add/remove events, a
    user who configures values via proc should be able to choose whether
    those values are still rescaled or kept, whatever happens.

    As the comments state, log2 was just a second guess that worked, so the
    interface is not designed just for on/off but for choosing a scaling
    type. Currently this allows none, log and linear, but more importantly
    it allows us to keep the interface even if someone comes up with a
    better idea of how to scale the values.

    Signed-off-by: Christian Ehrhardt
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Christian Ehrhardt
     
  • Based on Peter Zijlstra's patch suggestion, this enables recalculation
    of the scheduler tunables in response to a change in the number of
    cpus. It also caps at eight the number of cpus considered in that
    scaling (a sketch of the factor computation follows this entry).

    Signed-off-by: Christian Ehrhardt
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Christian Ehrhardt
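
    The scaling itself is simple enough to show stand-alone (plain C; the
    enum and function names are illustrative assumptions, not the kernel's
    identifiers): the per-cpu normalized tunables get multiplied by a
    factor derived from the online cpu count, capped at eight.

      #include <stdio.h>

      enum scaling_mode { SCALE_NONE, SCALE_LOG, SCALE_LINEAR };

      static unsigned int ilog2_u(unsigned int x)
      {
              unsigned int r = 0;

              while (x >>= 1)
                      r++;
              return r;
      }

      /* Factor applied to the one-cpu normalized tunables. */
      static unsigned int tunable_factor(enum scaling_mode mode, unsigned int ncpus)
      {
              unsigned int cpus = ncpus < 8 ? ncpus : 8;   /* cap at eight cpus */

              switch (mode) {
              case SCALE_NONE:
                      return 1;
              case SCALE_LINEAR:
                      return cpus;
              case SCALE_LOG:
              default:
                      return 1 + ilog2_u(cpus);
              }
      }

      int main(void)
      {
              unsigned int n;

              for (n = 1; n <= 16; n *= 2)
                      printf("cpus=%2u  none=%u  log=%u  linear=%u\n", n,
                             tunable_factor(SCALE_NONE, n),
                             tunable_factor(SCALE_LOG, n),
                             tunable_factor(SCALE_LINEAR, n));
              return 0;
      }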
     
  • As Nick pointed out, and as I realized myself when doing
      sched: Fix balance vs hotplug race
    the patch
      sched: for_each_domain() vs RCU
    is wrong: sched_domains are freed after synchronize_sched(), which
    means disabling preemption is enough.

    Reported-by: Nick Piggin
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • WAKEUP_RUNNING was an experiment, not sure why that ever ended up being
    merged...

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Streamline the wakeup preemption code a bit, unifying the preemption
    paths so that they all do the same thing.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • If a RT task is woken up while a non-RT task is running,
    check_preempt_wakeup() is called to check whether the new task can
    preempt the old task. The function returns quickly without going deeper
    because it is apparent that a RT task can always preempt a non-RT task.

    In this situation, check_preempt_wakeup() always calls update_curr() to
    update the vruntime of the currently running task. However, that call
    is unnecessary and redundant at that moment because (1) a non-RT task
    can always be preempted by a RT task regardless of its vruntime, and
    (2) update_curr() will be called shortly anyway when the context switch
    between the two occurs.

    By moving the update_curr() call within check_preempt_wakeup(), we can
    avoid this redundant call, slightly reducing the time taken to wake up
    RT tasks (see the sketch after this entry).

    Signed-off-by: Jupyung Lee
    [ Place update_curr() right before the wakeup_preempt_entity() call, which
    is the only thing that relies on the updated vruntime ]
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Jupyung Lee
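
    A condensed, hedged sketch of the reordering (not the real
    check_preempt_wakeup(), which has more early exits and uses
    sched_fair.c-internal helpers; the point is only where update_curr()
    ends up):

      static void check_preempt_wakeup_sketch(struct rq *rq, struct task_struct *p)
      {
              struct task_struct *curr = rq->curr;
              struct sched_entity *se = &curr->se, *pse = &p->se;

              if (unlikely(rt_prio(p->prio))) {
                      /* RT always preempts CFS; curr's vruntime does not
                       * matter here and gets updated at the switch anyway. */
                      resched_task(curr);
                      return;
              }

              /* ... other early-exit checks elided ... */

              update_curr(cfs_rq_of(se));     /* moved: only needed just below */
              if (wakeup_preempt_entity(se, pse) == 1)
                      resched_task(curr);
      }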
     
  • Currently we try to do task placement in wake_up_new_task() after we do
    the load-balance pass in sched_fork(). This yields complicated
    semantics in that we have to deal with tasks on different RQs and with
    the set_task_cpu() calls in copy_process() and sched_fork().

    Rename ->task_new() to ->task_fork() and call it from sched_fork()
    before the balancing, this gives the policy a clear point to place the
    task.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • sched_rr_get_param calls
    task->sched_class->get_rr_interval(task) without protection
    against a concurrent sched_setscheduler() call which modifies
    task->sched_class.

    Serialize the access with task_rq_lock(task) and hand the rq
    pointer into get_rr_interval(), as it's needed at least in the
    sched_fair implementation (see the sketch after this entry).

    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
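
    The serialized access then follows the usual pattern sketched below (a
    simplified, hedged version of the fix, using the scheduler-internal
    task_rq_lock() helpers; the wrapper name is made up):

      /* Sketch: sched_rr_get_interval()-style caller after the fix. */
      static unsigned int rr_interval_of(struct task_struct *p)
      {
              struct rq *rq;
              unsigned long flags;
              unsigned int slice;

              rq = task_rq_lock(p, &flags);   /* pins p->sched_class against
                                                 a concurrent setscheduler() */
              slice = p->sched_class->get_rr_interval(rq, p);
              task_rq_unlock(rq, &flags);

              return slice;
      }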
     

26 Nov, 2009

1 commit


24 Nov, 2009

1 commit

  • Branch hint profiling on my Nehalem machine showed this branch hint to
    be incorrect 90% of the time (a generic illustration of
    likely()/unlikely() follows this entry):

    correct    incorrect   %   Function             File          Line
    15728471   158903754   90  pick_next_task_fair  sched_fair.c  1555

    Signed-off-by: Tim Blechmann
    Cc: Peter Zijlstra
    Cc: Mike Galbraith
    Cc: Paul Mackerras
    Cc: Arnaldo Carvalho de Melo
    Cc: Frederic Weisbecker
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Tim Blechmann
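
    For reference, the kernel's hints are thin wrappers around
    __builtin_expect(); a stand-alone toy (generic illustration, not the
    scheduler's code) shows what a hint that is wrong 90% of the time looks
    like:

      #include <stdio.h>

      #define likely(x)   __builtin_expect(!!(x), 1)
      #define unlikely(x) __builtin_expect(!!(x), 0)

      /* If the condition is actually true ~90% of the time, the unlikely()
       * annotation steers the compiler's block layout the wrong way.      */
      static int pick_something(int nr_running)
      {
              if (unlikely(nr_running > 0))
                      return 1;
              return 0;
      }

      int main(void)
      {
              int i, hits = 0;

              for (i = 0; i < 100; i++)
                      hits += pick_something(i % 10);  /* nonzero 90% of the time */

              printf("branch taken %d/100 times despite the 'unlikely' hint\n", hits);
              return 0;
      }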
     

13 Nov, 2009

2 commits


05 Nov, 2009

2 commits

  • Ingo Molnar reported:

    [ 26.804000] BUG: using smp_processor_id() in preemptible [00000000] code: events/1/10
    [ 26.808000] caller is vmstat_update+0x26/0x70
    [ 26.812000] Pid: 10, comm: events/1 Not tainted 2.6.32-rc5 #6887
    [ 26.816000] Call Trace:
    [ 26.820000] [] ? printk+0x28/0x3c
    [ 26.824000] [] debug_smp_processor_id+0xf0/0x110
    [ 26.824000] mount used greatest stack depth: 1464 bytes left
    [ 26.828000] [] vmstat_update+0x26/0x70
    [ 26.832000] [] worker_thread+0x188/0x310
    [ 26.836000] [] ? worker_thread+0x127/0x310
    [ 26.840000] [] ? autoremove_wake_function+0x0/0x60
    [ 26.844000] [] ? worker_thread+0x0/0x310
    [ 26.848000] [] kthread+0x7c/0x90
    [ 26.852000] [] ? kthread+0x0/0x90
    [ 26.856000] [] kernel_thread_helper+0x7/0x10
    [ 26.860000] BUG: using smp_processor_id() in preemptible [00000000] code: events/1/10
    [ 26.864000] caller is vmstat_update+0x3c/0x70

    Because this commit:

    a1f84a3: sched: Check for an idle shared cache in select_task_rq_fair()

    broke ->cpus_allowed.

    Signed-off-by: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: arjan@infradead.org
    Cc:
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • When waking affine, check for an idle shared cache, and if
    found, wake to that CPU/sibling instead of the waker's CPU.

    This improves pgsql+oltp ramp up by roughly 8%. Possibly more
    for other loads, depending on overlap. The trade-off is a
    roughly 1% peak downturn if tasks are truly synchronous.

    Signed-off-by: Mike Galbraith
    Cc: Arjan van de Ven
    Cc: Peter Zijlstra
    Cc:
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     

24 Oct, 2009

1 commit

  • This patch restores the effectiveness of LAST_BUDDY in preventing
    pgsql+oltp from collapsing due to wakeup preemption. It also
    switches LAST_BUDDY to exclusively do what it does best, namely
    mitigate the effects of aggressive wakeup preemption, which
    improves vmark throughput markedly, and restores mysql+oltp
    scalability.

    Since buddies are about scalability, enable them beginning at the
    point where we begin expanding sched_latency, namely
    sched_nr_latency. Previously, buddies were cleared aggressively,
    which seriously reduced their effectiveness. Not clearing
    aggressively however, produces a small drop in mysql+oltp
    throughput immediately after peak, indicating that LAST_BUDDY is
    actually doing some harm. This is right at the point where X on the
    desktop in competition with another load wants low latency service.
    Ergo, do not enable until we need to scale.

    To mitigate latency induced by buddies, or by a task just missing
    wakeup preemption, check latency at tick time.

    Last hunk prevents buddies from stymieing BALANCE_NEWIDLE via
    CACHE_HOT_BUDDY.

    Supporting performance tests:

    tip = v2.6.32-rc5-1497-ga525b32
    tipx = NO_GENTLE_FAIR_SLEEPERS NEXT_BUDDY granularity knobs = 31 knobs + 31 buddies
    tip+x = NO_GENTLE_FAIR_SLEEPERS granularity knobs = 31 knobs

    (Three run averages except where noted.)

    vmark:
    ------
    tip 108466 messages per second
    tip+ 125307 messages per second
    tip+x 125335 messages per second
    tipx 117781 messages per second
    2.6.31.3 122729 messages per second

    mysql+oltp:
    -----------
    clients          1        2        4        8       16       32       64      128      256
    ------------------------------------------------------------------------------------------
    tip        9949.89 18690.20 34801.24 34460.04 32682.88 30765.97 28305.27 25059.64 19548.08
    tip+      10013.90 18526.84 34900.38 34420.14 33069.83 32083.40 30578.30 28010.71 25605.47
    tipx       9698.71 18002.70 34477.56 33420.01 32634.30 31657.27 29932.67 26827.52 21487.18
    2.6.31.3   8243.11 18784.20 34404.83 33148.38 31900.32 31161.90 29663.81 25995.94 18058.86

    pgsql+oltp:
    -----------
    clients           1        2        4        8       16       32       64      128      256
    -------------------------------------------------------------------------------------------
    tip        13686.37 26609.25 51934.28 51347.81 49479.51 45312.65 36691.91 26851.57 24145.35
    tip+ (1x)  13907.85 27135.87 52951.98 52514.04 51742.52 50705.43 49947.97 48374.19 46227.94
    tip+x      13906.78 27065.81 52951.19 52542.59 52176.11 51815.94 50838.90 49439.46 46891.00
    tipx       13742.46 26769.81 52351.99 51891.73 51320.79 50938.98 50248.65 48908.70 46553.84
    2.6.31.3   13815.35 26906.46 52683.34 52061.31 51937.10 51376.80 50474.28 49394.47 47003.25

    Signed-off-by: Mike Galbraith
    Cc: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     

14 Oct, 2009

1 commit

  • Yanmin reported a hackbench regression due to:

    > commit de69a80be32445b0a71e8e3b757e584d7beb90f7
    > Author: Peter Zijlstra
    > Date: Thu Sep 17 09:01:20 2009 +0200
    >
    > sched: Stop buddies from hogging the system

    I really liked de69a80b, and it affecting hackbench shows I wasn't
    crazy ;-)

    So hackbench is a multi-cast, with one sender spraying multiple
    receivers, who in their turn don't spray back.

    This would be exactly the scenario that patch 'cures'. Previously
    we would not clear the last buddy after running the next task,
    allowing the sender to get back to work sooner than it otherwise
    ought to have been, increasing latencies for other tasks.

    Now, since those receivers don't poke back, they don't enforce the
    buddy relation, which means there's nothing to re-elect the sender.

    Cure this by clearing the buddy stats less aggressively: only clear
    buddies when they were not chosen. It should still avoid a buddy
    sticking around long after it has served its time.

    Reported-by: "Zhang, Yanmin"
    Signed-off-by: Peter Zijlstra
    CC: Mike Galbraith
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

24 Sep, 2009

1 commit

  • It's unused.

    It isn't needed -- the read or write flag is already passed, and sysctl
    shouldn't care about the rest.

    It _was_ used in two places in arch/frv for some reason.

    Signed-off-by: Alexey Dobriyan
    Cc: David Howells
    Cc: "Eric W. Biederman"
    Cc: Al Viro
    Cc: Ralf Baechle
    Cc: Martin Schwidefsky
    Cc: Ingo Molnar
    Cc: "David S. Miller"
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

22 Sep, 2009

1 commit


21 Sep, 2009

1 commit