23 Jun, 2010

1 commit

  • The task_group() function returns a pointer that must be protected
    by either RCU, the ->alloc_lock, or the cgroup lock (see the
    rcu_dereference_check() in task_subsys_state(), which is invoked by
    task_group()). The wake_affine() function currently does none of these,
    which means that a concurrent update would be within its rights to free
    the structure returned by task_group(). Because wake_affine() uses this
    structure only to compute load-balancing heuristics, there is no reason
    to acquire either of the two locks.

    Therefore, this commit introduces an RCU read-side critical section that
    starts before the first call to task_group() and ends after the last use
    of the "tg" pointer returned from task_group(). Thanks to Li Zefan for
    pointing out the need to extend the RCU read-side critical section from
    that proposed by the original patch.
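
    A minimal sketch of the resulting shape in wake_affine() (illustrative
    only, not the literal diff; the heuristic arithmetic is elided):

        struct task_group *tg;

        rcu_read_lock();                /* pins the structure returned by task_group() */
        tg = task_group(p);
        weight = p->se.load.weight;
        /* ... load-balancing heuristics that read tg ... */
        rcu_read_unlock();              /* no uses of tg after this point */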

    Signed-off-by: Daniel J Blueman
    Signed-off-by: Paul E. McKenney

    Daniel J Blueman
     

01 Jun, 2010

1 commit

  • Mike reports that since e9e9250b (sched: Scale down cpu_power due to RT
    tasks), wake_affine() goes funny on RT tasks because they still have a
    non-zero weight and wake_affine() still subtracts that weight from the rq weight.

    Since nobody should be using se->weight for RT tasks, set the value to
    zero. Also, since we now use ->cpu_power to normalize rq weights to
    account for RT cpu usage, add that factor into the imbalance computation.
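
    A hedged sketch of the weight half of the fix, assuming it lands in
    set_load_weight(); the imbalance half additionally scales both cpus'
    loads by their ->cpu_power:

        /* illustrative sketch: RT tasks get a zero fair-class weight */
        static void set_load_weight(struct task_struct *p)
        {
                if (task_has_rt_policy(p)) {
                        p->se.load.weight = 0;          /* never counted by wake_affine() */
                        p->se.load.inv_weight = WMULT_CONST;
                        return;
                }
                /* ... normal nice-to-weight mapping for fair tasks ... */
        }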

    Reported-by: Mike Galbraith
    Tested-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

07 May, 2010

1 commit

  • Currently migration_thread is serving three purposes - migration
    pusher, context to execute active_load_balance() and forced context
    switcher for expedited RCU synchronize_sched. All three roles are
    hardcoded into migration_thread() and determining which job is
    scheduled is slightly messy.

    This patch kills migration_thread and replaces all three uses with
    cpu_stop. The three different roles of migration_thread() are
    split into three separate cpu_stop callbacks -
    migration_cpu_stop(), active_load_balance_cpu_stop() and
    synchronize_sched_expedited_cpu_stop() - and each use case now simply
    asks cpu_stop to execute the callback as necessary.

    synchronize_sched_expedited() was implemented with private
    preallocated resources and custom multi-cpu queueing and waiting
    logic, both of which are provided by cpu_stop.
    synchronize_sched_expedited_count is made atomic and all other shared
    resources along with the mutex are dropped.

    synchronize_sched_expedited() also implemented a check to detect cases
    where not all the callbacks got executed on their assigned cpus, falling
    back to synchronize_sched(). If called with cpu hotplug blocked,
    cpu_stop already guarantees that the callbacks run on their assigned
    cpus and the condition cannot happen; otherwise, stop_machine() would
    break. However, this patch preserves the paranoid check using a cpumask
    to record on which cpus the stopper ran, so that it can serve as a
    bisection point if something actually goes wrong there.

    Because the internal execution state is no longer visible,
    rcu_expedited_torture_stats() is removed.

    This patch also renames cpu_stop threads from "stopper/%d" to
    "migration/%d". The names of these threads ultimately don't matter
    and there's no reason to make unnecessary userland visible changes.

    With this patch applied, stop_machine() and sched now share the same
    resources. stop_machine() is faster without wasting any resources and
    sched migration users are much cleaner.
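
    For illustration, the cpu_stop pattern the three users now follow looks
    roughly like this (the callback and argument struct names here are made
    up, not the real migration_cpu_stop()):

        #include <linux/stop_machine.h>

        struct my_stop_arg {                    /* hypothetical per-call argument */
                struct task_struct *task;
                int dest_cpu;
        };

        /* runs on the target cpu in the stopper context, preemption disabled */
        static int my_cpu_stop_fn(void *data)
        {
                struct my_stop_arg *arg = data;

                /* ... do the per-cpu work, e.g. push arg->task to arg->dest_cpu ... */
                return 0;
        }

        static void run_on_cpu_example(unsigned int cpu, struct my_stop_arg *arg)
        {
                /* queue the callback on @cpu and wait for it to complete */
                stop_one_cpu(cpu, my_cpu_stop_fn, arg);
        }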

    Signed-off-by: Tejun Heo
    Acked-by: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Dipankar Sarma
    Cc: Josh Triplett
    Cc: Paul E. McKenney
    Cc: Oleg Nesterov
    Cc: Dimitri Sivanich

    Tejun Heo
     

23 Apr, 2010

2 commits

  • Issues in the current select_idle_sibling() logic in select_task_rq_fair()
    in the context of a task wake-up:

    a) Once we select the idle sibling, we use that domain (spanning the cpu
    the task is currently being woken up on and the idle sibling we found) in
    our wake_affine() decisions. This domain is completely different from the
    domain we are supposed to use: the one spanning the cpu the task is being
    woken up on and the cpu where the task previously ran.

    b) We do the select_idle_sibling() check only for the cpu that the task is
    currently being woken up on. If select_task_rq_fair() selects the previously
    run cpu for waking the task, doing a select_idle_sibling() check for that
    cpu would also help, but we don't do this currently.

    c) In scenarios where the cpu the task is woken up on is busy but its HT
    siblings are idle, we select the idle HT sibling for the wakeup instead of
    the core where the task previously ran, which is currently completely idle.
    i.e., we are not basing the decision on wake_affine() but directly selecting
    an idle sibling, which can cause an imbalance at the SMT/MC level that has
    to be corrected later by the periodic load balancer.

    Fix this by first going through the load imbalance calculations using
    wake_affine(), and once we have decided between the woken-up cpu and the
    previously-ran cpu, then choose a possible idle sibling to wake the task on.
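
    Roughly, the reworked portion of select_task_rq_fair() then reads as
    follows (simplified; only the ordering of the two decisions matters):

        /* simplified sketch, not the literal code */
        if (affine_sd && wake_affine(affine_sd, p, sync))
                target = cpu;            /* affine wakeup: stay near the waker       */
        else
                target = prev_cpu;       /* otherwise prefer where the task last ran */

        /* only after that decision, look for an idle SMT/MC sibling of target */
        new_cpu = select_idle_sibling(p, target);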

    Signed-off-by: Suresh Siddha
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Suresh Siddha
     
  • Dave reported that his large SPARC machines spend lots of time in
    hweight64(), try and optimize some of those needless cpumask_weight()
    invocations (esp. with the large offstack cpumasks these are very
    expensive indeed).

    Reported-by: David Miller
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

03 Apr, 2010

3 commits

  • In order to reduce the dependency on TASK_WAKING rework the enqueue
    interface to support a proper flags field.

    Replace the int wakeup, bool head arguments with an int flags argument
    and create the following flags:

    ENQUEUE_WAKEUP - the enqueue is a wakeup of a sleeping task,
    ENQUEUE_WAKING - the enqueue has relative vruntime due to
    having sched_class::task_waking() called,
    ENQUEUE_HEAD - the waking task should be placed on the head
    of the priority queue (where appropriate).

    For symmetry also convert sched_class::dequeue() to a flags scheme.
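
    The resulting interface, roughly sketched (the flag values shown are
    illustrative and may not match the header exactly):

        #define ENQUEUE_WAKEUP  1   /* wakeup of a sleeping task                      */
        #define ENQUEUE_WAKING  2   /* vruntime is relative; task_waking() was called */
        #define ENQUEUE_HEAD    4   /* place the task at the head of the queue        */

        #define DEQUEUE_SLEEP   1   /* the dequeue-side counterpart                   */

        struct sched_class {
                /* ... */
                void (*enqueue_task)(struct rq *rq, struct task_struct *p, int flags);
                void (*dequeue_task)(struct rq *rq, struct task_struct *p, int flags);
                /* ... */
        };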

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Oleg noticed a few races with the TASK_WAKING usage on fork.

    - since TASK_WAKING is basically a spinlock, it should be IRQ safe
    - since we set TASK_WAKING (*) without holding rq->lock it could
    be that there still is an rq->lock holder, so we are not actually
    providing full serialization.

    (*) in fact we clear PF_STARTING, which in effect enables TASK_WAKING.

    Cure the second issue by not setting TASK_WAKING in sched_fork(), but
    only temporarily in wake_up_new_task() while calling select_task_rq().

    Cure the first by holding rq->lock around the select_task_rq() call,
    this will disable IRQs, this however requires that we push down the
    rq->lock release into select_task_rq_fair()'s cgroup stuff.

    Because select_task_rq_fair() still needs to drop the rq->lock we
    cannot fully get rid of TASK_WAKING.
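
    Schematically, the fork wakeup path then looks as below (a sketch only;
    the locking dance inside select_task_rq_fair()'s cgroup handling is
    elided):

        /* wake_up_new_task(), schematic sketch of the new ordering */
        rq = task_rq_lock(p, &flags);     /* disables IRQs, takes rq->lock           */
        p->state = TASK_WAKING;           /* held only while a runqueue is selected  */
        /* cpu = select_task_rq(...);        may drop and retake rq->lock internally */
        /* set_task_cpu(p, cpu); */
        p->state = TASK_RUNNING;
        task_rq_unlock(rq, &flags);
        /* ... the task is then activated on its new runqueue as usual ... */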

    Reported-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Merge reason: update to latest upstream

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

12 Mar, 2010

10 commits

  • Disabling affine wakeups is too horrible to contemplate. Remove the feature flag.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • This feature has been enabled for quite a while, after testing showed that
    easing preemption for light tasks was harmful to high-priority threads.

    Remove the feature flag.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • This feature never earned its keep, remove it.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • Our preemption model relies too heavily on sleeper fairness to disable it
    without dire consequences. Remove the feature, and save a branch or two.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • This feature hasn't been enabled in a long time, remove effectively dead code.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • Don't bother with selection when the current cpu is idle. Recent load
    balancing changes also make it no longer necessary to check wake_affine()
    success before returning the selected sibling, so we now always use it.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • Allow LAST_BUDDY to kick in sooner, improving cache utilization as soon as
    a second buddy pair arrives on the scene. The cost is that latency starts to
    climb sooner; the benefit for tbench 8 on my Q6600 box is ~2%. No detrimental
    effects noted in normal desktop usage.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • Now that we no longer depend on the clock being updated prior to enqueueing
    on migratory wakeup, we can clean up a bit, placing calls to update_rq_clock()
    exactly where they are needed, ie on enqueue, dequeue and schedule events.

    In the case of a freshly enqueued task immediately preempting, we can skip the
    update during preemption, as the clock was just updated by the enqueue event.
    We also save an unneeded call during a migratory wakeup by not updating the
    previous runqueue, where update_curr() won't be invoked.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • Both avg_overlap and avg_wakeup had an inherent problem in that their accuracy
    was detrimentally affected by cross-cpu wakeups, because we are missing the
    necessary call to update_curr(). This can't be fixed without increasing
    overhead in our already too fat fastpath.

    Additionally, with recent load balancing changes making us prefer to place tasks
    in an idle cache domain (which is good for compute-bound loads), communicating
    tasks suffer when a sync wakeup, which would enable affine placement, is turned
    into a non-sync wakeup by SYNC_LESS. With one task on the runqueue, wake_affine()
    rejects the affine wakeup request, leaving the unfortunate task where it was
    placed, taking frequent cache misses.

    Remove it, and recover some fastpath cycles.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • Testing the load which led to this heuristic (nfs4 kbuild) shows that it has
    outlived its usefulness. With intervening load balancing changes, I cannot
    see any difference with/without, so recover those fastpath cycles.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     

11 Mar, 2010

1 commit

  • Put all statistic fields of sched_entity in one struct, sched_statistics,
    and embed it into sched_entity.

    This change allows us to memset the sched_statistics to 0 when needed (for
    instance when forking), avoiding bugs caused by non-initialized fields.
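
    In outline (the field list is abbreviated; the real struct carries the
    full set of schedstat counters):

        struct sched_statistics {
        #ifdef CONFIG_SCHEDSTATS
                u64     wait_start;
                u64     wait_max;
                u64     sleep_start;
                u64     block_start;
                /* ... remaining schedstat fields ... */
        #endif
        };

        struct sched_entity {
                /* ... */
                struct sched_statistics statistics;   /* embedded, cleared with one memset */
                /* ... */
        };

        /* e.g. in the fork path: */
        memset(&p->se.statistics, 0, sizeof(p->se.statistics));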

    Signed-off-by: Lucas De Marchi
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Lucas De Marchi
     

01 Mar, 2010

1 commit

  • Make rcu_dereference() of runqueue data structures be
    rcu_dereference_sched().
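
    For example, an access to a runqueue's sched_domain pointer becomes
    (a sketch of the pattern, not the full diff):

        struct sched_domain *sd;

        /*
         * Runqueue data is protected by having preemption disabled
         * (sched-RCU), so use the variant whose lockdep check is
         * rcu_read_lock_sched_held() rather than rcu_read_lock_held().
         */
        sd = rcu_dereference_sched(cpu_rq(cpu)->sd);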

    Located-by: Ingo Molnar
    Signed-off-by: Paul E. McKenney
    Cc: laijs@cn.fujitsu.com
    Cc: dipankar@in.ibm.com
    Cc: mathieu.desnoyers@polymtl.ca
    Cc: josh@joshtriplett.org
    Cc: dvhltc@us.ibm.com
    Cc: niv@us.ibm.com
    Cc: peterz@infradead.org
    Cc: rostedt@goodmis.org
    Cc: Valdis.Kletnieks@vt.edu
    Cc: dhowells@redhat.com
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul E. McKenney
     

26 Feb, 2010

1 commit

  • On platforms like a dual-socket quad-core system, the scheduler load
    balancer is not detecting the load imbalances in certain scenarios. This
    leads to situations where one socket is completely busy (with all 4
    cores running 4 tasks) while another socket is left completely idle.
    This causes performance issues as those 4 tasks share the memory
    controller, last-level cache bandwidth etc. Also we won't be taking
    advantage of turbo mode as much as we would like, etc.

    Some of the comparisons in the scheduler load balancing code are
    comparing the "weighted cpu load that is scaled wrt sched_group's
    cpu_power" with the "weighted average load per task that is not scaled
    wrt sched_group's cpu_power". While this has probably been broken for a
    longer time (for multi-socket NUMA nodes etc), the problem got aggravated
    via this recent change:

    |
    | commit f93e65c186ab3c05ce2068733ca10e34fd00125e
    | Author: Peter Zijlstra
    | Date: Tue Sep 1 10:34:32 2009 +0200
    |
    | sched: Restore __cpu_power to a straight sum of power
    |

    Also with this change, the sched group cpu power alone no longer reflects
    the group capacity that is needed to implement MC, MT performance
    (default) and power-savings (user-selectable) policies.

    We need to use the computed group capacity (sgs.group_capacity, that is
    computed using the SD_PREFER_SIBLING logic in update_sd_lb_stats()) to
    find out if the group with the max load is above its capacity and how
    much load to move etc.

    Reported-by: Ma Ling
    Initial-Analysis-by: Zhang, Yanmin
    Signed-off-by: Suresh Siddha
    [ -v2: build fix ]
    Signed-off-by: Peter Zijlstra
    Cc: # [2.6.32.x, 2.6.33.x]
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Suresh Siddha
     

16 Feb, 2010

1 commit


08 Feb, 2010

1 commit


23 Jan, 2010

1 commit

  • The ability to enqueue a task at the head of a SCHED_FIFO priority
    list is required to fix some violations of POSIX scheduling policy.

    Extend the related functions with a "head" argument.
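
    At the bottom of the RT enqueue path this boils down to choosing which
    end of the priority list to insert on, roughly:

        struct list_head *queue = array->queue + rt_se_prio(rt_se);

        if (head)
                list_add(&rt_se->run_list, queue);      /* run next at this priority */
        else
                list_add_tail(&rt_se->run_list, queue); /* normal FIFO behaviour     */
        __set_bit(rt_se_prio(rt_se), array->bitmap);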

    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra
    Tested-by: Carsten Emde
    Tested-by: Mathias Weber
    LKML-Reference:

    Thomas Gleixner
     

21 Jan, 2010

11 commits

  • We want to update the sched_group_powers when balance_cpu == this_cpu.

    Currently the group powers are updated only if the balance_cpu is the
    first CPU in the local group. But balance_cpu = this_cpu could also be
    the first idle cpu in the group. Hence fix the place where the group
    powers are updated.

    Signed-off-by: Gautham R Shenoy
    Signed-off-by: Joel Schopp
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Gautham R Shenoy
     
  • Since all load_balance() callers will have !NULL balance parameters we
    can now assume so and remove a few checks.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The two functions: load_balance{,_newidle}() are very similar, with the
    following differences:

    - rq->lock usage
    - sd->balance_interval updates
    - *balance check

    So replace the load_balance_newidle() call with load_balance(.idle =
    CPU_NEWLY_IDLE), explicitly unlock the rq->lock before calling it (this
    would be done by double_lock_balance() anyway), and ignore the other
    differences for now.
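
    A sketch of the resulting call site in idle_balance() (interval
    bookkeeping and the pulled-task accounting are elided):

        raw_spin_unlock(&this_rq->lock);        /* load_balance() takes rq locks itself */

        for_each_domain(this_cpu, sd) {
                if (sd->flags & SD_BALANCE_NEWIDLE)
                        pulled_task = load_balance(this_cpu, this_rq, sd,
                                                   CPU_NEWLY_IDLE, &balance);
                /* ... balance_interval handling ... */
        }

        raw_spin_lock(&this_rq->lock);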

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • load_balance() and load_balance_newidle() look remarkably similar, but one
    key point where they differ is the condition for when to do active balancing.

    So split out that logic into a separate function.

    One side effect is that previously load_balance_newidle() used to fail
    and return -1 under these conditions, whereas now it doesn't. I've not
    yet fully figured out the whole -1 return case for either
    load_balance{,_newidle}().

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Since load-balancing can hold rq->locks for quite a long while, allow
    breaking out early when there is lock contention.
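
    In the task-pulling loop this amounts to a check along the following
    lines (the placement and exact condition in the real patch may differ):

        #ifdef CONFIG_PREEMPT
                /* someone is waiting for our rq->lock, or we should reschedule: bail out */
                if (raw_spin_is_contended(&this_rq->lock) || need_resched())
                        break;
        #endif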

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Move code around to get rid of fwd declarations.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Again, since we only iterate the fair class, remove the abstraction.

    Since this is the last user of the rq_iterator, remove all that too.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Since we only ever iterate the fair class, do away with this abstraction.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Take out the sched_class methods for load-balancing.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Straight fwd code movement.

    Since none of the load-balance abstractions are used anymore, do away with
    them and simplify the code some. In preparation, move the code around.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • SD_PREFER_SIBLING is set at the CPU domain level if power saving isn't
    enabled, leading to many cache misses on large machines as we traverse
    looking for an idle shared cache to wake to. Change the enabler of
    select_idle_sibling() to SD_SHARE_PKG_RESOURCES, and enable same at the
    sibling domain level.
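
    Sketch of the gate in select_idle_sibling() after the change:

        for_each_domain(target, sd) {
                /* only domains sharing package resources (an LLC) are worth scanning */
                if (!(sd->flags & SD_SHARE_PKG_RESOURCES))
                        break;
                /* ... look for an idle cpu in sd's span ... */
        }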

    Reported-by: Lin Ming
    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     

17 Jan, 2010

1 commit

  • kernel/sched: don't expose local functions

    The get_rr_interval_* functions are all class methods of
    struct sched_class. They are not exported so make them
    static.

    Signed-off-by: H Hartley Sweeten
    Cc: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    H Hartley Sweeten
     

17 Dec, 2009

2 commits

  • In order to remove the cfs_rq dependency from set_task_cpu() we
    need to ensure the task is cfs_rq invariant for all callsites.

    The simple approach is to subtract cfs_rq->min_vruntime from
    se->vruntime on dequeue, and add cfs_rq->min_vruntime on
    enqueue.

    However, this has the downside of breaking FAIR_SLEEPERS since
    we lose the old vruntime as we only maintain the relative
    position.

    To solve this, we observe that we only migrate runnable tasks;
    we do this using deactivate_task(.sleep=0) and
    activate_task(.wakeup=0), and therefore we can restrict the
    min_vruntime invariance to that state.

    The only other case is wakeup balancing, since we want to
    maintain the old vruntime we cannot make it relative on dequeue,
    but since we don't migrate inactive tasks, we can do so right
    before we activate it again.

    This is where we need the new pre-wakeup hook, we need to call
    this while still holding the old rq->lock. We could fold it into
    ->select_task_rq(), but since that has multiple callsites and
    would obfuscate the locking requirements, that seems like a
    fudge.

    This leaves the fork() case, simply make sure that ->task_fork()
    leaves the ->vruntime in a relative state.

    This covers all cases where set_task_cpu() gets called, and
    ensures it sees a relative vruntime.
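
    The core of the invariant, as a sketch (the real patch distinguishes the
    sleep/wakeup and migration cases described above):

        /* dequeue of a runnable (migrating) task: make vruntime cfs_rq-relative */
        se->vruntime -= cfs_rq->min_vruntime;

        /* enqueue on the new cpu's cfs_rq: make it absolute again */
        se->vruntime += cfs_rq->min_vruntime;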

    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • We should skip !SD_LOAD_BALANCE domains.
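
    i.e., roughly the following early-out in the domain walk:

        for_each_domain(cpu, tmp) {
                if (!(tmp->flags & SD_LOAD_BALANCE))
                        continue;       /* balancing disabled here, don't use it */
                /* ... existing domain selection logic ... */
        }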

    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    LKML-Reference:
    CC: stable@kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

15 Dec, 2009

1 commit


09 Dec, 2009

1 commit

  • The normalized values are also recalculated in case the scaling factor
    changes.

    This patch updates the internally used scheduler tuning values that are
    normalized to one cpu in case a user sets new values via sysfs.

    Together with patch 2 of this series, this allows user-configured values to
    scale (or not) with cpu add/remove events taking place later.

    Signed-off-by: Christian Ehrhardt
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    [ v2: fix warning ]
    Signed-off-by: Ingo Molnar

    Christian Ehrhardt