24 Jul, 2008

1 commit

  • * 'sched/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    sched: hrtick_enabled() should use cpu_active()
    sched, x86: clean up hrtick implementation
    sched: fix build error, provide partition_sched_domains() unconditionally
    sched: fix warning in inc_rt_tasks() to not declare variable 'rq' if it's not needed
    cpu hotplug: Make cpu_active_map synchronization dependency clear
    cpu hotplug, sched: Introduce cpu_active_map and redo sched domain management (take 2)
    sched: rework of "prioritize non-migratable tasks over migratable ones"
    sched: reduce stack size in isolated_cpu_setup()
    Revert parts of "ftrace: do not trace scheduler functions"

    Fixed up conflicts in include/asm-x86/thread_info.h (due to the
    TIF_SINGLESTEP unification vs TIF_HRTICK_RESCHED removal) and
    kernel/sched_fair.c (due to cpu_active_map vs for_each_cpu_mask_nr()
    introduction).

    Linus Torvalds
     

20 Jul, 2008

2 commits

  • Ingo Molnar
     
  • random uvesafb failures were reported against Gentoo:

    http://bugs.gentoo.org/show_bug.cgi?id=222799

    and Mihai Moldovan bisected it back to:

    > 8f4d37ec073c17e2d4aa8851df5837d798606d6f is first bad commit
    > commit 8f4d37ec073c17e2d4aa8851df5837d798606d6f
    > Author: Peter Zijlstra
    > Date: Fri Jan 25 21:08:29 2008 +0100
    >
    > sched: high-res preemption tick

    Linus suspected it to be hrtick + vm86 interaction and observed:

    > Btw, Peter, Ingo: I think that commit is doing bad things. They aren't
    > _incorrect_ per se, but they are definitely bad.
    >
    > Why?
    >
    > Using random _TIF_WORK_MASK flags is really impolite for doing
    > "scheduling" work. There's a reason that arch/x86/kernel/entry_32.S
    > special-cases the _TIF_NEED_RESCHED flag: we don't want to exit out of
    > vm86 mode unnecessarily.
    >
    > See the "work_notifysig_v86" label, and how it does that
    > "save_v86_state()" thing etc etc.

    Right, I never liked having to fiddle with those TIF flags. Initially I
    needed it because the hrtimer base lock could not nest in the rq lock.
    That however is fixed these days.

    Currently the only reason left to fiddle with the TIF flags is remote
    wakeups. We cannot program a remote cpu's hrtimer. I've been thinking
    about using the new and improved IPI function call stuff to implement
    hrtimer_start_on().

    However that does require that smp_call_function_single(.wait=0) works
    from interrupt context - /me looks at the latest series from Jens - Yes
    that does seem to be supported, good.

    Here's a stab at cleaning this stuff up ...

    Mihai reported test success as well.

    Signed-off-by: Peter Zijlstra
    Tested-by: Mihai Moldovan
    Cc: Michal Januszewski
    Cc: Antonino Daplas
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

18 Jul, 2008

1 commit

  • This is based on Linus' idea of creating cpu_active_map that prevents
    scheduler load balancer from migrating tasks to the cpu that is going
    down.

    It allows us to simplify domain management code and avoid unnecessary
    domain rebuilds during cpu hotplug event handling.

    Please ignore the cpusets part for now. It needs some more work in order
    to avoid crazy lock nesting. Although I did simplify and unify the domain
    reinitialization logic. We now simply call partition_sched_domains() in
    all the cases. This means that we're using the exact same code paths as in
    the cpusets case, and hence the tests below cover cpusets too.
    Cpuset changes to make rebuild_sched_domains() callable from various
    contexts are in a separate patch (right after this one).

    This not only boots but also easily handles

        while true; do make clean; make -j 8; done

    and

        while true; do on-off-cpu 1; done

    at the same time.
    (on-off-cpu 1 simply does the echo 0/1 > /sys/.../cpu1/online thing).

    Surprisingly the box (dual-core Core2) is quite usable. In fact I'm typing
    this right now in gnome-terminal and things are moving along just fine.

    Also, this is running with most of the debug features enabled (lockdep,
    mutex debugging, etc.) - no BUG_ONs or lockdep complaints so far.

    I believe I addressed all of Dmitry's comments on Linus' original
    version. I changed both the fair and rt balancers to mask out non-active
    cpus, and replaced cpu_is_offline() with !cpu_active() in the main
    scheduler code where it made sense (to me).

    Signed-off-by: Max Krasnyanskiy
    Acked-by: Linus Torvalds
    Acked-by: Peter Zijlstra
    Acked-by: Gregory Haskins
    Cc: dmitry.adamushko@gmail.com
    Cc: pj@sgi.com
    Signed-off-by: Ingo Molnar

    Max Krasnyansky
     


04 Jul, 2008

1 commit

    We have the notion of tracking process-coupling (a.k.a. buddy-wake) via
    the p->se.last_wake / p->se.avg_overlap facilities, but it is only used
    for cfs-to-cfs interactions. There is no reason why an rt-to-cfs
    interaction cannot share in establishing a relationship in a similar
    manner.

    Because PREEMPT_RT runs many kernel threads at FIFO priority, we often
    have heavy interaction between RT threads waking CFS applications. This
    patch offers a substantial boost (50-60%+) in performance under those
    circumstances.

    Signed-off-by: Gregory Haskins
    Cc: npiggin@suse.de
    Cc: rostedt@goodmis.org
    Acked-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Gregory Haskins
     

27 Jun, 2008

18 commits

  • Signed-off-by: Dhaval Giani
    Cc: Srivatsa Vaddagiri
    Cc: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Dhaval Giani
     
  • Measurement shows that the difference between cgroup:/ and cgroup:/foo
    wake_affine() results is that the latter succeeds significantly more.

    Therefore bias the calculations towards failing the test.

    Signed-off-by: Peter Zijlstra
    Cc: Srivatsa Vaddagiri
    Cc: Mike Galbraith
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Increase the accuracy of the effective_load values.

    Not only consider the current increment (as per the attempted wakeup), but
    also consider the delta between when we last adjusted the shares and the
    current situation.

    Signed-off-by: Peter Zijlstra
    Cc: Srivatsa Vaddagiri
    Cc: Mike Galbraith
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • rw_i = {2, 4, 1, 0}
    s_i = {2/7, 4/7, 1/7, 0}

    wakeup on cpu0, weight=1

    rw'_i = {3, 4, 1, 0}
    s'_i = {3/8, 4/8, 1/8, 0}

    s_0 = S * rw_0 / \Sum rw_j ->
    \Sum rw_j = S*rw_0/s_0 = 1*2*7/2 = 7 (correct)

    s'_0 = S * (rw_0 + 1) / (\Sum rw_j + 1) =
    1 * (2+1) / (7+1) = 3/8 (correct)

    So we find that adding 1 to cpu0 gains 5/56 in weight.
    If, say, the other cpu were cpu1, we'd also have to calculate its 4/56 loss.
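    The arithmetic above can be checked mechanically; a small sketch using exact
    fractions (S = 1 is the group weight, as in the example):

```python
from fractions import Fraction

S = 1                     # group weight
rw = [2, 4, 1, 0]         # per-cpu runqueue weights before the wakeup
total = sum(rw)           # \Sum rw_j = 7

# shares before: s_i = S * rw_i / \Sum rw_j
s = [Fraction(S * w, total) for w in rw]

# wakeup on cpu0 adds weight 1 to rw_0 and to the total
rw_new = [3, 4, 1, 0]
s_new = [Fraction(S * w, total + 1) for w in rw_new]

gain_cpu0 = s_new[0] - s[0]   # 3/8 - 2/7 = 5/56
loss_cpu1 = s[1] - s_new[1]   # 4/7 - 4/8 = 4/56
print(gain_cpu0, loss_cpu1)
```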

    Signed-off-by: Peter Zijlstra
    Cc: Srivatsa Vaddagiri
    Cc: Mike Galbraith
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
    It was observed that these mults can overflow.

    Signed-off-by: Srivatsa Vaddagiri
    Signed-off-by: Peter Zijlstra
    Cc: Srivatsa Vaddagiri
    Cc: Mike Galbraith
    Signed-off-by: Ingo Molnar

    Srivatsa Vaddagiri
     
  • s_i = S * rw_i / \Sum_j rw_j

    -> \Sum_j rw_j = S * rw_i / s_i

    -> s'_i = S * (rw_i + w) / (\Sum_j rw_j + w)

    delta s = s' - s = S * (rw + w) / ((S * rw / s) + w) - s
    = s * (S * (rw + w) / (S * rw + s * w) - 1)

    a = S*(rw+w), b = S*rw + s*w

    delta s = s * (a-b) / b

    IOW, trade one divide for two multiplies
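    A quick numeric check of the identity, with sample values assumed (any
    positive weights work):

```python
from fractions import Fraction

S  = Fraction(1)     # group weight
rw = Fraction(4)     # this cpu's runqueue weight
s  = Fraction(4, 7)  # current share, so \Sum_j rw_j = S*rw/s = 7
w  = Fraction(1)     # weight being added

# direct form, with the inner divide
sum_rw = S * rw / s
delta_direct = S * (rw + w) / (sum_rw + w) - s

# rewritten form: delta s = s * (a - b) / b
a = S * (rw + w)
b = S * rw + s * w
delta_fast = s * (a - b) / b

print(delta_direct == delta_fast)  # True
```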

    Signed-off-by: Peter Zijlstra
    Cc: Srivatsa Vaddagiri
    Cc: Mike Galbraith
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
    Currently task_h_load() computes the load of a task and uses that to
    either subtract it from the total or add it to it.

    However, removing or adding a task need not have any effect on the total load
    at all. Imagine adding a task to a group that is local to one cpu - in that
    case the total load of that cpu is unaffected.

    So properly compute addition/removal:

    s_i = S * rw_i / \Sum_j rw_j
    s'_i = S * (rw_i + wl) / (\Sum_j rw_j + wg)

    then s'_i - s_i gives the change in load.

    Where s_i is the shares for cpu i, S the group weight, rw_i the runqueue weight
    for that cpu, wl the weight we add (subtract) and wg the weight contribution to
    the runqueue.
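    As a sketch of the computation above (hypothetical helper name, exact
    fractions), the delta for an addition or a removal comes out as:

```python
from fractions import Fraction

def load_delta(S, rw_i, sum_rw, wl, wg):
    # s_i  = S * rw_i / \Sum_j rw_j
    # s'_i = S * (rw_i + wl) / (\Sum_j rw_j + wg)
    s_i     = Fraction(S) * rw_i / sum_rw
    s_i_new = Fraction(S) * (rw_i + wl) / (sum_rw + wg)
    return s_i_new - s_i

# adding weight 1 on a cpu with rw_i = 2 out of a total of 7
print(load_delta(S=1, rw_i=2, sum_rw=7, wl=1, wg=1))    # 5/56

# removal is just negative weights
print(load_delta(S=1, rw_i=3, sum_rw=8, wl=-1, wg=-1))  # -5/56
```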

    Signed-off-by: Peter Zijlstra
    Cc: Srivatsa Vaddagiri
    Cc: Mike Galbraith
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
    Doing the load balance will change cfs_rq->load.weight (that's the whole
    point), but since that's part of the scale factor, we'll scale back with a
    different amount.

    Weight getting smaller would result in an inflated moved_load which causes
    it to stop balancing too soon.

    Signed-off-by: Peter Zijlstra
    Cc: Srivatsa Vaddagiri
    Cc: Mike Galbraith
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • With hierarchical grouping we can't just compare task weight to rq weight - we
    need to scale the weight appropriately.

    Signed-off-by: Peter Zijlstra
    Cc: Srivatsa Vaddagiri
    Cc: Mike Galbraith
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • While thinking about the previous patch - I realized that using per domain
    aggregate load values in load_balance_fair() is wrong. We should use the
    load value for that CPU.

    By not needing per domain hierarchical load values we don't need to store
    per domain aggregate shares, which greatly simplifies all the math.

    It basically falls apart into two separate computations:
    - per domain update of the shares
    - per CPU update of the hierarchical load

    Also get rid of the move_group_shares() stuff - just re-compute the shares
    again after a successful load balance.

    Signed-off-by: Peter Zijlstra
    Cc: Srivatsa Vaddagiri
    Cc: Mike Galbraith
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • We only need to know the task_weight of the busiest rq - nothing to do
    if there are no tasks there.

    Signed-off-by: Peter Zijlstra
    Cc: Srivatsa Vaddagiri
    Cc: Mike Galbraith
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The idea was to balance groups until we've reached the global goal, however
    Vatsa rightly pointed out that we might never reach that goal this way -
    hence take out this logic.

    [ the initial rationale for this 'feature' was to promote max concurrency
    within a group - it does not however affect fairness ]

    Reported-by: Srivatsa Vaddagiri
    Signed-off-by: Peter Zijlstra
    Cc: Srivatsa Vaddagiri
    Cc: Mike Galbraith
    Signed-off-by: Ingo Molnar

    Srivatsa Vaddagiri
     
  • Keeping the aggregate on the first cpu of the sched domain has two problems:
    - it could collide between different sched domains on different cpus
    - it could slow things down because of the remote accesses

    Signed-off-by: Peter Zijlstra
    Cc: Srivatsa Vaddagiri
    Cc: Mike Galbraith
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Uncouple buddy selection from wakeup granularity.

    The initial idea was that buddies could run ahead as far as a normal task
    can - do this by measuring a pair 'slice' just as we do for a normal task.

    This means we can drop the wakeup_granularity back to 5ms.

    Signed-off-by: Peter Zijlstra
    Cc: Srivatsa Vaddagiri
    Cc: Mike Galbraith
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Try again..

    Initial commit: 18d95a2832c1392a2d63227a7a6d433cb9f2037e
    Revert: 6363ca57c76b7b83639ca8c83fc285fa26a7880e

    Signed-off-by: Peter Zijlstra
    Cc: Srivatsa Vaddagiri
    Cc: Mike Galbraith
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
    Ok, so why are we in this mess? It was:

        1/w

    but now we've mixed rw into the mix, like:

        rw/w

    rw being \Sum w suggests that when fiddling with w, we should also fiddle
    with rw. Hmm?

    Signed-off-by: Peter Zijlstra
    Cc: Srivatsa Vaddagiri
    Cc: Mike Galbraith
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • calc_delta_asym() is supposed to do the same as calc_delta_fair() except
    linearly shrink the result for negative nice processes - this causes them
    to have a smaller preemption threshold so that they are more easily preempted.

    The problem is that for task groups se->load.weight is the per cpu share of
    the actual task group weight; take that into account.

    Also provide a debug switch to disable the asymmetry (which I still don't
    like - but it does greatly benefit some workloads).

    This would explain the interactivity issues reported against group scheduling.

    Signed-off-by: Peter Zijlstra
    Cc: Srivatsa Vaddagiri
    Cc: Mike Galbraith
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Try again..

    initial commit: 8f1bc385cfbab474db6c27b5af1e439614f3025c
    revert: f9305d4a0968201b2818dbed0dc8cb0d4ee7aeb3

    Signed-off-by: Peter Zijlstra
    Cc: Srivatsa Vaddagiri
    Cc: Mike Galbraith
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     


29 May, 2008

3 commits

    Prevent short-running wakers of short-running threads from overloading a
    single cpu via wakeup affinity, and wire up the disconnected debug option.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • Yanmin Zhang reported:

    Comparing with 2.6.25, volanoMark has a big regression with kernel
    2.6.26-rc1. It's about 50% on my 8-core stoakley, 16-core tigerton, and
    Itanium Montecito.

    With bisect, I located the following patch:

    | 18d95a2832c1392a2d63227a7a6d433cb9f2037e is first bad commit
    | commit 18d95a2832c1392a2d63227a7a6d433cb9f2037e
    | Author: Peter Zijlstra
    | Date: Sat Apr 19 19:45:00 2008 +0200
    |
    | sched: fair-group: SMP-nice for group scheduling

    Revert it so that we get v2.6.25 behavior.

    Bisected-by: Yanmin Zhang
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Yanmin Zhang reported:

    Comparing with kernel 2.6.25, sysbench+mysql(oltp, readonly) has many
    regressions with 2.6.26-rc1:

    1) 8-core stoakley: 28%;
    2) 16-core tigerton: 20%;
    3) Itanium Montvale: 50%.

    Bisect located this patch:

    | 8f1bc385cfbab474db6c27b5af1e439614f3025c is first bad commit
    | commit 8f1bc385cfbab474db6c27b5af1e439614f3025c
    | Author: Peter Zijlstra
    | Date: Sat Apr 19 19:45:00 2008 +0200
    |
    | sched: fair: weight calculations

    Revert it to the 2.6.25 state.

    Bisected-by: Yanmin Zhang
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     


08 May, 2008

1 commit

    The conversion between virtual and real time is as follows:

        dvt = rw/w * dt
        dt  = w/rw * dvt

    Since we want the fair sleeper granularity to be in real time, we actually
    need to do:

        dvt = -rw/w * l

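    The conversion can be sanity-checked with exact arithmetic (sample weights
    assumed):

```python
from fractions import Fraction

rw = Fraction(7)   # runqueue weight
w  = Fraction(2)   # entity weight
l  = Fraction(20)  # fair-sleeper latency, in real time

# dvt = rw/w * dt and dt = w/rw * dvt are exact inverses
dt  = Fraction(5)
dvt = rw / w * dt
print(w / rw * dvt)  # 5, the original dt

# placing a sleeper l real-time units back means subtracting in virtual time
dvt_sleeper = -rw / w * l
print(w / rw * dvt_sleeper)  # -20: exactly l real-time units back
```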
    This bug could be related to the regression reported by Yanmin Zhang:

    | Comparing with kernel 2.6.25, sysbench+mysql(oltp, readonly) has lots
    | of regressions with 2.6.26-rc1:
    |
    | 1) 8-core stoakley: 28%;
    | 2) 16-core tigerton: 20%;
    | 3) Itanium Montvale: 50%.

    Reported-by: "Zhang, Yanmin"
    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     

06 May, 2008

5 commits

  • This replaces the rq->clock stuff (and possibly cpu_clock()).

    - architectures that have an 'imperfect' hardware clock can set
    CONFIG_HAVE_UNSTABLE_SCHED_CLOCK

    - the 'jiffie' window might be superfluous when we update tick_gtod
    before the __update_sched_clock() call in sched_clock_tick()

    - cpu_clock() might be implemented as:

    sched_clock_cpu(smp_processor_id())

    if the accuracy proves good enough - how far can TSC drift in a
    single jiffie when considering the filtering and idle hooks?

    [ mingo@elte.hu: various fixes and cleanups ]

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Revert debugging commit 7ba2e74ab5a0518bc953042952dd165724bc70c9.
    print_cfs_rq_tasks() can induce live-lock if a task is dequeued
    during list traversal.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
    We currently use an optimization to skip the overhead of wake-idle
    processing if more than one task is assigned to a run-queue. The
    assumption is that the system must already be load-balanced or we
    wouldn't be overloaded to begin with.

    The problem is that we are looking at rq->nr_running, which may include
    RT tasks in addition to CFS tasks. Since the presence of RT tasks
    really has no bearing on the balance status of CFS tasks, this throws
    the calculation off.

    This patch changes the logic to consider only the number of CFS tasks
    when making the decision to optimize wake-idle.
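    Schematically (hypothetical field names, not the kernel's actual rq
    layout), the changed condition looks like:

```python
from dataclasses import dataclass

@dataclass
class Rq:
    nr_running: int      # all tasks, RT included
    cfs_nr_running: int  # CFS tasks only

def wake_idle_worthwhile(rq):
    # the old check used rq.nr_running, so RT tasks could defeat it;
    # only CFS tasks say anything about CFS balance
    return rq.cfs_nr_running <= 1

# busy with RT tasks but only one CFS task: still worth trying wake-idle
print(wake_idle_worthwhile(Rq(nr_running=3, cfs_nr_running=1)))  # True
print(wake_idle_worthwhile(Rq(nr_running=3, cfs_nr_running=3)))  # False
```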

    Signed-off-by: Gregory Haskins
    CC: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Gregory Haskins
     
  • Noticed by sparse:
    kernel/sched.c:760:20: warning: symbol 'sched_feat_names' was not declared. Should it be static?
    kernel/sched.c:767:5: warning: symbol 'sched_feat_open' was not declared. Should it be static?
    kernel/sched_fair.c:845:3: warning: returning void-valued expression
    kernel/sched.c:4386:3: warning: returning void-valued expression

    Signed-off-by: Harvey Harrison
    Signed-off-by: Ingo Molnar

    Harvey Harrison
     
  • Normalized sleeper uses calc_delta*(), which requires that the rq load is
    already updated, so move account_entity_enqueue() before place_entity().

    Tested-by: Frans Pop
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

20 Apr, 2008

4 commits

  • Print a tree of weights.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • In order to level the hierarchy, we need to calculate load based on the
    root view. That is, each task's load is in the same unit.

            A
           / \
          B   1
         / \
        2   3

    To compute 1's load we do:

         weight(1)
        ------------
        rq_weight(A)

    To compute 2's load we do:

         weight(2)     weight(B)
        ------------ * ------------
        rq_weight(B)   rq_weight(A)

    This yields load fractions in comparable units.
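    With assumed weights for the example tree (weight(1)=3, weight(B)=4,
    weight(2)=1, weight(3)=2), the root-view fractions are comparable and sum
    to 1:

```python
from fractions import Fraction

weight = {'1': Fraction(3), '2': Fraction(1), '3': Fraction(2), 'B': Fraction(4)}
rq_weight = {'A': weight['1'] + weight['B'],  # A's children: B and 1
             'B': weight['2'] + weight['3']}  # B's children: 2 and 3

load_1 = weight['1'] / rq_weight['A']
load_2 = weight['2'] / rq_weight['B'] * weight['B'] / rq_weight['A']
load_3 = weight['3'] / rq_weight['B'] * weight['B'] / rq_weight['A']

print(load_1 + load_2 + load_3)  # 1
```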

    The consequence is that it changes virtual time. We used to have:

                 time_{i}
    vtime_{i} = ------------
                 weight_{i}

    vtime = \Sum vtime_{i} = time / rq_weight.

    But with the new way of load calculation we get that vtime equals time.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • De-couple load-balancing from the rb-trees, so that I can change their
    organization.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Currently FAIR_GROUP sched grows the scheduler latency outside of
    sysctl_sched_latency; invert this so that it stays within.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra