14 Apr, 2011

1 commit

  • Now that we've removed the rq->lock requirement from the first part of
    ttwu() and can compute placement without holding any rq->lock, ensure
    we execute the second half of ttwu() on the actual cpu we want the
    task to run on.

    This avoids having to take rq->lock and doing the task enqueue
    remotely, saving lots on cacheline transfers.

    As measured using: http://oss.oracle.com/~mason/sembench.c

    $ for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor ; do echo performance > $i; done
    $ echo 4096 32000 64 128 > /proc/sys/kernel/sem
    $ ./sembench -t 2048 -w 1900 -o 0

    unpatched: run time 30 seconds 647278 worker burns per second
    patched: run time 30 seconds 816715 worker burns per second
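
    A minimal user-space sketch of the locking difference (illustrative only;
    the types, field names and helpers below are simplified stand-ins for the
    kernel's, and the IPI is reduced to a comment):

    /* build: cc -std=c11 -O2 ttwu-sketch.c */
    #include <stdatomic.h>
    #include <stdio.h>

    struct task { struct task *next; int pid; };

    struct rq {
        atomic_flag lock;                  /* stands in for rq->lock            */
        struct task *queue;                /* tasks enqueued under that lock    */
        _Atomic(struct task *) wake_list;  /* lock-free list of pending wakeups */
    };

    /* Old path: the waking cpu takes the remote rq->lock and enqueues
     * directly, bouncing the lock and runqueue cachelines across cpus. */
    static void ttwu_enqueue_locked(struct rq *rq, struct task *p)
    {
        while (atomic_flag_test_and_set(&rq->lock))
            ;                              /* contended: remote cacheline traffic */
        p->next = rq->queue;
        rq->queue = p;
        atomic_flag_clear(&rq->lock);
    }

    /* New path: the waking cpu only pushes the task onto the target's wake
     * list (a single cmpxchg) and would then kick that cpu with an IPI. */
    static void ttwu_queue_remote_sketch(struct rq *rq, struct task *p)
    {
        struct task *old = atomic_load(&rq->wake_list);
        do {
            p->next = old;
        } while (!atomic_compare_exchange_weak(&rq->wake_list, &old, p));
        /* an IPI to the target cpu would go here */
    }

    /* Runs on the target cpu (from the IPI): splice the pending wakeups and
     * do the real enqueue, so rq->lock is only ever taken locally. */
    static void ttwu_pending_sketch(struct rq *rq)
    {
        struct task *p = atomic_exchange(&rq->wake_list, NULL);
        while (p) {
            struct task *next = p->next;
            ttwu_enqueue_locked(rq, p);
            p = next;
        }
    }

    int main(void)
    {
        struct rq rq = { .lock = ATOMIC_FLAG_INIT };
        struct task t = { .pid = 42 };

        ttwu_queue_remote_sketch(&rq, &t);
        ttwu_pending_sketch(&rq);
        printf("enqueued pid %d locally\n", rq.queue->pid);
        return 0;
    }

    The point is the second hunk: the waker never touches the remote rq->lock,
    so the expensive cacheline transfer happens on the target cpu, once per
    batch, instead of on the waker, once per wakeup.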

    Reviewed-by: Frank Rowand
    Cc: Mike Galbraith
    Cc: Nick Piggin
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Signed-off-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110405152729.515897185@chello.nl

    Peter Zijlstra
     

18 Nov, 2010

1 commit

  • By tracking a per-cpu load-avg for each cfs_rq and folding it into a
    global task_group load on each tick we can rework tg_shares_up to be
    strictly per-cpu.

    This should improve cpu-cgroup performance for smp systems
    significantly.
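
    Roughly, the pattern is "accumulate locally, fold the delta into the
    shared sum at most once per tick". A hedged user-space sketch (field and
    function names are illustrative, not the kernel's):

    #include <stdatomic.h>
    #include <stdio.h>

    #define NR_CPUS 4

    struct cfs_rq {
        long load_avg;       /* per-cpu running average, touched locally only */
        long load_contrib;   /* what this cpu last folded into the group sum  */
    };

    struct task_group {
        _Atomic long load_avg;           /* global sum, written rarely        */
        struct cfs_rq cfs_rq[NR_CPUS];   /* one per cpu                       */
    };

    /* Called on every local update: cheap, no shared cachelines touched. */
    static void update_cfs_rq_load(struct task_group *tg, int cpu, long sample)
    {
        struct cfs_rq *cfs = &tg->cfs_rq[cpu];
        cfs->load_avg = (cfs->load_avg * 3 + sample) / 4;   /* decaying average */
    }

    /* Called once per tick: fold only the *delta* into the global sum, so the
     * shared cacheline is written at most once per cpu per tick. */
    static void update_tg_load(struct task_group *tg, int cpu)
    {
        struct cfs_rq *cfs = &tg->cfs_rq[cpu];
        long delta = cfs->load_avg - cfs->load_contrib;

        if (delta) {
            atomic_fetch_add(&tg->load_avg, delta);
            cfs->load_contrib = cfs->load_avg;
        }
    }

    int main(void)
    {
        struct task_group tg = { 0 };

        update_cfs_rq_load(&tg, 0, 1024);
        update_tg_load(&tg, 0);
        update_cfs_rq_load(&tg, 1, 2048);
        update_tg_load(&tg, 1);
        printf("tg load: %ld\n", atomic_load(&tg.load_avg));
        return 0;
    }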

    [ Paul: changed to use queueing cfs_rq + bug fixes ]

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

19 Oct, 2010

1 commit

  • The idea was suggested by Peter Zijlstra here:

    http://marc.info/?l=linux-kernel&m=127476934517534&w=2

    irq time is technically not available to the tasks running on the CPU.
    This patch removes irq time from CPU power, piggybacking on
    sched_rt_avg_update().

    Tested this by keeping CPU X busy with a network-intensive task taking ~75%
    of a single CPU in irq processing (hard+soft) on a 4-way system, and starting
    seven cycle soakers on the system. Without this change, there are two tasks on
    each CPU. With this change, there is a single task on the irq-busy CPU X and
    the remaining seven tasks are spread among the other three CPUs.
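
    The gist, as a back-of-the-envelope sketch (assumed formula for
    illustration; the kernel's actual bookkeeping goes through
    sched_rt_avg_update() and is more involved): the power available for
    tasks shrinks by the fraction of time spent in irq context, so a cpu
    doing ~75% irq work advertises roughly a quarter of its nominal power
    to the load balancer.

    #include <stdio.h>

    #define SCHED_POWER_SCALE 1024   /* nominal power of one cpu */

    static unsigned long scale_irq_power(unsigned long power,
                                         unsigned long long irq_ns,
                                         unsigned long long period_ns)
    {
        if (irq_ns >= period_ns)
            return 1;                            /* keep a minimum, never zero */
        return power * (period_ns - irq_ns) / period_ns;
    }

    int main(void)
    {
        /* CPU X from the test: ~75% of the period spent in hard+soft irq. */
        unsigned long long period = 1000000000ULL;   /* 1s           */
        unsigned long long irq    =  750000000ULL;   /* 750ms in irq */
        unsigned long power = scale_irq_power(SCHED_POWER_SCALE, irq, period);

        /* With ~1/4 of the power left, the balancer sees CPU X as able to
         * carry roughly one task while the other cpus share the rest. */
        printf("cpu_power: %lu out of %d\n", power, SCHED_POWER_SCALE);
        return 0;
    }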

    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Venkatesh Pallipadi
     

12 Mar, 2010

7 commits

  • This feature has been enabled for quite a while, after testing showed that
    easing preemption for light tasks was harmful to high priority threads.

    Remove the feature flag.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • Sync wakeups are critical functionality with a long history. Remove the
    feature flag; we don't need the branch or icache footprint.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • This feature never earned its keep, remove it.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • Our preemption model relies too heavily on sleeper fairness to disable it
    without dire consequences. Remove the feature, and save a branch or two.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • This feature hasn't been enabled in a long time, remove effectively dead code.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • Both avg_overlap and avg_wakeup had an inherent problem in that their accuracy
    was detrimentally affected by cross-cpu wakeups, where we are missing
    the necessary call to update_curr(). This can't be fixed without increasing
    overhead in our already too fat fastpath.

    Additionally, with recent load balancing changes making us prefer to place tasks
    in an idle cache domain (which is good for compute bound loads), communicating
    tasks suffer when a sync wakeup, which would enable affine placement, is turned
    into a non-sync wakeup by SYNC_LESS. With one task on the runqueue, wake_affine()
    rejects the affine wakeup request, leaving the unfortunate task where it was
    placed, taking frequent cache misses.

    Remove it, and recover some fastpath cycles.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • Testing the load which led to this heuristic (nfs4 kbuild) shows that it has
    outlived its usefulness. With intervening load balancing changes, I cannot
    see any difference with or without it, so recover those fastpath cycles.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     

09 Dec, 2009

1 commit


17 Sep, 2009

1 commit

  • Create a new wakeup preemption mode: preempt towards tasks that run
    shorter on average. It sets the next buddy to make sure we actually run
    the task we preempted for.

    Test results:

    root@twins:~# while :; do :; done &
    [1] 6537
    root@twins:~# while :; do :; done &
    [2] 6538
    root@twins:~# while :; do :; done &
    [3] 6539
    root@twins:~# while :; do :; done &
    [4] 6540

    root@twins:/home/peter# ./latt -c4 sleep 4
    Entries: 48 (clients=4)

    Averages:
    ------------------------------
    Max 4750 usec
    Avg 497 usec
    Stdev 737 usec

    root@twins:/home/peter# echo WAKEUP_RUNNING > /debug/sched_features

    root@twins:/home/peter# ./latt -c4 sleep 4
    Entries: 48 (clients=4)

    Averages:
    ------------------------------
    Max 14 usec
    Avg 5 usec
    Stdev 3 usec

    Disabled by default - needs more testing.

    Signed-off-by: Peter Zijlstra
    Acked-by: Mike Galbraith
    Signed-off-by: Ingo Molnar
    LKML-Reference:

    Peter Zijlstra
     

16 Sep, 2009

3 commits

  • We don't need to call update_shares() for each domain we iterate,
    just for the largest one.

    However, we should call it before wake_affine() as well, so that
    it can use up-to-date values too.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Add back FAIR_SLEEPERS and GENTLE_FAIR_SLEEPERS.

    FAIR_SLEEPERS is the old logic: credit sleepers with their sleep time.

    GENTLE_FAIR_SLEEPERS dampens this a bit: 50% of their sleep time gets
    credited.

    The hope here is to still give the benefits of fair-sleepers logic
    (quick wakeups, etc.) while not allowing them to have 100% of their
    sleep time credited as if they had been running.
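
    In sketch form (hedged: the real placement logic is more careful about
    clamping the credit, but the 50% idea is just a shift):

    #include <stdbool.h>
    #include <stdio.h>

    /* A waking sleeper is credited by moving its vruntime behind the queue's
     * minimum; a smaller vruntime means it is scheduled sooner. */
    static unsigned long long place_sleeper(unsigned long long min_vruntime,
                                            unsigned long long sleep_credit_ns,
                                            bool gentle)
    {
        if (gentle)
            sleep_credit_ns >>= 1;       /* only 50% of the credit */
        return min_vruntime - sleep_credit_ns;
    }

    int main(void)
    {
        unsigned long long min_vr = 100000000ULL;   /* 100ms, arbitrary     */
        unsigned long long credit =   6000000ULL;   /* 6ms of sleep credit  */

        printf("FAIR_SLEEPERS:        %llu\n", place_sleeper(min_vr, credit, false));
        printf("GENTLE_FAIR_SLEEPERS: %llu\n", place_sleeper(min_vr, credit, true));
        printf("no credit:            %llu\n", min_vr);
        return 0;
    }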

    Cc: Peter Zijlstra
    Cc: Mike Galbraith
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Currently we use overlap to weaken the SYNC hint, but allow it to
    set the hint as well.

    echo NO_SYNC_WAKEUP > /debug/sched_features
    echo SYNC_MORE > /debug/sched_features

    preserves pipe-test behaviour without using the WF_SYNC hint.

    Worth playing with on more workloads...

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

15 Sep, 2009

5 commits

  • I suspect a feedback loop between cpuidle and the aperf/mperf
    cpu_power bits: when we are idle, C-states lower the ratio,
    which leads to lower cpu_power and then less load, which generates
    more idle time, and so on.

    Put in a knob to disable it.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Make the idle balancer more aggressive, to improve an
    x264 encoding workload provided by Jason Garrett-Glaser:

    NEXT_BUDDY NO_LB_BIAS
    encoded 600 frames, 252.82 fps, 22096.60 kb/s
    encoded 600 frames, 250.69 fps, 22096.60 kb/s
    encoded 600 frames, 245.76 fps, 22096.60 kb/s

    NO_NEXT_BUDDY LB_BIAS
    encoded 600 frames, 344.44 fps, 22096.60 kb/s
    encoded 600 frames, 346.66 fps, 22096.60 kb/s
    encoded 600 frames, 352.59 fps, 22096.60 kb/s

    NO_NEXT_BUDDY NO_LB_BIAS
    encoded 600 frames, 425.75 fps, 22096.60 kb/s
    encoded 600 frames, 425.45 fps, 22096.60 kb/s
    encoded 600 frames, 422.49 fps, 22096.60 kb/s

    Peter pointed out that this is better done via newidle_idx,
    not via LB_BIAS: newidle balancing should look for where
    there is load _now_, not where there was load 2 ticks ago.

    Worst-case latencies are improved as well, since no buddies
    means less vruntime spread (as per prior lkml discussions).

    This change improves kbuild-peak parallelism as well.

    Reported-by: Jason Garrett-Glaser
    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • Add text...

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Add a NEXT_BUDDY feature flag to aid in debugging.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • It consists of two conditions; split them out into separate toggles
    so we can test them independently.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

11 Sep, 2009

1 commit

  • Nikos Chantziaras and Jens Axboe reported that turning off
    NEW_FAIR_SLEEPERS improves desktop interactivity visibly.

    Nikos described his experiences the following way:

    " With this setting, I can do "nice -n 19 make -j20" and
    still have a very smooth desktop and watch a movie at
    the same time. Various other annoyances (like the
    "logout/shutdown/restart" dialog of KDE not appearing
    at all until the background fade-out effect has finished)
    are also gone. So this seems to be the single most
    important setting that vastly improves desktop behavior,
    at least here. "

    Jens described it the following way, referring to a 10-seconds
    xmodmap scheduling delay he was trying to debug:

    " Then I tried switching NO_NEW_FAIR_SLEEPERS on, and then
    I get:

    Performance counter stats for 'xmodmap .xmodmap-carl':

    9.009137 task-clock-msecs # 0.447 CPUs
    18 context-switches # 0.002 M/sec
    1 CPU-migrations # 0.000 M/sec
    315 page-faults # 0.035 M/sec

    0.020167093 seconds time elapsed

    Woot! "

    So disable it for now. In perf trace output I can see weird
    delta timestamps:

    cc1-9943 [001] 2802.059479616: sched_stat_wait: task: as:9944 wait: 2801938766276 [ns]

    That nsec field is not supposed to be that large. More digging
    is needed - but let's turn it off while the real bug is found.

    Reported-by: Nikos Chantziaras
    Tested-by: Nikos Chantziaras
    Reported-by: Jens Axboe
    Tested-by: Jens Axboe
    Acked-by: Peter Zijlstra
    Cc: Mike Galbraith
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

31 Mar, 2009

1 commit

  • * 'locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (33 commits)
    lockdep: fix deadlock in lockdep_trace_alloc
    lockdep: annotate reclaim context (__GFP_NOFS), fix SLOB
    lockdep: annotate reclaim context (__GFP_NOFS), fix
    lockdep: build fix for !PROVE_LOCKING
    lockstat: warn about disabled lock debugging
    lockdep: use stringify.h
    lockdep: simplify check_prev_add_irq()
    lockdep: get_user_chars() redo
    lockdep: simplify get_user_chars()
    lockdep: add comments to mark_lock_irq()
    lockdep: remove macro usage from mark_held_locks()
    lockdep: fully reduce mark_lock_irq()
    lockdep: merge the !_READ mark_lock_irq() helpers
    lockdep: merge the _READ mark_lock_irq() helpers
    lockdep: simplify mark_lock_irq() helpers #3
    lockdep: further simplify mark_lock_irq() helpers
    lockdep: simplify the mark_lock_irq() helpers
    lockdep: split up mark_lock_irq()
    lockdep: generate usage strings
    lockdep: generate the state bit definitions
    ...

    Linus Torvalds
     

15 Jan, 2009

2 commits

  • Prefer tasks that wake other tasks to preempt quickly. This improves
    performance because more work is available sooner.

    The workload that prompted this patch was a kernel build over NFS4 (for some
    curious and not understood reason we had to revert commit:
    18de9735300756e3ca9c361ef58409d8561dfe0d to make any progress at all)

    Without this patch a make -j8 bzImage (of x86-64 defconfig) would take
    3m30-ish, with this patch we're down to 2m50-ish.

    psql-sysbench/mysql-sysbench show a slight improvement in peak performance as
    well; tbench and vmark seemed not to care.

    It is possible to improve upon the build time (to 2m20-ish) but that seriously
    destroys other benchmarks (just shows that there's more room for tinkering).

    Much thanks to Mike who put in a lot of effort to benchmark things and proved
    a worthy opponent with a competing patch.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mike Galbraith
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Change mutex contention behaviour such that it will sometimes busy wait on
    acquisition - moving its behaviour closer to that of spinlocks.

    This concept got ported to mainline from the -rt tree, where it was originally
    implemented for rtmutexes by Steven Rostedt, based on work by Gregory Haskins.

    Testing with Ingo's test-mutex application (http://lkml.org/lkml/2006/1/8/50)
    gave a 345% boost for VFS scalability on my testbox:

    # ./test-mutex-shm V 16 10 | grep "^avg ops"
    avg ops/sec: 296604

    # ./test-mutex-shm V 16 10 | grep "^avg ops"
    avg ops/sec: 85870

    The key criteria for the busy wait is that the lock owner has to be running on
    a (different) cpu. The idea is that as long as the owner is running, there is a
    fair chance it'll release the lock soon, and thus we'll be better off spinning
    instead of blocking/scheduling.
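
    The shape of the idea in user space (a hedged sketch, not the mutex code
    itself; user space cannot see whether the owner is on a cpu, so a bounded
    spin count stands in for the "owner is running" check, and sched_yield()
    stands in for really blocking):

    /* build: cc -std=c11 -O2 -pthread spin-then-block.c */
    #include <pthread.h>
    #include <sched.h>
    #include <stdatomic.h>
    #include <stdio.h>

    struct adaptive_mutex { atomic_int locked; };     /* 0 = free, 1 = held */

    #define SPIN_LIMIT 1000   /* stand-in for "owner still running on a cpu" */

    static void adaptive_lock(struct adaptive_mutex *m)
    {
        int spins = 0;
        int expected = 0;

        while (!atomic_compare_exchange_weak(&m->locked, &expected, 1)) {
            expected = 0;
            if (++spins < SPIN_LIMIT)
                continue;          /* optimistic: owner may release soon  */
            sched_yield();         /* stand-in for blocking in the kernel */
        }
    }

    static void adaptive_unlock(struct adaptive_mutex *m)
    {
        atomic_store(&m->locked, 0);
    }

    static struct adaptive_mutex m;
    static long counter;

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            adaptive_lock(&m);
            counter++;
            adaptive_unlock(&m);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;

        pthread_create(&a, NULL, worker, NULL);
        pthread_create(&b, NULL, worker, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("counter = %ld (expect 200000)\n", counter);
        return 0;
    }

    The win comes from the same observation as above: when the critical
    section is short, a brief spin is cheaper than a sleep/wakeup round trip.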

    Since regular mutexes (as opposed to rtmutexes) do not atomically track the
    owner, we add the owner in a non-atomic fashion and deal with the races in
    the slowpath.

    Furthermore, to ease the testing of the performance impact of this new code,
    there is a means to disable this behaviour at runtime (without having to reboot
    the system), when scheduler debugging is enabled (CONFIG_SCHED_DEBUG=y),
    by issuing the following command:

    # echo NO_OWNER_SPIN > /debug/sched_features

    This command re-enables spinning (this is also the default):

    # echo OWNER_SPIN > /debug/sched_features

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

05 Nov, 2008

1 commit

  • Impact: improve/change/fix wakeup-buddy scheduling

    Currently we only have a forward looking buddy, that is, we prefer to
    schedule to the task we last woke up, under the presumption that it's
    going to consume the data we just produced, and therefore will have
    cache hot benefits.

    This allows co-waking producer/consumer task pairs to run ahead of the
    pack for a little while, keeping their cache warm. Without this, we
    would interleave all pairs, utterly thrashing the cache.

    This patch introduces a backward looking buddy, that is, suppose that
    in the above scenario, the consumer preempts the producer before it
    can go to sleep, we will therefore miss the wakeup from consumer to
    producer (it's already running, after all), breaking the cycle and
    reverting to the cache-thrashing interleaved schedule pattern.

    The backward buddy will try to schedule back to the task that woke us
    up in case the forward buddy is not available, under the assumption
    that the task which woke us is the most cache-hot task around,
    barring current.

    This will basically allow a task to continue after it got preempted.

    In order to avoid starvation, we allow either buddy to get wakeup_gran
    ahead of the pack.
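
    As a hedged sketch of the resulting pick order (a simplified stand-in,
    not the kernel's pick_next_entity(); the fairness guard is reduced to a
    plain vruntime bound):

    #include <stddef.h>
    #include <stdio.h>

    struct entity {
        const char        *comm;
        unsigned long long vruntime;
    };

    struct cfs_rq {
        struct entity *leftmost;          /* fairest task on the queue      */
        struct entity *next;              /* forward-looking buddy          */
        struct entity *last;              /* backward-looking buddy         */
        unsigned long long wakeup_gran;   /* how far a buddy may run ahead  */
    };

    /* A buddy is acceptable only if it hasn't already run wakeup_gran past
     * the fairest task; this is the starvation guard mentioned above. */
    static int buddy_ok(const struct cfs_rq *rq, const struct entity *buddy)
    {
        return buddy &&
               buddy->vruntime <= rq->leftmost->vruntime + rq->wakeup_gran;
    }

    static struct entity *pick_next_sketch(struct cfs_rq *rq)
    {
        if (buddy_ok(rq, rq->next))
            return rq->next;     /* keep the producer/consumer pair rolling */
        if (buddy_ok(rq, rq->last))
            return rq->last;     /* resume the task that woke us up         */
        return rq->leftmost;
    }

    int main(void)
    {
        struct entity fair    = { "leftmost", 1000 };
        struct entity woke_us = { "producer", 1500 };
        struct cfs_rq rq = { .leftmost = &fair, .last = &woke_us,
                             .wakeup_gran = 1000 };

        printf("picked: %s\n", pick_next_sketch(&rq)->comm);   /* producer */
        rq.wakeup_gran = 100;
        printf("picked: %s\n", pick_next_sketch(&rq)->comm);   /* leftmost */
        return 0;
    }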

    Signed-off-by: Peter Zijlstra
    Acked-by: Mike Galbraith
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

20 Oct, 2008

1 commit

  • David Miller reported that hrtick update overhead has tripled the
    wakeup overhead on Sparc64.

    That is too much - disable the HRTICK feature for now by default,
    until a faster implementation is found.

    Reported-by: David Miller
    Acked-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

22 Sep, 2008

2 commits

  • WAKEUP_OVERLAP is not a winner on a 16way box, running psql+sysbench:

             .27-rc7-NO_WAKEUP_OVERLAP   .27-rc7-WAKEUP_OVERLAP
    -----------------------------------------------------------
      1:             694                        811     +14.39%
      2:            1454                       1427      -1.86%
      4:            3017                       3070      +1.70%
      8:            5694                       5808      +1.96%
     16:           10592                      10612      +0.19%
     32:            9693                       9647      -0.48%
     64:            8507                       8262      -2.97%
    128:            8402                       7087     -18.55%
    256:            8419                       5124     -64.30%
    512:            7990                       3671    -117.62%
    -----------------------------------------------------------
    SUM:           64466                      55524     -16.11%

    ... so turn it off by default.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Lin Ming reported a 10% OLTP regression against 2.6.27-rc4.

    The difference seems to come from different preemption aggressiveness,
    which affects the cache footprint of the workload and its effective
    cache thrashing.

    Aggressively preempt a task if its avg overlap is very small; this should
    avoid the task going to sleep, so we find it still running when we schedule
    back to it - saving a wakeup.
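
    A hedged sketch of the heuristic (the threshold and field names are
    illustrative, not the kernel's):

    #include <stdbool.h>
    #include <stdio.h>

    struct task {
        const char        *comm;
        unsigned long long avg_overlap_ns;  /* typical run time after a wakeup */
    };

    static const unsigned long long sync_granularity_ns = 1000000ULL;  /* ~1ms */

    /* Preempt immediately only when both the running task and the woken task
     * historically overlap for a very short time, i.e. they behave like a
     * tightly coupled pair that would otherwise sleep and wake constantly. */
    static bool wakeup_preempt_sync(const struct task *curr, const struct task *wakee)
    {
        return curr->avg_overlap_ns  < sync_granularity_ns &&
               wakee->avg_overlap_ns < sync_granularity_ns;
    }

    int main(void)
    {
        struct task oltp_client = { "client",   200000 };   /* 0.2ms bursts   */
        struct task oltp_server = { "server",   150000 };
        struct task encoder     = { "x264",   90000000 };   /* long cpu bursts */

        printf("client wakes server:  preempt=%d\n",
               wakeup_preempt_sync(&oltp_client, &oltp_server));
        printf("encoder wakes server: preempt=%d\n",
               wakeup_preempt_sync(&encoder, &oltp_server));
        return 0;
    }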

    Reported-by: Lin Ming
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

21 Aug, 2008

1 commit

  • Yanmin reported a significant regression on his 16-core machine due to:

    commit 93b75217df39e6d75889cc6f8050343286aff4a5
    Author: Peter Zijlstra
    Date: Fri Jun 27 13:41:33 2008 +0200

    Flip back to the old behaviour.

    Reported-by: "Zhang, Yanmin"
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

27 Jun, 2008

5 commits

  • Measurement shows that the difference between cgroup:/ and cgroup:/foo
    wake_affine() results is that the latter succeeds significantly more.

    Therefore bias the calculations towards failing the test.

    Signed-off-by: Peter Zijlstra
    Cc: Srivatsa Vaddagiri
    Cc: Mike Galbraith
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • We found that the affine wakeup code needs rather accurate load figures
    to be effective. The trouble is that updating the load figures is fairly
    expensive with group scheduling. Therefore ratelimit the updating.
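
    The ratelimit itself is a tiny pattern; a hedged user-space sketch (the
    250us window and the names are made up for illustration):

    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    static uint64_t now_ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
    }

    struct shares_state {
        uint64_t last_update_ns;
        uint64_t min_interval_ns;   /* e.g. a few hundred microseconds */
        long     updates_done;
    };

    /* Stand-in for the expensive per-group load recomputation. */
    static void expensive_update_shares(struct shares_state *s)
    {
        s->updates_done++;
    }

    static void maybe_update_shares(struct shares_state *s)
    {
        uint64_t now = now_ns();

        if (now - s->last_update_ns < s->min_interval_ns)
            return;                         /* too soon, reuse stale figures */
        s->last_update_ns = now;
        expensive_update_shares(s);
    }

    int main(void)
    {
        struct shares_state s = { .min_interval_ns = 250000 };   /* 250us */

        for (int i = 0; i < 1000000; i++)
            maybe_update_shares(&s);
        printf("requested 1000000 updates, performed %ld\n", s.updates_done);
        return 0;
    }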

    Signed-off-by: Peter Zijlstra
    Cc: Srivatsa Vaddagiri
    Cc: Mike Galbraith
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The bias given by source/target_load functions can be very large, disable
    it by default to get faster convergence.
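
    A hedged sketch of what the toggle changes (simplified: the real functions
    index a decayed cpu_load[] history, reduced here to one "recent" figure).
    With the bias, a source cpu under-reports and a target cpu over-reports its
    load, which damps migrations; without it, both report the instantaneous
    load, so the balancer converges faster.

    #include <stdio.h>

    static unsigned long min_ul(unsigned long a, unsigned long b) { return a < b ? a : b; }
    static unsigned long max_ul(unsigned long a, unsigned long b) { return a > b ? a : b; }

    /* Load reported for a cpu we might pull from: biased low. */
    static unsigned long source_load(unsigned long recent, unsigned long now, int lb_bias)
    {
        return lb_bias ? min_ul(recent, now) : now;
    }

    /* Load reported for a cpu we might push to: biased high. */
    static unsigned long target_load(unsigned long recent, unsigned long now, int lb_bias)
    {
        return lb_bias ? max_ul(recent, now) : now;
    }

    int main(void)
    {
        /* A cpu that was busy a moment ago (recent=2048) but is idle now. */
        printf("biased   source/target: %lu / %lu\n",
               source_load(2048, 0, 1), target_load(2048, 0, 1));
        printf("unbiased source/target: %lu / %lu\n",
               source_load(2048, 0, 0), target_load(2048, 0, 0));
        return 0;
    }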

    Signed-off-by: Peter Zijlstra
    Cc: Srivatsa Vaddagiri
    Cc: Mike Galbraith
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • calc_delta_asym() is supposed to do the same as calc_delta_fair() except
    linearly shrink the result for negative nice processes - this causes them
    to have a smaller preemption threshold so that they are more easily preempted.

    The problem is that for task groups se->load.weight is the per cpu share of
    the actual task group weight; take that into account.

    Also provide a debug switch to disable the asymmetry (which I still don't
    like - but it does greatly benefit some workloads).

    This would explain the interactivity issues reported against group scheduling.

    Signed-off-by: Peter Zijlstra
    Cc: Srivatsa Vaddagiri
    Cc: Mike Galbraith
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Try again..

    initial commit: 8f1bc385cfbab474db6c27b5af1e439614f3025c
    revert: f9305d4a0968201b2818dbed0dc8cb0d4ee7aeb3

    Signed-off-by: Peter Zijlstra
    Cc: Srivatsa Vaddagiri
    Cc: Mike Galbraith
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

10 Jun, 2008

1 commit


20 Apr, 2008

1 commit