25 Jul, 2011

1 commit

  • * 'kvm-updates/3.1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (143 commits)
    KVM: IOMMU: Disable device assignment without interrupt remapping
    KVM: MMU: trace mmio page fault
    KVM: MMU: mmio page fault support
    KVM: MMU: reorganize struct kvm_shadow_walk_iterator
    KVM: MMU: lockless walking shadow page table
    KVM: MMU: do not need atomicly to set/clear spte
    KVM: MMU: introduce the rules to modify shadow page table
    KVM: MMU: abstract some functions to handle fault pfn
    KVM: MMU: filter out the mmio pfn from the fault pfn
    KVM: MMU: remove bypass_guest_pf
    KVM: MMU: split kvm_mmu_free_page
    KVM: MMU: count used shadow pages on prepareing path
    KVM: MMU: rename 'pt_write' to 'emulate'
    KVM: MMU: cleanup for FNAME(fetch)
    KVM: MMU: optimize to handle dirty bit
    KVM: MMU: cache mmio info on page fault path
    KVM: x86: introduce vcpu_mmio_gva_to_gpa to cleanup the code
    KVM: MMU: do not update slot bitmap if spte is nonpresent
    KVM: MMU: fix walking shadow page table
    KVM guest: KVM Steal time registration
    ...

    Linus Torvalds
     

21 Jul, 2011

1 commit

  • Allow for sched_domain spans that overlap by giving such domains their
    own sched_group list instead of sharing the sched_groups amongst
    each other.

    This is needed for machines with more than 16 nodes, because
    sched_domain_node_span() will generate a node mask from the
    16 nearest nodes without regard to whether these masks overlap.

    Currently sched_domains have a sched_group that maps to their child
    sched_domain span, and since there is no overlap we share the
    sched_group between the sched_domains of the various CPUs. If however
    there is overlap, we would need to link the sched_group list in
    different ways for each cpu, and hence sharing isn't possible.

    In order to solve this, allocate private sched_groups for each CPU's
    sched_domain but have the sched_groups share a sched_group_power
    structure such that we can uniquely track the power.
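
    To make the new layout concrete, here is a minimal user-space sketch of
    the idea (simplified, hypothetical field names; the real kernel structures
    carry far more state): every CPU gets its own sched_group node for the
    overlapping domain, while all of those nodes point at one shared
    sched_group_power.

    #include <stdio.h>
    #include <stdlib.h>

    /* Shared per-span power bookkeeping (hypothetical, simplified). */
    struct sched_group_power {
        unsigned int power;                /* tracked once per span */
    };

    /* Each CPU gets a private group node but shares the power struct. */
    struct sched_group {
        struct sched_group *next;          /* per-CPU circular group list */
        struct sched_group_power *sgp;     /* shared across overlapping CPUs */
        unsigned long cpumask;             /* toy stand-in for struct cpumask */
    };

    #define NR_CPUS 4

    int main(void)
    {
        struct sched_group_power *shared = calloc(1, sizeof(*shared));
        struct sched_group groups[NR_CPUS];

        shared->power = 1024;
        for (int cpu = 0; cpu < NR_CPUS; cpu++) {
            groups[cpu].sgp = shared;          /* power tracked uniquely */
            groups[cpu].cpumask = 1UL << cpu;
            groups[cpu].next = &groups[cpu];   /* private list, per CPU */
        }

        printf("cpu0 and cpu3 share the power struct: %s\n",
               groups[0].sgp == groups[3].sgp ? "yes" : "no");
        free(shared);
        return 0;
    }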

    Reported-and-tested-by: Anton Blanchard
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/n/tip-08bxqw9wis3qti9u5inifh3y@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

14 Jul, 2011

1 commit

  • This patch makes update_rq_clock() aware of steal time.
    The mechanism of operation is no different from irq_time,
    and follows the same principles. It lives behind its own CONFIG
    option and can be compiled out independently of the rest of
    steal time reporting. The effect of disabling it is that the
    scheduler will still report steal time (that cannot be disabled),
    but won't use this information for cpu power adjustments.

    Every time update_rq_clock_task() is invoked, we query how much
    time was stolen since the last call, and feed it into
    sched_rt_avg_update().

    Although steal time reporting in account_process_tick() keeps
    track of the last time we read the steal clock, in prev_steal_time,
    this patch does it independently using another field,
    prev_steal_time_rq. This is because otherwise, information about time
    accounted in update_process_tick() would never reach us in update_rq_clock().
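
    A rough user-space model of the accounting described above (hypothetical
    names throughout; paravirt_steal_clock() here is just a stand-in for the
    real steal-clock read): on every clock update, take the steal delta since
    prev_steal_time_rq, subtract it from the task clock, and feed it into the
    rt average that cpu power scaling uses.

    #include <stdio.h>

    /* Toy stand-in for the hypervisor's steal counter (hypothetical). */
    static unsigned long long steal_clock_now;
    static unsigned long long paravirt_steal_clock(void) { return steal_clock_now; }

    struct rq {
        unsigned long long clock_task;         /* task-visible time           */
        unsigned long long prev_steal_time_rq; /* last steal value seen here  */
        unsigned long long rt_avg;             /* feeds cpu power scaling     */
    };

    static void sched_rt_avg_update(struct rq *rq, unsigned long long delta)
    {
        rq->rt_avg += delta;                   /* real code decays this average */
    }

    /* Modeled after the steal-time branch of update_rq_clock_task(). */
    static void update_rq_clock_task(struct rq *rq, unsigned long long delta)
    {
        unsigned long long steal = paravirt_steal_clock() - rq->prev_steal_time_rq;

        if (steal > delta)
            steal = delta;                     /* never account more than elapsed */

        rq->prev_steal_time_rq += steal;
        delta -= steal;                        /* stolen time is not task time */
        rq->clock_task += delta;

        sched_rt_avg_update(rq, steal);        /* lower cpu power accordingly  */
    }

    int main(void)
    {
        struct rq rq = {0};
        steal_clock_now = 300000;              /* 0.3 ms stolen so far         */
        update_rq_clock_task(&rq, 1000000);    /* 1 ms of wall time elapsed    */
        printf("clock_task=%llu rt_avg=%llu\n", rq.clock_task, rq.rt_avg);
        return 0;
    }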

    Signed-off-by: Glauber Costa
    Acked-by: Rik van Riel
    Acked-by: Peter Zijlstra
    Tested-by: Eric B Munson
    CC: Jeremy Fitzhardinge
    CC: Anthony Liguori
    Signed-off-by: Avi Kivity

    Glauber Costa
     

14 Apr, 2011

1 commit

  • Now that we've removed the rq->lock requirement from the first part of
    ttwu() and can compute placement without holding any rq->lock, ensure
    we execute the second half of ttwu() on the actual cpu we want the
    task to run on.

    This avoids having to take rq->lock and doing the task enqueue
    remotely, saving lots on cacheline transfers.

    As measured using: http://oss.oracle.com/~mason/sembench.c

    $ for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor ; do echo performance > $i; done
    $ echo 4096 32000 64 128 > /proc/sys/kernel/sem
    $ ./sembench -t 2048 -w 1900 -o 0

    unpatched: run time 30 seconds 647278 worker burns per second
    patched: run time 30 seconds 816715 worker burns per second
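
    A much-simplified model of the queueing idea (hypothetical names, no real
    IPIs or locking): instead of grabbing the remote rq->lock, the waking CPU
    appends the task to the target CPU's wake list and pokes it, and the
    target CPU finishes the enqueue locally.

    #include <stdio.h>

    /* Toy model: a singly-linked wake list per CPU, drained by that CPU. */
    struct task { const char *name; struct task *wake_next; };
    struct rq   { struct task *wake_list; int nr_running; };

    static struct rq runqueues[2];

    /* Waking CPU: queue remotely instead of taking the remote rq->lock. */
    static void ttwu_queue_remote(struct task *p, int cpu)
    {
        struct rq *rq = &runqueues[cpu];

        p->wake_next = rq->wake_list;
        rq->wake_list = p;
        /* real code sends a rescheduling IPI here so 'cpu' drains the list */
    }

    /* Target CPU: second half of ttwu(), run locally with only its own lock. */
    static void sched_ttwu_pending(int cpu)
    {
        struct rq *rq = &runqueues[cpu];

        while (rq->wake_list) {
            struct task *p = rq->wake_list;

            rq->wake_list = p->wake_next;
            rq->nr_running++;                  /* local activate_task() */
            printf("cpu%d enqueued %s locally\n", cpu, p->name);
        }
    }

    int main(void)
    {
        struct task t = { .name = "worker" };

        ttwu_queue_remote(&t, 1);   /* cpu0 wakes a task destined for cpu1 */
        sched_ttwu_pending(1);      /* cpu1 does the enqueue itself        */
        return 0;
    }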

    Reviewed-by: Frank Rowand
    Cc: Mike Galbraith
    Cc: Nick Piggin
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Signed-off-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110405152729.515897185@chello.nl

    Peter Zijlstra
     

18 Nov, 2010

1 commit

  • By tracking a per-cpu load-avg for each cfs_rq and folding it into a
    global task_group load on each tick we can rework tg_shares_up to be
    strictly per-cpu.

    This should improve cpu-cgroup performance for smp systems
    significantly.

    [ Paul: changed to use queueing cfs_rq + bug fixes ]
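
    A toy illustration of the folding step (hypothetical names, no atomics or
    decay): each cfs_rq keeps its own load average and, on the tick, folds
    only the delta since its last contribution into the shared task_group sum.

    #include <stdio.h>

    struct task_group { long load_avg; };                  /* global sum          */
    struct cfs_rq {
        struct task_group *tg;
        long load_avg;                                     /* per-cpu average     */
        long load_contribution;                            /* what we last added  */
    };

    /* Fold the per-cpu delta into the global task_group load (tick-time work).
     * The real code uses atomics and only bothers when the delta is large. */
    static void update_cfs_load_contribution(struct cfs_rq *cfs_rq)
    {
        long delta = cfs_rq->load_avg - cfs_rq->load_contribution;

        if (delta) {
            cfs_rq->tg->load_avg += delta;
            cfs_rq->load_contribution += delta;
        }
    }

    int main(void)
    {
        struct task_group tg = { .load_avg = 0 };
        struct cfs_rq cpu0 = { .tg = &tg }, cpu1 = { .tg = &tg };

        cpu0.load_avg = 1024;              /* pretend per-cpu tracking ran */
        cpu1.load_avg = 2048;
        update_cfs_load_contribution(&cpu0);
        update_cfs_load_contribution(&cpu1);
        printf("tg load = %ld\n", tg.load_avg);    /* 3072 */
        return 0;
    }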

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

19 Oct, 2010

1 commit

  • The idea was suggested by Peter Zijlstra here:

    http://marc.info/?l=linux-kernel&m=127476934517534&w=2

    irq time is technically not available to the tasks running on the CPU.
    This patch removes irq time from CPU power, piggybacking on
    sched_rt_avg_update().

    Tested this by keeping CPU X busy with a network intensive task that has
    75% of a single CPU's worth of irq processing (hard+soft) on a 4-way
    system, and starting seven cycle soakers on the system. Without this
    change, there are two tasks on each CPU. With this change, there is a
    single task on the irq-busy CPU X and the remaining seven tasks are
    spread among the other three CPUs.
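
    A simplified model of the idea (hypothetical names and numbers): irq time
    is folded into the same rolling average already used for rt time, so the
    capacity left for fair tasks on that CPU shrinks accordingly.

    #include <stdio.h>

    #define SCHED_POWER_SCALE 1024UL

    struct rq {
        unsigned long long rt_avg;        /* rt + irq time over the window   */
        unsigned long long period;        /* length of the averaging window  */
    };

    /* Fold irq time into the existing rt average (what the patch piggybacks on). */
    static void sched_rt_avg_update(struct rq *rq, unsigned long long irq_delta)
    {
        rq->rt_avg += irq_delta;
    }

    /* Scale cpu power down by the fraction of the window lost to rt/irq work. */
    static unsigned long scale_rt_power(struct rq *rq)
    {
        unsigned long long available = rq->period - rq->rt_avg;

        return (unsigned long)(available * SCHED_POWER_SCALE / rq->period);
    }

    int main(void)
    {
        struct rq rq = { .period = 1000000 };      /* 1 ms window            */

        sched_rt_avg_update(&rq, 750000);          /* 75% eaten by irq work  */
        printf("cpu power: %lu of %lu\n", scale_rt_power(&rq), SCHED_POWER_SCALE);
        return 0;   /* prints roughly 256: a quarter of a CPU left for tasks */
    }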

    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Venkatesh Pallipadi
     

12 Mar, 2010

7 commits

  • This feature has been enabled for quite a while, after testing showed that
    easing preemption for light tasks was harmful to high priority threads.

    Remove the feature flag.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • Sync wakeups are critical functionality with a long history. Remove the
    feature flag; we don't need the branch or icache footprint.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • This feature never earned its keep, remove it.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • Our preemption model relies too heavily on sleeper fairness to disable it
    without dire consequences. Remove the feature, and save a branch or two.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • This feature hasn't been enabled in a long time, remove effectively dead code.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • Both avg_overlap and avg_wakeup had an inherent problem in that their accuracy
    was detrimentally affected by cross-cpu wakeups; this is because we are missing
    the necessary call to update_curr(). This can't be fixed without increasing
    overhead in our already too fat fastpath.

    Additionally, with recent load balancing changes making us prefer to place tasks
    in an idle cache domain (which is good for compute bound loads), communicating
    tasks suffer when a sync wakeup, which would enable affine placement, is turned
    into a non-sync wakeup by SYNC_LESS. With one task on the runqueue, wake_affine()
    rejects the affine wakeup request, leaving the unfortunate task where it
    was placed, taking frequent cache misses.

    Remove it, and recover some fastpath cycles.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • Testing the load which led to this heuristic (nfs4 kbuild) shows that it has
    outlived its usefulness. With intervening load balancing changes, I cannot
    see any difference with/without, so recover those fastpath cycles.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     

09 Dec, 2009

1 commit


17 Sep, 2009

1 commit

  • Create a new wakeup preemption mode, preempt towards tasks that run
    shorter on avg. It sets next buddy to be sure we actually run the task
    we preempted for.

    Test results:

    root@twins:~# while :; do :; done &
    [1] 6537
    root@twins:~# while :; do :; done &
    [2] 6538
    root@twins:~# while :; do :; done &
    [3] 6539
    root@twins:~# while :; do :; done &
    [4] 6540

    root@twins:/home/peter# ./latt -c4 sleep 4
    Entries: 48 (clients=4)

    Averages:
    ------------------------------
    Max 4750 usec
    Avg 497 usec
    Stdev 737 usec

    root@twins:/home/peter# echo WAKEUP_RUNNING > /debug/sched_features

    root@twins:/home/peter# ./latt -c4 sleep 4
    Entries: 48 (clients=4)

    Averages:
    ------------------------------
    Max 14 usec
    Avg 5 usec
    Stdev 3 usec

    Disabled by default - needs more testing.
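
    Roughly, the new mode compares the average runtime of the current task
    with that of the freshly woken task and preempts in favour of the shorter
    runner, marking it as next buddy (sketch with hypothetical field names):

    #include <stdio.h>

    struct sched_entity { unsigned long long avg_running; }; /* avg runtime per run */

    static const struct sched_entity *next_buddy;

    /* Preempt towards the task that runs shorter on average, and make it the
     * next buddy so it is actually what we pick after the preemption. */
    static int wakeup_running_preempt(const struct sched_entity *curr,
                                      const struct sched_entity *woken)
    {
        if (woken->avg_running < curr->avg_running) {
            next_buddy = woken;      /* be sure we run the task we preempted for */
            return 1;                /* resched current */
        }
        return 0;
    }

    int main(void)
    {
        struct sched_entity hog  = { .avg_running = 4000000 };   /* 4 ms bursts  */
        struct sched_entity latt = { .avg_running = 50000 };     /* 50 us bursts */

        printf("preempt: %d\n", wakeup_running_preempt(&hog, &latt));   /* 1 */
        printf("next buddy set: %d\n", next_buddy == &latt);            /* 1 */
        return 0;
    }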

    Signed-off-by: Peter Zijlstra
    Acked-by: Mike Galbraith
    Signed-off-by: Ingo Molnar
    LKML-Reference:

    Peter Zijlstra
     

16 Sep, 2009

3 commits

  • We don't need to call update_shares() for each domain we iterate,
    just for the largest one.

    However, we should call it before wake_affine() as well, so that
    it can use up-to-date values too.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Add back FAIR_SLEEPERS and GENTLE_FAIR_SLEEPERS.

    FAIR_SLEEPERS is the old logic: credit sleepers with their sleep time.

    GENTLE_FAIR_SLEEPERS dampens this a bit: 50% of their sleep time gets
    credited.

    The hope here is to still give the benefits of fair-sleepers logic
    (quick wakeups, etc.) while not allowing them to have 100% of their
    sleep time credited as if they were running.
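
    In place_entity() terms, the sleeper credit is a vruntime discount of up
    to one latency period, and the gentle variant halves it. A sketch with
    made-up numbers:

    #include <stdio.h>

    #define FAIR_SLEEPERS        1
    #define GENTLE_FAIR_SLEEPERS 1

    static unsigned long long sysctl_sched_latency = 6000000ULL;   /* 6 ms */

    /* Sketch of the sleeper-credit part of place_entity(): a waking sleeper is
     * placed up to one latency period behind min_vruntime; GENTLE halves that. */
    static unsigned long long place_sleeper(unsigned long long min_vruntime)
    {
        unsigned long long thresh = sysctl_sched_latency;

        if (!FAIR_SLEEPERS)
            return min_vruntime;              /* no credit at all */

        if (GENTLE_FAIR_SLEEPERS)
            thresh >>= 1;                     /* only 50% of the credit */

        return min_vruntime - thresh;
    }

    int main(void)
    {
        printf("placed vruntime: %llu\n", place_sleeper(100000000ULL));
        return 0;   /* 100 ms minus 3 ms with the gentle flag, minus 6 ms without */
    }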

    Cc: Peter Zijlstra
    Cc: Mike Galbraith
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Currently we use overlap to weaken the SYNC hint, but allow it to
    set the hint as well.

    echo NO_SYNC_WAKEUP > /debug/sched_features
    echo SYNC_MORE > /debug/sched_features

    preserves pipe-test behaviour without using the WF_SYNC hint.

    Worth playing with on more workloads...
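
    A sketch of the adjustment (the 0.5 ms cutoff below is made up; the real
    code compares avg_overlap against a migration-cost style threshold): a
    large overlap clears the caller's sync hint, a small overlap may set it.

    #include <stdio.h>

    #define SYNC_LESS 1   /* overlap may clear the caller-provided hint      */
    #define SYNC_MORE 1   /* overlap may set the hint even without WF_SYNC   */

    static unsigned long long overlap_cutoff = 500000;   /* 0.5 ms, made up  */

    static int adjust_sync(int sync, unsigned long long curr_overlap,
                           unsigned long long wakee_overlap)
    {
        int overlapping = curr_overlap > overlap_cutoff ||
                          wakee_overlap > overlap_cutoff;

        if (SYNC_LESS && overlapping)
            sync = 0;          /* tasks keep running after waking each other */
        if (SYNC_MORE && !overlapping)
            sync = 1;          /* short overlap: treat it as a sync wakeup   */
        return sync;
    }

    int main(void)
    {
        printf("pipe-like pair: sync=%d\n", adjust_sync(0, 10000, 20000));
        printf("busy pair:      sync=%d\n", adjust_sync(1, 900000, 800000));
        return 0;
    }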

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

15 Sep, 2009

5 commits

  • I suspect a feedback loop between cpuidle and the aperf/mperf
    cpu_power bits, where idle C-states lower the ratio, which leads
    to lower cpu_power and then less load, which generates more idle
    time, and so on.

    Put in a knob to disable it.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Make the idle balancer more aggressive, to improve a
    x264 encoding workload provided by Jason Garrett-Glaser:

    NEXT_BUDDY NO_LB_BIAS
    encoded 600 frames, 252.82 fps, 22096.60 kb/s
    encoded 600 frames, 250.69 fps, 22096.60 kb/s
    encoded 600 frames, 245.76 fps, 22096.60 kb/s

    NO_NEXT_BUDDY LB_BIAS
    encoded 600 frames, 344.44 fps, 22096.60 kb/s
    encoded 600 frames, 346.66 fps, 22096.60 kb/s
    encoded 600 frames, 352.59 fps, 22096.60 kb/s

    NO_NEXT_BUDDY NO_LB_BIAS
    encoded 600 frames, 425.75 fps, 22096.60 kb/s
    encoded 600 frames, 425.45 fps, 22096.60 kb/s
    encoded 600 frames, 422.49 fps, 22096.60 kb/s

    Peter pointed out that this is better done via newidle_idx,
    not via LB_BIAS, newidle balancing should look for where
    there is load _now_, not where there was load 2 ticks ago.

    Worst-case latencies are improved as well, since no buddies
    means less vruntime spread (as per prior lkml discussions).

    This change improves kbuild-peak parallelism as well.

    Reported-by: Jason Garrett-Glaser
    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • Add text...

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Add a NEXT_BUDDY feature flag to aid in debugging.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • It consists of two conditions; split them out into separate toggles
    so we can test them independently.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

11 Sep, 2009

1 commit

  • Nikos Chantziaras and Jens Axboe reported that turning off
    NEW_FAIR_SLEEPERS improves desktop interactivity visibly.

    Nikos described his experiences the following way:

    " With this setting, I can do "nice -n 19 make -j20" and
    still have a very smooth desktop and watch a movie at
    the same time. Various other annoyances (like the
    "logout/shutdown/restart" dialog of KDE not appearing
    at all until the background fade-out effect has finished)
    are also gone. So this seems to be the single most
    important setting that vastly improves desktop behavior,
    at least here. "

    Jens described it the following way, referring to a 10-seconds
    xmodmap scheduling delay he was trying to debug:

    " Then I tried switching NO_NEW_FAIR_SLEEPERS on, and then
    I get:

    Performance counter stats for 'xmodmap .xmodmap-carl':

    9.009137 task-clock-msecs # 0.447 CPUs
    18 context-switches # 0.002 M/sec
    1 CPU-migrations # 0.000 M/sec
    315 page-faults # 0.035 M/sec

    0.020167093 seconds time elapsed

    Woot! "

    So disable it for now. In perf trace output I can see weird
    delta timestamps:

    cc1-9943 [001] 2802.059479616: sched_stat_wait: task: as:9944 wait: 2801938766276 [ns]

    That nsec field is not supposed to be that large. More digging
    is needed - but let's turn it off until the real bug is found.

    Reported-by: Nikos Chantziaras
    Tested-by: Nikos Chantziaras
    Reported-by: Jens Axboe
    Tested-by: Jens Axboe
    Acked-by: Peter Zijlstra
    Cc: Mike Galbraith
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

31 Mar, 2009

1 commit

  • * 'locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (33 commits)
    lockdep: fix deadlock in lockdep_trace_alloc
    lockdep: annotate reclaim context (__GFP_NOFS), fix SLOB
    lockdep: annotate reclaim context (__GFP_NOFS), fix
    lockdep: build fix for !PROVE_LOCKING
    lockstat: warn about disabled lock debugging
    lockdep: use stringify.h
    lockdep: simplify check_prev_add_irq()
    lockdep: get_user_chars() redo
    lockdep: simplify get_user_chars()
    lockdep: add comments to mark_lock_irq()
    lockdep: remove macro usage from mark_held_locks()
    lockdep: fully reduce mark_lock_irq()
    lockdep: merge the !_READ mark_lock_irq() helpers
    lockdep: merge the _READ mark_lock_irq() helpers
    lockdep: simplify mark_lock_irq() helpers #3
    lockdep: further simplify mark_lock_irq() helpers
    lockdep: simplify the mark_lock_irq() helpers
    lockdep: split up mark_lock_irq()
    lockdep: generate usage strings
    lockdep: generate the state bit definitions
    ...

    Linus Torvalds
     

15 Jan, 2009

2 commits

  • Prefer tasks that wake other tasks to preempt quickly. This improves
    performance because more work is available sooner.

    The workload that prompted this patch was a kernel build over NFS4 (for some
    curious and not understood reason we had to revert commit:
    18de9735300756e3ca9c361ef58409d8561dfe0d to make any progress at all)

    Without this patch a make -j8 bzImage (of x86-64 defconfig) would take
    3m30-ish, with this patch we're down to 2m50-ish.

    psql-sysbench/mysql-sysbench show a slight improvement in peak performance
    as well; tbench and vmark seemed not to care.

    It is possible to improve upon the build time (to 2m20-ish) but that seriously
    destroys other benchmarks (just shows that there's more room for tinkering).

    Much thanks to Mike who put in a lot of effort to benchmark things and proved
    a worthy opponent with a competing patch.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mike Galbraith
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Change mutex contention behaviour such that it will sometimes busy wait on
    acquisition - moving its behaviour closer to that of spinlocks.

    This concept got ported to mainline from the -rt tree, where it was originally
    implemented for rtmutexes by Steven Rostedt, based on work by Gregory Haskins.

    Testing with Ingo's test-mutex application (http://lkml.org/lkml/2006/1/8/50)
    gave a 345% boost for VFS scalability on my testbox:

    # ./test-mutex-shm V 16 10 | grep "^avg ops"
    avg ops/sec: 296604

    # ./test-mutex-shm V 16 10 | grep "^avg ops"
    avg ops/sec: 85870

    The key criteria for the busy wait is that the lock owner has to be running on
    a (different) cpu. The idea is that as long as the owner is running, there is a
    fair chance it'll release the lock soon, and thus we'll be better off spinning
    instead of blocking/scheduling.

    Since regular mutexes (as opposed to rtmutexes) do not atomically track the
    owner, we add the owner in a non-atomic fashion and deal with the races in
    the slowpath.

    Furthermore, to ease the testing of the performance impact of this new code,
    there is a means to disable this behaviour at runtime (without having to
    reboot the system) when scheduler debugging is enabled (CONFIG_SCHED_DEBUG=y),
    by issuing the following command:

    # echo NO_OWNER_SPIN > /debug/sched_features

    This command re-enables spinning again (this is also the default):

    # echo OWNER_SPIN > /debug/sched_features
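
    The core idea, as a stripped-down user-space sketch (hypothetical helpers,
    no preemption or need_resched checks): keep spinning while the recorded
    owner is running on a CPU, otherwise fall back to blocking.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    struct toy_task  { _Atomic bool on_cpu; };
    struct toy_mutex {
        _Atomic int count;                  /* 1 = unlocked, 0 = locked here;
                                               real mutexes also use negatives */
        _Atomic(struct toy_task *) owner;   /* set non-atomically in real code */
    };

    /* Stub for the blocking slowpath: sleep until the owner wakes us. */
    static void block_on(struct toy_mutex *lock) { (void)lock; }

    /* Sketch of the optimistic-spin step in the mutex slowpath. */
    static void toy_mutex_lock(struct toy_mutex *lock, struct toy_task *self)
    {
        for (;;) {
            int one = 1;

            if (atomic_compare_exchange_strong(&lock->count, &one, 0)) {
                atomic_store(&lock->owner, self);       /* acquired */
                return;
            }

            struct toy_task *owner = atomic_load(&lock->owner);

            if (owner && atomic_load(&owner->on_cpu))
                continue;   /* owner is running: likely to release soon, spin */

            block_on(lock); /* owner not running (or unknown): go to sleep    */
        }
    }

    int main(void)
    {
        struct toy_task me = { .on_cpu = true };
        struct toy_mutex m = { .count = 1 };

        toy_mutex_lock(&m, &me);
        printf("acquired, count=%d\n", atomic_load(&m.count));
        return 0;
    }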

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

05 Nov, 2008

1 commit

  • Impact: improve/change/fix wakeup-buddy scheduling

    Currently we only have a forward looking buddy, that is, we prefer to
    schedule to the task we last woke up, under the presumption that it's
    going to consume the data we just produced, and therefore will have
    cache hot benefits.

    This allows co-waking producer/consumer task pairs to run ahead of the
    pack for a little while, keeping their cache warm. Without this, we
    would interleave all pairs, utterly thrashing the cache.

    This patch introduces a backward looking buddy: suppose that in the
    above scenario the consumer preempts the producer before it can go to
    sleep; we will then miss the wakeup from consumer to producer (it's
    already running, after all), breaking the cycle and reverting to the
    cache-thrashing interleaved schedule pattern.

    The backward buddy will try to schedule back to the task that woke us
    up in case the forward buddy is not available, under the assumption
    that, barring current, the task that last woke us is the most cache-hot
    task around.

    This will basically allow a task to continue after it got preempted.

    In order to avoid starvation, we allow either buddy to get wakeup_gran
    ahead of the pack.
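
    Pick order with both buddies, as a sketch (hypothetical helper; the real
    pick also verifies that a buddy has not run too far ahead, which is the
    wakeup_gran check mentioned above):

    #include <stddef.h>
    #include <stdio.h>

    struct sched_entity { const char *comm; };

    static struct sched_entity *next_buddy;   /* task we last woke (forward)   */
    static struct sched_entity *last_buddy;   /* task that last woke us (back) */

    /* Sketch of the buddy preference in pick_next_entity(): forward buddy
     * first, then the backward buddy, then the leftmost (fairest) entity. */
    static struct sched_entity *pick_next(struct sched_entity *leftmost)
    {
        if (next_buddy)
            return next_buddy;
        if (last_buddy)
            return last_buddy;       /* continue the producer/consumer cycle */
        return leftmost;
    }

    int main(void)
    {
        struct sched_entity producer = { "producer" }, other = { "other" };

        /* consumer preempted the producer, so only the backward buddy is set */
        last_buddy = &producer;
        printf("next: %s\n", pick_next(&other)->comm);   /* producer */
        return 0;
    }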

    Signed-off-by: Peter Zijlstra
    Acked-by: Mike Galbraith
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

20 Oct, 2008

1 commit

  • David Miller reported that hrtick update overhead has tripled the
    wakeup overhead on Sparc64.

    That is too much - disable the HRTICK feature for now by default,
    until a faster implementation is found.

    Reported-by: David Miller
    Acked-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

22 Sep, 2008

2 commits

  • WAKEUP_OVERLAP is not a winner on a 16way box, running psql+sysbench:

             .27-rc7-NO_WAKEUP_OVERLAP   .27-rc7-WAKEUP_OVERLAP
    ------------------------------------------------------------
      1:              694                       811     +14.39%
      2:             1454                      1427      -1.86%
      4:             3017                      3070      +1.70%
      8:             5694                      5808      +1.96%
     16:            10592                     10612      +0.19%
     32:             9693                      9647      -0.48%
     64:             8507                      8262      -2.97%
    128:             8402                      7087     -18.55%
    256:             8419                      5124     -64.30%
    512:             7990                      3671    -117.62%
    ------------------------------------------------------------
    SUM:            64466                     55524     -16.11%

    ... so turn it off by default.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Lin Ming reported a 10% OLTP regression against 2.6.27-rc4.

    The difference seems to come from different preemption aggressiveness,
    which affects the cache footprint of the workload and its effective
    cache thrashing.

    Aggressively preempt a task if its avg overlap is very small; this should
    avoid the task going to sleep only for us to find it still running when we
    schedule back to it - saving a wakeup.
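
    As a sketch (the cutoff below reuses the migration-cost sysctl as a
    plausible threshold; treat the names as illustrative): if both the current
    task and the waking task have a tiny average overlap, preempt right away.

    #include <stdio.h>

    static unsigned long long sched_migration_cost = 500000;   /* 0.5 ms cutoff */

    struct se { unsigned long long avg_overlap; };  /* avg run-after-wakeup time */

    /* Sketch of the WAKEUP_OVERLAP idea: very small overlap means the current
     * task is about to sleep anyway, so preempting it now saves a full wakeup. */
    static int wakeup_overlap_preempt(const struct se *curr, const struct se *woken)
    {
        return curr->avg_overlap < sched_migration_cost &&
               woken->avg_overlap < sched_migration_cost;
    }

    int main(void)
    {
        struct se oltp_server = { .avg_overlap = 30000 };   /* 30 us */
        struct se oltp_client = { .avg_overlap = 20000 };   /* 20 us */

        printf("preempt now: %d\n",
               wakeup_overlap_preempt(&oltp_server, &oltp_client));   /* 1 */
        return 0;
    }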

    Reported-by: Lin Ming
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

21 Aug, 2008

1 commit

  • Yanmin reported a significant regression on his 16-core machine due to:

    commit 93b75217df39e6d75889cc6f8050343286aff4a5
    Author: Peter Zijlstra
    Date: Fri Jun 27 13:41:33 2008 +0200

    Flip back to the old behaviour.

    Reported-by: "Zhang, Yanmin"
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

27 Jun, 2008

5 commits

  • Measurement shows that the difference between cgroup:/ and cgroup:/foo
    wake_affine() results is that the latter succeeds significantly more.

    Therefore bias the calculations towards failing the test.

    Signed-off-by: Peter Zijlstra
    Cc: Srivatsa Vaddagiri
    Cc: Mike Galbraith
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • We found that the affine wakeup code needs rather accurate load figures
    to be effective. The trouble is that updating the load figures is fairly
    expensive with group scheduling. Therefore ratelimit the updating.
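
    A sketch of the ratelimit (hypothetical clock source and interval; the
    real knob was a sysctl in nanoseconds): skip the expensive recomputation
    unless enough time has passed since the last one.

    #include <stdio.h>

    static unsigned long long sysctl_shares_ratelimit = 250000;  /* 250 us, made up */
    static unsigned long long last_update;

    static void walk_tg_tree_and_update(void) { /* the expensive part, stubbed */ }

    /* Only recompute the group shares if the last update is old enough. */
    static void update_shares(unsigned long long now)
    {
        if (now - last_update < sysctl_shares_ratelimit)
            return;                      /* too soon, keep the stale figures */
        last_update = now;
        walk_tg_tree_and_update();
    }

    int main(void)
    {
        update_shares(1000000);   /* runs the update                     */
        update_shares(1100000);   /* skipped: only 100 us since the last */
        update_shares(2000000);   /* runs again                          */
        printf("last update at %llu ns\n", last_update);
        return 0;
    }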

    Signed-off-by: Peter Zijlstra
    Cc: Srivatsa Vaddagiri
    Cc: Mike Galbraith
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The bias given by source/target_load functions can be very large, disable
    it by default to get faster convergence.

    Signed-off-by: Peter Zijlstra
    Cc: Srivatsa Vaddagiri
    Cc: Mike Galbraith
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • calc_delta_asym() is supposed to do the same as calc_delta_fair() except
    that it linearly shrinks the result for negative nice processes - this
    causes them to have a smaller preemption threshold so that they are more
    easily preempted.

    The problem is that for task groups se->load.weight is the per cpu share of
    the actual task group weight; take that into account.

    Also provide a debug switch to disable the asymmetry (which I still don't
    like - but it does greatly benefit some workloads)

    This would explain the interactivity issues reported against group scheduling.

    Signed-off-by: Peter Zijlstra
    Cc: Srivatsa Vaddagiri
    Cc: Mike Galbraith
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Try again..

    initial commit: 8f1bc385cfbab474db6c27b5af1e439614f3025c
    revert: f9305d4a0968201b2818dbed0dc8cb0d4ee7aeb3

    Signed-off-by: Peter Zijlstra
    Cc: Srivatsa Vaddagiri
    Cc: Mike Galbraith
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

10 Jun, 2008

1 commit


20 Apr, 2008

1 commit