14 Aug, 2011

2 commits

  • Since [sched/cpupri: Remove the vec->lock], member pri_active
    of struct cpupri is not needed any more, just remove it. Also
    clean stuff related to it.

    Signed-off-by: Yong Zhang
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110806001004.GA2207@zhy
    Signed-off-by: Ingo Molnar

    Yong Zhang
     
  • sched/cpupri: Remove the vec->lock

    The cpupri vec->lock has been showing up as a top contention
    lately. This is because of the RT push/pull logic takes an
    agressive approach for migrating RT tasks. The cpupri logic is
    in place to improve the performance of the push/pull when dealing
    with large number CPU machines.

    The problem though is a vec->lock is required, where a vec is a
    global per RT priority structure. That is, if there are lots of
    RT tasks at the same priority, every time they are added or removed
    from the RT queue, this global vec->lock is taken. Now that more
    kernel threads are becoming RT (RCU boost and threaded interrupts)
    this is becoming much more of an issue.

    There are two variables that are being synced by the vec->lock.
    The cpupri bitmask, and the vec->counter. The cpupri bitmask
    is one bit per priority. If a RT priority vec has a process queued,
    then the vec->count is > 0 and the cpupri bitmask is set for that
    RT priority.

    If the cpupri bitmask gets out of sync with the vec->counter, we could
    end up pushing a low proirity RT task to a high priority queue.
    That RT task that could have run immediately could be queued on a
    run queue with a higher priority task indefinitely.

    The solution is not to use the cpupri bitmask and just look at the
    vec->count directly when doing a pull. The cpupri bitmask is just
    a fast way to scan the RT priorities when a pull is made. Instead
    of using the bitmask, and just examine all RT priorities, and
    look at the vec->counts, we could eliminate the vec->lock. The
    scan of RT tasks is to find a run queue that we can push an RT task
    to, and we do not push to a high priority queue, thus the scan only
    needs to go from 1 to RT task->prio, and not all 100 RT priorities.

    The push algorithm, which does the scan of RT priorities (and
    scan of the bitmask) only happens when we have an overloaded RT run
    queue (more than one RT task queued). The grabbing of the vec->lock
    happens every time any RT task is queued or dequeued on the run
    queue for that priority. The slowing down of the scan by not using
    a bitmask is negligible by the speed up of removing the vec->lock
    contention, and replacing it with an atomic counter and memory barrier.

    To prove this, I wrote a patch that times both the loop and the code
    that grabs the vec->locks. I passed the patches to various people
    (and companies) to test and show the results. I let everyone choose
    their own load to test, giving different loads on the system,
    for various different setups.

    Here's some of the results: (snipping to a few CPUs to not make
    this change log huge, but the results were consistent across
    the entire system).

    System 1 (24 CPUs)

    Before patch:
    CPU: Name Count Max Min Average Total
    ---- ---- ----- --- --- ------- -----
    [...]
    cpu 20: loop 3057 1.766 0.061 0.642 1963.170
    vec 6782949 90.469 0.089 0.414 2811760.503
    cpu 21: loop 2617 1.723 0.062 0.641 1679.074
    vec 6782810 90.499 0.089 0.291 1978499.900
    cpu 22: loop 2212 1.863 0.063 0.699 1547.160
    vec 6767244 85.685 0.089 0.435 2949676.898
    cpu 23: loop 2320 2.013 0.062 0.594 1380.265
    vec 6781694 87.923 0.088 0.431 2928538.224

    After patch:
    cpu 20: loop 2078 1.579 0.061 0.533 1108.006
    vec 6164555 5.704 0.060 0.143 885185.809
    cpu 21: loop 2268 1.712 0.065 0.575 1305.248
    vec 6153376 5.558 0.060 0.187 1154960.469
    cpu 22: loop 1542 1.639 0.095 0.533 823.249
    vec 6156510 5.720 0.060 0.190 1172727.232
    cpu 23: loop 1650 1.733 0.068 0.545 900.781
    vec 6170784 5.533 0.060 0.167 1034287.953

    All times are in microseconds. The 'loop' is the amount of time spent
    doing the loop across the priorities (before patch uses bitmask).
    the 'vec' is the amount of time in the code that requires grabbing
    the vec->lock. The second patch just does not have the vec lock, but
    encompasses the same code.

    Amazingly the loop code even went down on average. The vec code went
    from .5 down to .18, that's more than half the time spent!

    Note, more than one test was run, but they all had the same results.

    System 2 (64 CPUs)

    Before patch:
    CPU: Name Count Max Min Average Total
    ---- ---- ----- --- --- ------- -----
    cpu 60: loop 0 0 0 0 0
    vec 5410840 277.954 0.084 0.782 4232895.727
    cpu 61: loop 0 0 0 0 0
    vec 4915648 188.399 0.084 0.570 2803220.301
    cpu 62: loop 0 0 0 0 0
    vec 5356076 276.417 0.085 0.786 4214544.548
    cpu 63: loop 0 0 0 0 0
    vec 4891837 170.531 0.085 0.799 3910948.833

    After patch:
    cpu 60: loop 0 0 0 0 0
    vec 5365118 5.080 0.021 0.063 340490.267
    cpu 61: loop 0 0 0 0 0
    vec 4898590 1.757 0.019 0.071 347903.615
    cpu 62: loop 0 0 0 0 0
    vec 5737130 3.067 0.021 0.119 687108.734
    cpu 63: loop 0 0 0 0 0
    vec 4903228 1.822 0.021 0.071 348506.477

    The test run during the measurement did not have any (very few,
    from other CPUs) RT tasks pushing. But this shows that it helped
    out tremendously with the contention, as the contention happens
    because the vec->lock is taken only on queuing at an RT priority,
    and different CPUs that queue tasks at the same priority will
    have contention.

    I tested on my own 4 CPU machine with the following results:

    Before patch:
    CPU: Name Count Max Min Average Total
    ---- ---- ----- --- --- ------- -----
    cpu 0: loop 2377 1.489 0.158 0.588 1398.395
    vec 4484 770.146 2.301 4.396 19711.755
    cpu 1: loop 2169 1.962 0.160 0.576 1250.110
    vec 4425 152.769 2.297 4.030 17834.228
    cpu 2: loop 2324 1.749 0.155 0.559 1299.799
    vec 4368 779.632 2.325 4.665 20379.268
    cpu 3: loop 2325 1.629 0.157 0.561 1306.113
    vec 4650 408.782 2.394 4.348 20222.577

    After patch:
    CPU: Name Count Max Min Average Total
    ---- ---- ----- --- --- ------- -----
    cpu 0: loop 2121 1.616 0.113 0.636 1349.189
    vec 4303 1.151 0.225 0.421 1811.966
    cpu 1: loop 2130 1.638 0.178 0.644 1372.927
    vec 4627 1.379 0.235 0.428 1983.648
    cpu 2: loop 2056 1.464 0.165 0.637 1310.141
    vec 4471 1.311 0.217 0.433 1937.927
    cpu 3: loop 2154 1.481 0.162 0.601 1295.083
    vec 4236 1.253 0.230 0.425 1803.008

    This was running my migrate.c code that can be found at:
    http://lwn.net/Articles/425763/

    The migrate code does stress the RT tasks a bit. This shows that
    the loop did increase a little after the patch, but not by much.
    The vec code dropped dramatically. From 4.3us down to .42us.
    That's a 10x improvement!

    Tested-by: Mike Galbraith
    Tested-by: Luis Claudio R. Gonçalves
    Tested-by: Matthew Hank Sabins
    Signed-off-by: Steven Rostedt
    Reviewed-by: Gregory Haskins
    Acked-by: Hillf Danton
    Signed-off-by: Peter Zijlstra
    Cc: Chris Mason
    Link: http://lkml.kernel.org/r/1312317372.18583.101.camel@gandalf.stny.rr.com
    Signed-off-by: Ingo Molnar

    Steven Rostedt
     

17 Jul, 2010

1 commit

  • As of commit dcce284 ("mm: Extend gfp masking to the page
    allocator") and commit 7e85ee0 ("slab,slub: don't enable
    interrupts during early boot"), the slab allocator makes
    sure we don't attempt to sleep during boot.

    Therefore, remove bootmem special cases from the scheduler
    and use plain GFP_KERNEL instead.

    Signed-off-by: Pekka Enberg
    Cc: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Pekka Enberg
     

15 Dec, 2009

1 commit


30 Mar, 2009

1 commit


25 Nov, 2008

1 commit

  • Impact: stack usage reduction, (future) size reduction for large NR_CPUS.

    Dynamically allocating cpumasks (when CONFIG_CPUMASK_OFFSTACK) saves
    space for small nr_cpu_ids but big CONFIG_NR_CPUS.

    The fact cpupro_init is called both before and after the slab is
    available makes for an ugly parameter unfortunately.

    We also use cpumask_any_and to get rid of a temporary in cpupri_find.

    Signed-off-by: Rusty Russell
    Signed-off-by: Ingo Molnar

    Rusty Russell
     

06 Jun, 2008

3 commits

  • Peter pointed out that the last version of the "fix" was still one off
    under certain circumstances. Use BITS_TO_LONG instead to get an
    accurate result.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • A rounding error was pointed out by Peter Zijlstra which would result
    in the structure holding priorities to be off by one.

    Signed-off-by: Gregory Haskins
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Arnaldo Carvalho de Melo
    Signed-off-by: Thomas Gleixner

    Gregory Haskins
     
  • The current code use a linear algorithm which causes scaling issues
    on larger SMP machines. This patch replaces that algorithm with a
    2-dimensional bitmap to reduce latencies in the wake-up path.

    Signed-off-by: Gregory Haskins
    Acked-by: Steven Rostedt
    Signed-off-by: Ingo Molnar
    Signed-off-by: Thomas Gleixner

    Gregory Haskins