13 Jan, 2006

18 commits

  • )

    From: Al Viro

    task_pt_regs() needs the same offset-by-8 to match copy_thread()

    Signed-off-by: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    akpm@osdl.org
     
  • Signed-off-by: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Al Viro
     
  • Signed-off-by: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Al Viro
     
  • Signed-off-by: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Al Viro
     
  • Signed-off-by: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Al Viro
     
  • )

    From: Al Viro

    Rename alpha_task_regs() to task_pt_regs() and switch open-coded
    instances to the helper.

    Signed-off-by: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    akpm@osdl.org
     
  • Use task_stack_page() for accesses to the stack page of a task in
    alpha-specific parts of the tree.

    Signed-off-by: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Al Viro
     
  • Use task_thread_info() for accesses to the thread_info of a task in
    arch/alpha and include/asm-alpha.

    Signed-off-by: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Al Viro
     
  • This patchset annotates arch/* uses of ->thread_info. Ones that really
    are about access to the thread_info of a given process are simply
    switched to task_thread_info(task); ones that deal with access to
    objects on the stack are switched to a new helper, task_stack_page().
    A _lot_ of the latter are actually open-coded instances of "find where
    pt_regs are"; those are consolidated into task_pt_regs(task) (many
    architectures already have such a helper).

    Note that these annotations are not mandatory - any code not converted
    to these helpers still works. However, they clean up a lot of places
    and have actually caught a number of bugs, so converting out-of-tree
    ports would be a good idea...

    As an example of breakage caught by this stuff, see the i386 pt_regs
    mess - it used to be open-coded in a bunch of places, and when Stas
    fixed a bug in copy_thread() back in April, the rest were left out of
    sync. That required two followup patches (the latest just before
    2.6.15) _and_ still left the /proc/*/stat eip field broken. Try
    ps -eo eip on i386 and watch the junk...

    This patch:

    New helper: task_stack_page(task). It returns a pointer to the memory
    object containing the task's stack; the thread_info of the task
    usually sits at the beginning of that object (see the sketch after
    this entry).

    Signed-off-by: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Al Viro
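
    A minimal sketch of what these helpers can look like (illustrative
    only: the real definitions are per-architecture, and the THREAD_SIZE
    value and pt_regs placement below are assumptions for a typical port
    where thread_info sits at the base of the stack object and pt_regs
    at its top):

        struct thread_info;
        struct pt_regs;

        #define THREAD_SIZE 8192        /* assumed stack-object size */

        /* thread_info of a task: simply its ->thread_info pointer */
        #define task_thread_info(tsk)   ((tsk)->thread_info)

        /* the memory object containing the task's stack; thread_info
         * usually sits at the beginning of that object */
        #define task_stack_page(tsk)    ((void *)(tsk)->thread_info)

        /* consolidated "find where pt_regs are": on many ports they
         * live at the very top of the stack page */
        #define task_pt_regs(tsk) \
                ((struct pt_regs *)((unsigned long)task_stack_page(tsk) + \
                                    THREAD_SIZE) - 1)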
     
  • )

    From: Nick Piggin

    Track the last waker CPU, and only consider wakeup-balancing if
    there's a match between the current waker CPU and the previous one.
    This ensures some correlation between two subsequent wakeup events
    before we move the task. It should help random-wakeup workloads on
    large SMP systems by reducing migration attempts by a factor of
    nr_cpus (a sketch of the heuristic follows this entry).

    Signed-off-by: Ingo Molnar
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    akpm@osdl.org
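
    A compilable miniature of the heuristic (the struct and field names
    below are illustrative stand-ins, not the actual scheduler code):

        #include <stdbool.h>

        /* stand-in for the scheduler's per-task state */
        struct task {
                int last_waker_cpu;     /* CPU that last woke this task */
        };

        /* Consider wakeup balancing only when the current waker CPU
         * matches the previous one, i.e. two consecutive wakeups are
         * correlated; either way, remember the waker for next time. */
        static bool wake_balance_candidate(struct task *p, int waker_cpu)
        {
                bool correlated = (p->last_waker_cpu == waker_cpu);

                p->last_waker_cpu = waker_cpu;
                return correlated;
        }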
     
  • )

    From: Ingo Molnar

    This is the latest version of the scheduler cache-hot-auto-tune patch.

    The first problem was that detection time scaled with O(N^2), which is
    unacceptable on larger SMP and NUMA systems. To solve this:

    - I've added a 'domain distance' function, which is used to cache
    measurement results. Each distance is only measured once. This means
    that e.g. on NUMA distances of 0, 1 and 2 might be measured, on HT
    distances 0 and 1, and on SMP distance 0 is measured. The code walks
    the domain tree to determine the distance, so it automatically follows
    whatever hierarchy an architecture sets up. This cuts down on the boot
    time significantly and removes the O(N^2) limit. The only assumption
    is that migration costs can be expressed as a function of domain
    distance - this covers the overwhelming majority of existing systems,
    and is a good guess even for more asymmetric systems.

    [ People hacking systems that have asymmetries that break this
    assumption (e.g. different CPU speeds) should experiment a bit with
    the cpu_distance() function. Adding a ->migration_distance factor to
    the domain structure would be one possible solution - but let's first
    see the problem systems, if they exist at all. Let's not overdesign. ]

    Another problem was that only a single cache-size was used for measuring
    the cost of migration, and most architectures didn't set that variable
    up. Furthermore, a single cache-size does not fit NUMA hierarchies with
    L3 caches and does not fit HT setups, where different CPUs will often
    have different 'effective cache sizes'. To solve this problem:

    - Instead of relying on a single cache-size provided by the platform
    and sticking to it, the code now auto-detects the 'effective migration
    cost' between two measured CPUs by iterating through a wide range of
    cache sizes. The code searches for the maximum migration cost, which
    occurs when the working set of the test-workload falls just below the
    'effective cache size'. I.e., a real-life, optimized search is done
    for the maximum migration cost between two real CPUs (a simplified
    sketch of this search follows this entry).

    This, amongst other things, has the positive effect that if e.g. two
    CPUs share an L2/L3 cache, a different (and accurate) migration cost
    will be found than between two CPUs on the same system that don't
    share any caches.

    (The reliable measurement of migration costs is tricky - see the source
    for details.)

    Furthermore, I've added various boot-time options to override/tune
    migration behavior.

    Firstly, there's a blanket override for autodetection:

    migration_cost=1000,2000,3000

    will override the depth 0/1/2 values with 1msec/2msec/3msec values.

    Secondly, there's a global factor that can be used to increase (or
    decrease) the autodetected values:

    migration_factor=120

    will increase the autodetected values by 20%. This option is useful to
    tune things in a workload-dependent way - e.g. if a workload is
    cache-insensitive then CPU utilization can be maximized by specifying
    migration_factor=0.

    I've tested the autodetection code quite extensively on x86, on three
    systems (dual Celeron, dual HT P4, and an 8-way P3/Xeon/2MB), and the
    autodetected values look pretty good:

    Dual Celeron (128K L2 cache):

    ---------------------
    migration cost matrix (max_cache_size: 131072, cpu: 467 MHz):
    ---------------------
    [00] [01]
    [00]: - 1.7(1)
    [01]: 1.7(1) -
    ---------------------
    cacheflush times [2]: 0.0 (0) 1.7 (1784008)
    ---------------------

    Here the slow memory subsystem dominates system performance, and even
    though caches are small, the migration cost is 1.7 msecs.

    Dual HT P4 (512K L2 cache):

    ---------------------
    migration cost matrix (max_cache_size: 524288, cpu: 2379 MHz):
    ---------------------
    [00] [01] [02] [03]
    [00]: - 0.4(1) 0.0(0) 0.4(1)
    [01]: 0.4(1) - 0.4(1) 0.0(0)
    [02]: 0.0(0) 0.4(1) - 0.4(1)
    [03]: 0.4(1) 0.0(0) 0.4(1) -
    ---------------------
    cacheflush times [2]: 0.0 (33900) 0.4 (448514)
    ---------------------

    Here it can be seen that there is no migration cost between two HT
    siblings (CPU#0/2 and CPU#1/3 are separate physical CPUs). A fast memory
    system makes inter-physical-CPU migration pretty cheap: 0.4 msecs.

    8-way P3/Xeon [2MB L2 cache]:

    ---------------------
    migration cost matrix (max_cache_size: 2097152, cpu: 700 MHz):
    ---------------------
    [00] [01] [02] [03] [04] [05] [06] [07]
    [00]: - 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1)
    [01]: 19.2(1) - 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1)
    [02]: 19.2(1) 19.2(1) - 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1)
    [03]: 19.2(1) 19.2(1) 19.2(1) - 19.2(1) 19.2(1) 19.2(1) 19.2(1)
    [04]: 19.2(1) 19.2(1) 19.2(1) 19.2(1) - 19.2(1) 19.2(1) 19.2(1)
    [05]: 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) - 19.2(1) 19.2(1)
    [06]: 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) - 19.2(1)
    [07]: 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) -
    ---------------------
    cacheflush times [2]: 0.0 (0) 19.2 (19281756)
    ---------------------

    This one has huge caches and a relatively slow memory subsystem - so the
    migration cost is 19 msecs.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Ashok Raj
    Signed-off-by: Ken Chen
    Signed-off-by: John Hawkes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    akpm@osdl.org
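
    A simplified sketch of the maximum-cost search described above
    (measure_one_cost() is a hypothetical placeholder for the actual
    timing code, and the sweep granularity is an assumption):

        #include <stddef.h>

        /* placeholder: time one task migration between two CPUs for a
         * given working-set size; the real code is far more careful */
        unsigned long long measure_one_cost(int cpu1, int cpu2,
                                            size_t working_set);

        static unsigned long long
        find_max_migration_cost(int cpu1, int cpu2, size_t max_cache_size)
        {
                unsigned long long cost, max_cost = 0;
                size_t size, step = max_cache_size / 10;

                if (step == 0)
                        step = 1;

                /* sweep working-set sizes around the cache size; the
                 * cost peaks when the set just fits the effective cache */
                for (size = step; size <= 2 * max_cache_size; size += step) {
                        cost = measure_one_cost(cpu1, cpu2, size);
                        if (cost > max_cost)
                                max_cost = cost;
                }
                return max_cost;
        }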
     
  • Add a per-arch sched_cacheflush(), a write-back cache flush used by
    the migration-cost calibration code at boot time (see the sketch
    after this entry).

    Signed-off-by: Ingo Molnar
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
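
    On x86, for instance, such a write-back flush can be implemented
    with the wbinvd instruction; a sketch (the empty fallback for other
    architectures is an assumption, not the actual per-arch code):

        /* write back and invalidate all caches on this CPU so the
         * calibration run starts from a known-cold cache */
        static inline void sched_cacheflush(void)
        {
        #if defined(__i386__) || defined(__x86_64__)
                asm volatile("wbinvd" ::: "memory");
        #else
                /* ports without a suitable instruction may leave this
                 * empty; calibration is then less precise */
        #endif
        }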
     
  • Fixes bugzilla.kernel.org bug 2903.

    Signed-off-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Remove unnecessary page++ from the memmap_init_zone() loop.

    Signed-off-by: Greg Ungerer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Ungerer
     
  • Some block drivers use device names containing '/'; e.g. the sx8
    driver uses names like sx8/0.

    This would make an md component dev name like

    /sys/block/md0/md/dev-sx8/0

    which is not allowed. So we change the '/' to '!', just like
    fs/partitions/check.c(register_disk) does (see the sketch after
    this entry).

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Neil Brown
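
    A sketch of the mangling as a standalone helper (the kernel does
    the replacement inline where the sysfs name is built):

        #include <string.h>

        /* replace every '/' in a disk name with '!' so the result is
         * legal as a single sysfs directory entry: "sx8/0" -> "sx8!0" */
        static void sanitize_disk_name(char *name)
        {
                char *s = name;

                while ((s = strchr(s, '/')) != NULL)
                        *s++ = '!';
        }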
     
  • Adapt tiny-shmem.c to the new do_truncate() prototype (a sketch of
    the call-site change follows this entry).

    Signed-off-by: Catalin Marinas
    Acked-by: Matt Mackall
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Catalin Marinas
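
    A sketch of the call-site change, assuming the new prototype gained
    a time_attrs flag and a struct file * argument (as in kernels of
    this era); treat the exact signature as an assumption:

        /* before: */
        err = do_truncate(dentry, size);

        /* after: no open file is involved in tiny-shmem's case, so
         * NULL is passed for the struct file * argument */
        err = do_truncate(dentry, size, 0, NULL);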
     
  • This ensures that reserved pages are not migrated. Reserved pages
    currently cause the WARN_ON in migrate_page_add() to trigger (see
    the sketch after this entry).

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
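
    A sketch of the idea (the wrapper function is illustrative;
    PageReserved() and migrate_page_add() are the real kernel names):

        /* skip reserved pages before queueing them for migration, so
         * the WARN_ON inside migrate_page_add() can no longer fire */
        static void queue_for_migration(struct page *page,
                                        struct list_head *pagelist,
                                        unsigned long flags)
        {
                if (PageReserved(page))
                        return;         /* never migrate reserved pages */
                migrate_page_add(page, pagelist, flags);
        }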
     
  • If ordered tag isn't supported, request ordering for barrier
    sequencing is performed by queue draining, which basically hangs the
    request queue until elv_completed_request() reports completion of all
    previous fs requests.

    The condition check in elv_completed_request() was only performed for
    fs requests. If a special request is queued between the last
    to-be-drained request and the barrier sequence, draining is never
    completed and the queue is stalled forever.

    This patch moves the end-of-draining condition check so that it is
    performed for all requests (a simplified sketch follows this entry).

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe
    Signed-off-by: Linus Torvalds

    Tejun Heo
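
    A compilable miniature of the restructuring (types and names below
    are illustrative stand-ins for the real block-layer code):

        #include <stdbool.h>

        struct request {
                bool is_fs;             /* fs request vs. special request */
        };

        struct queue {
                int in_flight;          /* requests not yet completed */
                bool draining;          /* barrier drain in progress */
        };

        static void finish_drain(struct queue *q)
        {
                q->draining = false;    /* barrier sequence may proceed */
        }

        static void completed_request(struct queue *q, struct request *rq)
        {
                q->in_flight--;

                /* Before the fix, the check below ran only for fs
                 * requests (rq->is_fs), so a special request completing
                 * last never ended the drain and the queue stalled. */
                if (q->draining && q->in_flight == 0)
                        finish_drain(q);        /* now for ALL requests */
        }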
     

12 Jan, 2006

22 commits