20 Apr, 2008

40 commits

  • provide a text-based interface to the scheduler features; this saves
    the 'user' from setting bits using decimal arithmetic.
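
    A minimal userspace sketch of the idea in C (hypothetical feature
    names, not the kernel's actual code): map names to bit positions so
    the user writes a name rather than computing a mask by hand.

        #include <stdio.h>
        #include <string.h>

        /* Hypothetical feature names standing in for the real ones. */
        static const char *feat_names[] = { "FEAT_A", "FEAT_B", "FEAT_C" };

        /* Return mask with the named feature's bit set; unknown names
         * are silently ignored. */
        static unsigned int set_feat(unsigned int mask, const char *name)
        {
                unsigned int i;

                for (i = 0; i < sizeof(feat_names) / sizeof(feat_names[0]); i++)
                        if (!strcmp(name, feat_names[i]))
                                mask |= 1u << i;
                return mask;
        }

        int main(void)
        {
                /* "FEAT_B" instead of remembering FEAT_B is bit 1 == 2. */
                printf("mask = %#x\n", set_feat(0, "FEAT_B"));
                return 0;
        }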

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • unused at the moment.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Print a tree of weights.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • In order to level the hierarchy, we need to calculate load based on the
    root view. That is, each task's load is in the same unit.

          A
         / \
        B   1
       / \
      2   3

    To compute 1's load we do:

        weight(1) / rq_weight(A)

    To compute 2's load we do:

        (weight(2) / rq_weight(B)) * (weight(B) / rq_weight(A))

    This yields load fractions in comparable units.
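
    A quick worked example (weights chosen for illustration): if every
    entity has weight 1024, then rq_weight(A) = weight(B) + weight(1) =
    2048 and rq_weight(B) = 2048, so task 1's load is 1024/2048 = 1/2
    while tasks 2 and 3 each get (1024/2048) * (1024/2048) = 1/4; the
    three fractions sum to 1 and compare directly.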

    The consequence is that it changes virtual time. We used to have:

        vtime_{i} = time_{i} / weight_{i}

        vtime = \Sum vtime_{i} = time / rq_weight.

    But with the new way of load calculation we get that vtime equals time.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • De-couple load-balancing from the rb-trees, so that I can change their
    organization.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Currently FAIR_GROUP sched grows the scheduler latency outside of
    sysctl_sched_latency; invert this so it stays within.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Now that the group hierarchy can have an arbitrary depth the O(n^2) nature
    of RT task dequeues will really hurt. Optimize this by providing space to
    store the tree path, so we can walk it the other way.
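
    A sketch of the trick with made-up types (the real patch records the
    path of rt entities; this is an assumption-laden illustration):

        #include <stdio.h>

        struct node {
                struct node *parent;
                const char *name;
        };

        #define MAX_DEPTH 16    /* assumes the hierarchy is no deeper */

        static void visit(struct node *n)
        {
                printf("dequeue %s\n", n->name);
        }

        /* Record the leaf-to-root path once, then replay it root-to-leaf,
         * instead of re-walking to the root at every level (O(n^2)). */
        static void walk_top_down(struct node *leaf)
        {
                struct node *path[MAX_DEPTH];
                int depth = 0;
                struct node *n;

                for (n = leaf; n; n = n->parent)
                        path[depth++] = n;      /* leaf .. root */

                while (depth--)
                        visit(path[depth]);     /* root .. leaf */
        }

        int main(void)
        {
                struct node root = { NULL, "root" };
                struct node mid  = { &root, "mid" };
                struct node leaf = { &mid, "leaf" };

                walk_top_down(&leaf);
                return 0;
        }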

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Add some extra debug output so we can get a better overview of the
    full hierarchy.

    We print the cgroup path after each cfs_rq, so we can see what group
    we're looking at.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Implement SMP nice support for the full group hierarchy.

    On each load-balance action, compile a sched_domain wide view of the full
    task_group tree. We compute the domain wide view when walking down the
    hierarchy, and readjust the weights when walking back up.

    After collecting and readjusting the domain wide view, we try to
    balance the tasks within the task_groups. The current approach is to
    naively balance each task group until we've moved the targeted
    amount of load.

    Inspired by Srivatsa Vaddagiri's previous code and Abhishek Chandra's
    H-SMP paper.

    XXX: there will be some numerical issues due to the limited nature of
    SCHED_LOAD_SCALE wrt representing a task_group's influence on the
    total weight. When the tree is deep enough, or the task weight small
    enough, we'll run out of bits.
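
    To make the XXX concrete: SCHED_LOAD_SCALE is 1 << 10, so a group's
    share of the total weight carries roughly 10 bits of precision. A
    nice-0 task (weight 1024) under two nested groups that each hold
    1/32 of their parent contributes 1024 / 32 / 32 = 1; one level
    deeper, or a lighter task, and the contribution rounds to zero.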

    Signed-off-by: Peter Zijlstra
    CC: Abhishek Chandra
    CC: Srivatsa Vaddagiri
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • [rebased for sched-devel/latest]

    - Add a new cpuset file, 'sched_relax_domain_level'.

    - Modify partition_sched_domains() and build_sched_domains()
      to take an attributes parameter passed from cpuset.

    - Fill newidle_idx for node domains, which is currently unused but
      might be required if sched_relax_domain_level becomes higher.

    - The default level can be changed with the boot option
      'relax_domain_level='.

    Signed-off-by: Hidetoshi Seto
    Signed-off-by: Ingo Molnar

    Hidetoshi Seto
     
  • This patch introduces a new cpuset feature: sched domain
    customization.

    This version provides a per-cpuset file 'sched_relax_domain_level'
    that enables us to change the search range of the scheduler, which
    limits how many cpus the scheduler searches at certain scheduling
    events, such as waking up a task or pulling work when a runqueue
    runs empty.

    Signed-off-by: Hidetoshi Seto
    Signed-off-by: Ingo Molnar

    Hidetoshi Seto
     
  • Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • multi level rt constraints

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Add the full parent<->child relation into task_groups as well.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • UID grouping doesn't actually have a task_group representing the root of
    the task_group tree. Add one.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • This patch makes the group scheduler multi hierarchy aware.

    [a.p.zijlstra@chello.nl: rt-parts and assorted fixes]
    Signed-off-by: Dhaval Giani
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Dhaval Giani
     
  • This patch allows tasks and groups to exist in the same cfs_rq. With
    this change, CFS group scheduling moves from a 1/(1+N) to a 1/(M+N)
    fairness model, where M tasks and N groups exist at the cfs_rq
    level.
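
    For instance, at equal weight with M = 2 tasks and N = 1 group on a
    cfs_rq, each task and the group now receive 1/3 of the CPU apiece
    (the group's third is then divided among its own members), rather
    than the two tasks competing as a single implicit entity against
    the group.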

    [a.p.zijlstra@chello.nl: rt bits and assorted fixes]
    Signed-off-by: Dhaval Giani
    Signed-off-by: Srivatsa Vaddagiri
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Dhaval Giani
     
  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Add a new function that accepts a pointer to the "newly allowed cpus"
    cpumask argument.

    int set_cpus_allowed_ptr(struct task_struct *p, const cpumask_t *new_mask)

    The current set_cpus_allowed() function is modified to use the above
    but this does not result in an ABI change. And with some compiler
    optimization help, it may not introduce any additional overhead.

    Additionally, to enforce the read only nature of the new_mask arg, the
    "const" property is migrated to sub-functions called by set_cpus_allowed.
    This silences compiler warnings.
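
    Presumably the compatibility wrapper looks something like this
    sketch:

        /* Old by-value entry point kept as-is; it takes the address of
         * its local copy and calls the new by-reference function. */
        static inline int set_cpus_allowed(struct task_struct *p,
                                           cpumask_t new_mask)
        {
                return set_cpus_allowed_ptr(p, &new_mask);
        }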

    Signed-off-by: Mike Travis
    Signed-off-by: Ingo Molnar

    Mike Travis
     
  • Move the setting of nr_cpu_ids from sched_init() to start_kernel()
    so that it's available as early as possible.

    Note that an arch has the option of setting it even earlier if need
    be, but it should not result in a different value than what
    setup_nr_cpu_ids() computes.
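
    A sketch of what such a setup function might look like (assumed, not
    quoted from the patch):

        /* Derive nr_cpu_ids from the highest possible cpu, early enough
         * for later boot code to size allocations with it. */
        static void __init setup_nr_cpu_ids(void)
        {
                int cpu, highest_cpu = 0;

                for_each_possible_cpu(cpu)
                        highest_cpu = cpu;      /* ascending order */
                nr_cpu_ids = highest_cpu + 1;
        }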

    Signed-off-by: Mike Travis
    Signed-off-by: Ingo Molnar

    Mike Travis
     
  • * Remove another cpumask_t variable from stack that was missed in the
    last kernel/sched.c updates.

    Signed-off-by: Mike Travis
    Signed-off-by: Ingo Molnar

    Mike Travis
     
  • * Add cpu_sysdev_class functions to display the following maps
    with cpulist_scnprintf() (a sketch of such a show function follows
    below):

    cpu_online_map
    cpu_present_map
    cpu_possible_map

    * Small change to include/linux/sysdev.h to allow the attribute
    name and label to be different (to avoid collision with the
    "attr_online" entry for bringing cpus on- and off-line.)

    Cc: H. Peter Anvin
    Signed-off-by: Mike Travis
    Signed-off-by: Ingo Molnar

    Mike Travis
     
  • * Cleaned up references to cpumask_scnprintf() and added new
    cpulist_scnprintf() interfaces where appropriate.

    * Fix some small bugs (or code efficiency improvements) for various
    uses of cpumask_scnprintf.

    * Clean up some checkpatch errors.
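
    (For reference, the two output formats: cpumask_scnprintf() prints
    the hex mask form, e.g. "000000ff" for cpus 0-7, while
    cpulist_scnprintf() prints the range list "0-7", which stays
    readable on large machines.)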

    Signed-off-by: Mike Travis
    Signed-off-by: Ingo Molnar

    Mike Travis
     
  • * Removed kmalloc (or local array) in show_shared_cpu_map().

    * Added show_shared_cpu_list() function.

    Signed-off-by: Mike Travis
    Signed-off-by: Ingo Molnar

    Mike Travis
     
  • * Here is a simple patch to use an allocated array of cpumasks to
    represent cpumask_of_cpu() instead of constructing one on the stack.
    It's based on the Kconfig option "HAVE_CPUMASK_OF_CPU_MAP" which is
    currently only set for x86_64 SMP. Otherwise the existing
    cpumask_of_cpu() is used but has been changed to produce an lvalue
    so a pointer to it can be used.
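
    The shape of the change is presumably along these lines (sketch;
    exact declarations assumed):

        #ifdef CONFIG_HAVE_CPUMASK_OF_CPU_MAP
        extern const cpumask_t *cpumask_of_cpu_map;
        #define cpumask_of_cpu(cpu)     (cpumask_of_cpu_map[cpu])
        #endif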

    Cc: H. Peter Anvin
    Signed-off-by: Christoph Lameter
    Signed-off-by: Mike Travis
    Signed-off-by: Ingo Molnar

    Mike Travis
     
  • * Add a pointer, CPU_MASK_ALL_PTR, referencing a static CPU_MASK_ALL
    cpumask_t (see the sketch below). This reduces, where possible, the
    instances where CPU_MASK_ALL allocates and fills a large array on
    the stack. Used only if NR_CPUS > BITS_PER_LONG.

    * Change init/main.c to use new set_cpus_allowed_ptr().

    Depends on:
    [sched-devel]: sched: add new set_cpus_allowed_ptr function
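
    A rough sketch of the pointer reference (variable name assumed):

        /* Only when NR_CPUS > BITS_PER_LONG is the mask large enough
         * for the on-stack copy to matter. */
        #if NR_CPUS > BITS_PER_LONG
        extern cpumask_t _cpu_mask_all;       /* statically CPU_MASK_ALL */
        #define CPU_MASK_ALL_PTR        (&_cpu_mask_all)
        #endif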

    Cc: H. Peter Anvin
    Signed-off-by: Mike Travis
    Signed-off-by: Ingo Molnar

    Mike Travis
     
  • * Remove empty cpumask_t (and all non-zero/non-null) variables
    in SD_*_INIT macros. Use memset(0) to clear. Also, don't
    inline the initializer functions to save on stack space in
    build_sched_domains().

    * Merge change to include/linux/topology.h that uses the new
    node_to_cpumask_ptr function in the nr_cpus_node macro into
    this patch.

    Depends on:
    [mm-patch]: asm-generic-add-node_to_cpumask_ptr-macro.patch
    [sched-devel]: sched: add new set_cpus_allowed_ptr function

    Cc: H. Peter Anvin
    Signed-off-by: Mike Travis
    Signed-off-by: Ingo Molnar

    Mike Travis
     
  • * Use new node_to_cpumask_ptr. This creates a pointer to the
    cpumask for a given node. This definition is in mm patch:

    asm-generic-add-node_to_cpumask_ptr-macro.patch

    * Use new set_cpus_allowed_ptr function.

    Depends on:
    [mm-patch]: asm-generic-add-node_to_cpumask_ptr-macro.patch
    [sched-devel]: sched: add new set_cpus_allowed_ptr function
    [x86/latest]: x86: add cpus_scnprintf function

    Cc: Greg Kroah-Hartman
    Cc: Greg Banks
    Cc: H. Peter Anvin
    Signed-off-by: Mike Travis
    Signed-off-by: Ingo Molnar

    Mike Travis
     
  • * Modify sched_affinity functions to pass cpumask_t variables by reference
    instead of by value.

    * Use new set_cpus_allowed_ptr function.

    Depends on:
    [sched-devel]: sched: add new set_cpus_allowed_ptr function

    Cc: Paul Jackson
    Cc: Cliff Wickman
    Signed-off-by: Mike Travis
    Signed-off-by: Ingo Molnar

    Mike Travis
     
  • * Modify cpuset_cpus_allowed to return the currently allowed cpuset
    via a pointer argument instead of as the function return value (see
    the sketch below).

    * Use new set_cpus_allowed_ptr function.

    * Cleanup CPU_MASK_ALL and NODE_MASK_ALL uses.
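
    The interface change in the first point presumably reads:

        -cpumask_t cpuset_cpus_allowed(struct task_struct *tsk);
        +void cpuset_cpus_allowed(struct task_struct *tsk, cpumask_t *mask);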

    Depends on:
    [sched-devel]: sched: add new set_cpus_allowed_ptr function

    Signed-off-by: Mike Travis
    Signed-off-by: Ingo Molnar

    Mike Travis
     
  • * Use new set_cpus_allowed_ptr() function added by previous patch,
    which instead of passing the "newly allowed cpus" cpumask_t arg
    by value, pass it by pointer:

    -int set_cpus_allowed(struct task_struct *p, cpumask_t new_mask)
    +int set_cpus_allowed_ptr(struct task_struct *p, const cpumask_t *new_mask)

    * Modify CPU_MASK_ALL

    Depends on:
    [sched-devel]: sched: add new set_cpus_allowed_ptr function

    Signed-off-by: Mike Travis
    Signed-off-by: Ingo Molnar

    Mike Travis
     
  • * Use new set_cpus_allowed_ptr() function added by previous patch,
    which instead of passing the "newly allowed cpus" cpumask_t arg
    by value, pass it by pointer:

    -int set_cpus_allowed(struct task_struct *p, cpumask_t new_mask)
    +int set_cpus_allowed_ptr(struct task_struct *p, const cpumask_t *new_mask)

    * Cleanup uses of CPU_MASK_ALL.

    * Collapse other NR_CPUS changes into
    arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c. Use pointers to
    cpumask_t arguments whenever possible.

    Depends on:
    [sched-devel]: sched: add new set_cpus_allowed_ptr function

    Cc: Len Brown
    Cc: Dave Jones
    Signed-off-by: Mike Travis
    Signed-off-by: Ingo Molnar

    Mike Travis
     
  • * Change fixed size arrays to per_cpu variables or dynamically allocated
    arrays in sched_init() and sched_init_smp().

    (1) static struct sched_entity *init_sched_entity_p[NR_CPUS];
    (1) static struct cfs_rq *init_cfs_rq_p[NR_CPUS];
    (1) static struct sched_rt_entity *init_sched_rt_entity_p[NR_CPUS];
    (1) static struct rt_rq *init_rt_rq_p[NR_CPUS];
    static struct sched_group **sched_group_nodes_bycpu[NR_CPUS];

    (1) - these arrays are allocated via alloc_bootmem_low()

    * Change sched_domain_debug_one() to use cpulist_scnprintf instead of
    cpumask_scnprintf. This reduces the output buffer required and
    improves readability when large NR_CPUS count machines arrive.

    * In sched_create_group() we allocate new arrays based on nr_cpu_ids.
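
    For the last point, the allocation pattern is presumably of this
    form (field names assumed):

        tg->se = kzalloc(sizeof(struct sched_entity *) * nr_cpu_ids,
                         GFP_KERNEL);
        tg->cfs_rq = kzalloc(sizeof(struct cfs_rq *) * nr_cpu_ids,
                             GFP_KERNEL);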

    Signed-off-by: Mike Travis
    Signed-off-by: Ingo Molnar

    Mike Travis
     
  • * Replace usages of CPU_MASK_NONE, CPU_MASK_ALL, NODE_MASK_NONE,
    NODE_MASK_ALL to reduce stack requirements for large NR_CPUS
    and MAXNODES counts.

    * In some cases, the cpumask variable was initialized but then overwritten
    with another value. This is the case for changes like this:

    - cpumask_t oldmask = CPU_MASK_ALL;
    + cpumask_t oldmask;

    Signed-off-by: Mike Travis
    Signed-off-by: Ingo Molnar

    Mike Travis
     
  • * Move large array "struct bootnode nodes" from stack to __initdata
    section to reduce amount of stack space required.

    Cc: H. Peter Anvin
    Signed-off-by: Mike Travis
    Signed-off-by: Ingo Molnar

    Mike Travis
     
  • Create a simple macro to always return a pointer to the node_to_cpumask(node)
    value. This relies on compiler optimization to remove the extra indirection:

    #define node_to_cpumask_ptr(v, node) \
            cpumask_t _##v = node_to_cpumask(node), *v = &_##v

    For those systems with a large cpumask size, a true pointer to the
    array element can be used instead:

    #define node_to_cpumask_ptr(v, node) \
            cpumask_t *v = &(node_to_cpumask_map[node])

    A node_to_cpumask_ptr_next() macro is provided to access another
    node_to_cpumask value.

    The other change is to always include asm-generic/topology.h, moving
    the ifdef CONFIG_NUMA into that same file.

    Note: there are no references to either of these new macros in this patch,
    only the definition.
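
    For illustration only (as noted, the patch adds no users), a caller
    might look like:

        /* node_to_cpumask_ptr() declares and initializes 'mask'. */
        static int first_cpu_of_node(int node)
        {
                node_to_cpumask_ptr(mask, node);

                return first_cpu(*mask);
        }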

    Based on 2.6.25-rc5-mm1

    # alpha
    Cc: Richard Henderson

    # fujitsu
    Cc: David Howells

    # ia64
    Cc: Tony Luck

    # powerpc
    Cc: Paul Mackerras
    Cc: Anton Blanchard

    # sparc
    Cc: David S. Miller
    Cc: William L. Irwin

    # x86
    Cc: H. Peter Anvin

    Signed-off-by: Mike Travis
    Signed-off-by: Ingo Molnar

    Mike Travis
     
  • Change the following arrays sized by NR_CPUS to be PERCPU variables:

    static struct op_msrs cpu_msrs[NR_CPUS];
    static unsigned long saved_lvtpc[NR_CPUS];

    Also some minor complaints from checkpatch.pl fixed.

    Based on:
    git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
    git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86.git

    All changes were transparent except for:

    static void nmi_shutdown(void)
    {
    +       struct op_msrs *msrs = &__get_cpu_var(cpu_msrs);
            nmi_enabled = 0;
            on_each_cpu(nmi_cpu_shutdown, NULL, 0, 1);
            unregister_die_notifier(&profile_exceptions_nb);
    -       model->shutdown(cpu_msrs);
    +       model->shutdown(msrs);
            free_msrs();
    }

    The existing code passed a reference to cpu 0's instance of struct
    op_msrs to model->shutdown, whilst the other functions are passed a
    reference to this cpu's instance of struct op_msrs. This seemed to
    be a bug to me, even though as long as cpu 0 and this cpu are of the
    same type it would have the same effect...?

    Cc: Philippe Elie
    Signed-off-by: Mike Travis
    Signed-off-by: Ingo Molnar

    Mike Travis
     
  • * Change the following static arrays sized by NR_CPUS to
    per_cpu data variables:

    _cpuid4_info *cpuid4_info[NR_CPUS];
    _index_kobject *index_kobject[NR_CPUS];
    kobject * cache_kobject[NR_CPUS];

    * Replace the local NR_CPUS-sized array with a kmalloc'd region in
    show_shared_cpu_map().

    Also some minor complaints from checkpatch.pl fixed.
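
    The conversion pattern for the first point, sketched as a diff (not
    quoted from the patch):

        -static struct _cpuid4_info *cpuid4_info[NR_CPUS];
        +static DEFINE_PER_CPU(struct _cpuid4_info *, cpuid4_info);

    with accesses changing from cpuid4_info[cpu] to
    per_cpu(cpuid4_info, cpu).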

    Cc: H. Peter Anvin
    Cc: Andi Kleen
    Signed-off-by: Mike Travis
    Signed-off-by: Ingo Molnar

    Mike Travis
     
  • Add a new function cpumask_scnprintf_len() to return the number of
    characters needed to display "len" cpumask bits. The current method
    of allocating NR_CPUS bytes is incorrect as what's really needed is
    9 characters per 32-bit word of cpumask bits (8 hex digits plus the
    separator [','] or the terminating NULL.) This function provides the
    caller the means to allocate the correct string length.
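
    A worked example: with NR_CPUS = 1024 the mask occupies 32 32-bit
    words, so 32 * 9 = 288 characters suffice, versus the 1024 bytes the
    old method would allocate.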

    Cc: Paul Jackson
    Signed-off-by: Mike Travis
    Signed-off-by: Ingo Molnar

    Mike Travis