19 Oct, 2007

1 commit

  • Hell knows what happened in commit 63b05203af57e7de4f3bb63b8b81d43bc196d32b
    during 2.6.9 development. The commit introduced the io_wait field, which
    was write-only then and still remains write-only.

    Also garbage collect macros which "use" io_wait.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

18 Oct, 2007

1 commit


17 Oct, 2007

11 commits

  • For those who don't care about CONFIG_SECURITY.

    Signed-off-by: Alexey Dobriyan
    Cc: "Serge E. Hallyn"
    Cc: Casey Schaufler
    Cc: James Morris
    Cc: Stephen Smalley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • There is a nice 2-byte hole after the struct task_struct::ioprio field
    into which we can put two 1-byte fields: ->fpu_counter and ->oomkilladj.

    Signed-off-by: Alexey Dobriyan
    Acked-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • For those who deselect POSIX message queues.

    Reduces the SLAB size of user_struct from 64 to 32 bytes here, and the
    SLUB size from 40 bytes to 32 bytes.

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • robust_list, compat_robust_list, pi_state_list and pi_state_cache are
    only really used if futexes are enabled.

    Signed-off-by: Alexey Dobriyan
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • oomkilladj is an int, but the only values that can be assigned to it are
    -17 and [-16, 15], so it fits into an s8.

    While the patch itself doesn't make task_struct smaller, because of the
    natural alignment of ->link_count, it will make the picture clearer wrt
    further task_struct reduction patches. My plan is to move ->fpu_counter
    and ->oomkilladj after ->ioprio, filling the hole on i386 and x86_64. But
    that's for later, because bloated distro configs need looking at as well.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • This adds the MMF_DUMP_ELF_HEADERS option to /proc/pid/coredump_filter.
    This dumps the first page (only) of a private file mapping if it appears to
    be a mapping of an ELF file. Including these pages in the core dump may
    give sufficient identifying information to associate the original DSO and
    executable file images and their debugging information with a core file in
    a generic way just from its contents (e.g. when those binaries were built
    with ld --build-id). I expect this to become the default behavior
    eventually. Existing versions of gdb can be confused by the core dumps it
    creates, so it won't be enabled by default for some time to come. Soon many
    people will have systems with a gdb that handles these dumps, so they can
    arrange to set the bit at boot and have it inherited system-wide.

    This also cleans up the checking of the MMF_DUMP_* flag bits, which did not
    need to be using atomic macros.

    Signed-off-by: Roland McGrath
    Cc: Hidehiro Kawai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roland McGrath
     
  • Control the trigger limit for softlockup warnings. This is useful for
    debugging softlockups: lowering softlockup_thresh makes it possible to
    identify softlockups earlier.

    This patch:
    1. Adds a sysctl, softlockup_thresh, with valid values of 1-60s
    (a higher value can be used to suppress false positives)
    2. Changes the softlockup printk to print the cpu softlockup time

    [akpm@linux-foundation.org: Fix various warnings and add definition of "two"]
    Signed-off-by: Ravikiran Thirumalai
    Signed-off-by: Shai Fultheim
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ravikiran G Thirumalai
     
  • Based on ideas of Andrew:
    http://marc.info/?l=linux-kernel&m=102912915020543&w=2

    Scale the bdi dirty limit inversely with each task's dirty rate.
    This gives heavy writers a lower dirty limit than the occasional writer.

    Andrea proposed something similar:
    http://lwn.net/Articles/152277/

    The main disadvantage of his patch is that he uses an unrelated quantity to
    measure time, which leaves him with a workload-dependent tunable. Other
    than that, the two approaches appear quite similar.

    [akpm@linux-foundation.org: fix warning]
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • When CONFIG_SYSFS is not set, CONFIG_FAIR_USER_SCHED fails to build
    with

    kernel/built-in.o: In function `uids_kobject_init':
    (.init.text+0x1488): undefined reference to `kernel_subsys'
    kernel/built-in.o: In function `uids_kobject_init':
    (.init.text+0x1490): undefined reference to `kernel_subsys'
    kernel/built-in.o: In function `uids_kobject_init':
    (.init.text+0x1480): undefined reference to `kernel_subsys'
    kernel/built-in.o: In function `uids_kobject_init':
    (.init.text+0x1494): undefined reference to `kernel_subsys'

    This patch fixes this build error.

    Signed-off-by: Srivatsa Vaddagiri
    Signed-off-by: Dhaval Giani
    Signed-off-by: Ingo Molnar

    Dhaval Giani
     
  • Remove the cpuset hooks that defined sched domains depending on the setting
    of the 'cpu_exclusive' flag.

    The cpu_exclusive flag can only be set on a child if it is set on the
    parent.

    This made that flag painfully unsuitable for use as a flag defining a
    partitioning of a system.

    It was entirely unobvious to a cpuset user what partitioning of sched
    domains they would be causing when they set that one cpu_exclusive bit on
    one cpuset, because it depended on what CPUs were in the remainder of that
    cpuset's siblings and child cpusets, after subtracting out other
    cpu_exclusive cpusets.

    Furthermore, there was no way on production systems to query the
    result.

    Using the cpu_exclusive flag for this was simply wrong from the get-go.

    Fortunately, it was sufficiently borked that, so far as I know, almost no
    successful use has been made of this. One real-time group did use it to
    effectively isolate CPUs from any load balancing efforts. They are willing
    to adapt to alternative mechanisms for this, such as some way to manipulate
    the list of isolated CPUs on a running system. They can do without this
    present cpu_exclusive based mechanism while we develop an alternative.

    There is a real risk, to the best of my understanding, of users
    accidentally setting up partitioned scheduler domains, inhibiting desired
    load balancing across all their CPUs, due to the nonobvious (from the
    cpuset perspective) side effects of the cpu_exclusive flag.

    Furthermore, since there was no way on a running system to see what one was
    doing with sched domains, this change will be invisible to any using code.
    Unless they have real insight to the scheduler load balancing choices, they
    will be unable to detect that this change has been made in the kernel's
    behaviour.

    Initial discussion on lkml of this patch has generated much comment. My
    (probably controversial) take on that discussion is that it has reached a
    rough consensus that the current cpuset cpu_exclusive mechanism for
    defining sched domains is borked. There is no consensus on the
    replacement. But since we can remove this mechanism, and since its
    continued presence risks causing unwanted partitioning of the scheduler's
    load balancing, we should remove it while we can, as we proceed to work on
    the replacement scheduler domain mechanisms.

    Signed-off-by: Paul Jackson
    Cc: Ingo Molnar
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Dinakar Guniguntala
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Move the definitions of struct mm_struct and struct vm_area_struct to
    include/mm_types.h. This allows defining more functions in asm/pgtable.h
    and friends with inline assembly instead of macros. Compile tested on
    i386, powerpc, powerpc64, s390-32, s390-64 and x86_64.

    [aurelien@aurel32.net: build fix]
    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Aurelien Jarno
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Schwidefsky
     

15 Oct, 2007

27 commits

  • modify account_system_time() to add cputime to cpustat->guest if we are
    running a VCPU. We add this cputime to cpustat->user instead of
    cpustat->system because this part of the KVM code is in fact user code,
    although it is executed in the kernel. We duplicate VCPU time between
    guest and user to allow an unmodified "top(1)" to display correct values.
    A modified "top(1)" is able to display good cpu user time and cpu guest
    time by subtracting cpu guest time from cpu user time. Update "gtime" in
    task_struct accordingly.

    Signed-off-by: Laurent Vivier
    Acked-by: Avi Kivity
    Signed-off-by: Ingo Molnar

    Laurent Vivier
     
  • Like for cpustat, introduce the "gtime" (guest time of the task) and
    "cgtime" (guest time of the task's children) fields for
    tasks. Modify signal_struct and task_struct.

    Modify /proc/<pid>/stat to display these new fields.

    Signed-off-by: Laurent Vivier
    Acked-by: Avi Kivity
    Signed-off-by: Ingo Molnar

    Laurent Vivier
     
  • add new migration statistics when SCHED_DEBUG and SCHEDSTATS
    are enabled. Available in /proc/<pid>/sched.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • reintroduce a simplified version of cache-hot/cold scheduling
    affinity. This improves performance with certain SMP workloads,
    such as sysbench.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Prevent wakeup over-scheduling. Once a task has been preempted by a
    task of the same or lower priority, it becomes ineligible for repeated
    preemption by the same task until it has been ticked or has slept.
    Instead, the task is marked for preemption at the next tick. Tasks of
    higher priority still preempt immediately.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • Add tunables in sysfs to modify a user's cpu share.

    A directory is created in sysfs for each new user in the system.

    /sys/kernel/uids/<uid>/cpu_share

    Reading this file returns the cpu shares granted for the user.
    Writing into this file modifies the cpu share for the user. Only an
    administrator is allowed to modify a user's cpu share.

    Ex:
    # cd /sys/kernel/uids/
    # cat 512/cpu_share
    1024
    # echo 2048 > 512/cpu_share
    # cat 512/cpu_share
    2048
    #

    Signed-off-by: Srivatsa Vaddagiri
    Signed-off-by: Dhaval Giani
    Signed-off-by: Ingo Molnar

    Dhaval Giani
     
  • cleanup: rename task_grp to task_group. No need to save two characters
    and 'grp' is annoying to read.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Here's another piece of low-hanging obsolete fruit.

    Remove obsolete TASK_NONINTERACTIVE.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • mark scheduling classes as const. This speeds up the code
    a bit and shrinks it:

    text data bss dec hex filename
    40027 4018 292 44337 ad31 sched.o.before
    40190 3842 292 44324 ad24 sched.o.after

    Signed-off-by: Ingo Molnar
    Reviewed-by: Thomas Gleixner

    Ingo Molnar
     
  • speed up and simplify vslice calculations.

    [ From: Mike Galbraith : build fix ]

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • rename all 'cnt' fields and variables to the less yucky 'count' name.

    yuckage noticed by Andrew Morton.

    no change in code, other than the /proc/sched_debug bkl_count string got
    a bit larger:

    text data bss dec hex filename
    38236 3506 24 41766 a326 sched.o.before
    38240 3506 24 41770 a32a sched.o.after

    Signed-off-by: Ingo Molnar
    Reviewed-by: Thomas Gleixner

    Ingo Molnar
     
  • undo some of the recent changes that are not needed after all,
    such as last_min_vruntime.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra

    Ingo Molnar
     
  • add vslice: the load-dependent "virtual slice" a task should
    run ideally, so that the observed latency stays within the
    sched_latency window.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra
    Reviewed-by: Thomas Gleixner

    Peter Zijlstra
     
  • remove unneeded tunables.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra
    Reviewed-by: Thomas Gleixner

    Ingo Molnar
     
  • add per task and per rq BKL usage statistics.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra
    Reviewed-by: Thomas Gleixner

    Ingo Molnar
     
  • Enable user-id based fair group scheduling. This is useful for anyone
    who wants to test the group scheduler w/o having to enable
    CONFIG_CGROUPS.

    A separate scheduling group (i.e. struct task_grp) is automatically created
    for every new user added to the system. Upon uid change for a task, it is
    moved to the corresponding scheduling group.

    A /proc tunable (/proc/root_user_share) is also provided to tune root
    user's quota of cpu bandwidth.

    Signed-off-by: Srivatsa Vaddagiri
    Signed-off-by: Dhaval Giani
    Signed-off-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra
    Reviewed-by: Thomas Gleixner

    Srivatsa Vaddagiri
     
  • With the view of supporting user-id based fair scheduling (and not just
    container-based fair scheduling), this patch renames several functions
    and makes them independent of whether they are being used for container
    or user-id based fair scheduling.

    Also fix a problem reported by KAMEZAWA Hiroyuki (wrt allocating an
    undersized array for tg->cfs_rq[] and tg->se[]).

    Signed-off-by: Srivatsa Vaddagiri
    Signed-off-by: Dhaval Giani
    Signed-off-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra
    Reviewed-by: Thomas Gleixner

    Srivatsa Vaddagiri
     
  • Revert the removal of set_curr_task.
    Use put_prev_task/set_curr_task when changing groups/policies.

    Signed-off-by: Srivatsa Vaddagiri
    Signed-off-by: Dhaval Giani
    Signed-off-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra

    Srivatsa Vaddagiri
     
  • rework enqueue/dequeue_entity() to get rid of
    sched_class::set_curr_task(). This simplifies sched_setscheduler(),
    rt_mutex_setprio() and sched_move_tasks().

    text data bss dec hex filename
    24330 2734 20 27084 69cc sched.o.before
    24233 2730 20 26983 6967 sched.o.after

    Signed-off-by: Dmitry Adamushko
    Signed-off-by: Srivatsa Vaddagiri
    Signed-off-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra
    Reviewed-by: Thomas Gleixner

    Dmitry Adamushko
     
  • the 'p' (task_struct) parameter in sched_class::yield_task() is
    redundant, as the caller is always 'current'. Get rid of it.

    text data bss dec hex filename
    24341 2734 20 27095 69d7 sched.o.before
    24330 2734 20 27084 69cc sched.o.after

    Signed-off-by: Dmitry Adamushko
    Signed-off-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra
    Reviewed-by: Thomas Gleixner

    Dmitry Adamushko
     
  • Get rid of 'sched_entity::fair_key'.

    As a side effect, 'current' is not kept within the tree for
    SCHED_NORMAL/BATCH tasks anymore. This simplifies some parts of code
    (e.g. entity_tick() and yield_task_fair()) and also somewhat optimizes
    them (e.g. a single update_curr() now vs. dequeue/enqueue() before in
    entity_tick()).

    Signed-off-by: Dmitry Adamushko
    Signed-off-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra
    Reviewed-by: Thomas Gleixner

    Dmitry Adamushko
     
  • remove wait_runtime based fields and features, now that the CFS
    math has been changed over to the vruntime metric.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mike Galbraith
    Reviewed-by: Thomas Gleixner

    Ingo Molnar
     
  • remove the wait_runtime-limit fields and the code depending on it, now
    that the math has been changed over to rely on the vruntime metric.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mike Galbraith
    Reviewed-by: Thomas Gleixner

    Ingo Molnar
     
  • introduce se->vruntime as a sum of weighted delta-exec's, and use that
    as the key into the tree.

    the idea of using absolute virtual time as the basic metric of scheduling
    was first raised by William Lee Irwin, advanced by Tong Li and first
    prototyped by Roman Zippel in the "Really Fair Scheduler" (RFS) patchset.

    also see:

    http://lkml.org/lkml/2007/9/2/76

    for a simpler variant of this patch.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mike Galbraith
    Reviewed-by: Thomas Gleixner

    Ingo Molnar
     
  • remove the stat_gran code - it was disabled by default and it causes
    unnecessary overhead.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mike Galbraith
    Reviewed-by: Thomas Gleixner

    Ingo Molnar
     
  • use constants if !CONFIG_SCHED_DEBUG.

    this speeds up the code and reduces code-size:

    text data bss dec hex filename
    27464 3014 16 30494 771e sched.o.before
    26929 3010 20 29959 7507 sched.o.after

    Signed-off-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mike Galbraith
    Reviewed-by: Thomas Gleixner

    Ingo Molnar
     
  • track the maximum amount of time a task has executed while
    the CPU load was at least 2x (i.e. at least two nice-0
    tasks were runnable).

    Signed-off-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mike Galbraith
    Reviewed-by: Thomas Gleixner

    Ingo Molnar