28 Jun, 2005

1 commit

  • This updates the CFQ io scheduler to the new time sliced design (cfq
    v3). It provides full process fairness, while giving excellent
    aggregate system throughput even for many competing processes. It
    supports io priorities, either inherited from the cpu nice value or set
    directly with the ioprio_get/set syscalls. The latter closely mimic
    set/getpriority.

    This import is based on my latest from -mm.

    Signed-off-by: Jens Axboe
    Signed-off-by: Linus Torvalds

    Jens Axboe
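
As a rough userspace sketch of the io priority interface described above: an io priority is a scheduling class plus a level packed into one integer, and ioprio_set takes (which, who, value) much like setpriority. The helper names below are hypothetical. The shift of 13, the "process" selector value of 1, and the (nice + 20) / 5 mapping are assumptions taken from the kernel's ioprio header as I understand it, not from this commit message.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sys/syscall.h>
#include <unistd.h>

/* Assumed layout: class in the bits above bit 13, level in the low 13 bits. */
#define SKETCH_CLASS_SHIFT 13
#define SKETCH_WHO_PROCESS 1   /* assumed value of the "process" selector */

/* Pack a class (e.g. 2 for best-effort) and a level (0..7) into one value. */
int ioprio_pack(int cls, int level)
{
    return (cls << SKETCH_CLASS_SHIFT) | level;
}

int ioprio_unpack_class(int value)
{
    return value >> SKETCH_CLASS_SHIFT;
}

int ioprio_unpack_level(int value)
{
    return value & ((1 << SKETCH_CLASS_SHIFT) - 1);
}

/* Assumed mapping used when no io priority was set explicitly:
 * nice -20..19 maps onto levels 0..7. */
int nice_to_ioprio_level(int nice)
{
    return (nice + 20) / 5;
}

/* Set the io priority of the calling process (who == 0 means "self").
 * Returns the raw syscall result, or -1 if the syscall number is unknown. */
long set_own_ioprio(int cls, int level)
{
#ifdef SYS_ioprio_set
    return syscall(SYS_ioprio_set, SKETCH_WHO_PROCESS, 0,
                   ioprio_pack(cls, level));
#else
    (void)cls;
    (void)level;
    return -1;
#endif
}
```

A caller wanting best-effort level 4 would use set_own_ioprio(2, 4) and check the return value, since the call can fail on kernels or sandboxes without ioprio support.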
     

26 Jun, 2005

39 commits

  • Linus Torvalds
     
  • 1. Establish a simple API for process freezing defined in include/linux/sched.h:

    frozen(process)          Check for frozen process
    freezing(process)        Check if a process is being frozen
    freeze(process)          Tell a process to freeze (go to refrigerator)
    thaw_process(process)    Restart process
    frozen_process(process)  Process is frozen now

    2. Remove all references to PF_FREEZE and PF_FROZEN from all
    kernel sources except sched.h

    3. Fix numerous locations where try_to_freeze is manually done by a driver

    4. Remove the argument that is no longer necessary from two function calls.

    5. Some whitespace cleanup

    6. Close a potential race in refrigerator: there was an open window in which
    PF_FREEZE had been cleared but PF_FROZEN was not yet set, and
    recalc_sigpending does not check PF_FROZEN.

    This patch does not address the problem of freeze_processes() violating the rule
    that a task may only modify its own flags by setting PF_FREEZE. This is not clean
    in an SMP environment. freeze(process) is therefore not SMP safe!

    Signed-off-by: Christoph Lameter
    Signed-off-by: Linus Torvalds

    Christoph Lameter
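
The five helpers listed above can be modelled in userspace roughly as follows. This is only a sketch under assumptions: struct task_model and the SKETCH_PF_* bit values are placeholders, not the kernel's task_struct or its real flag values, and the real thaw_process also wakes the task. The point of interest is frozen_process(), which swaps the two flags in a single store so there is no window with neither flag set.

```c
/* Placeholder flag bits; the kernel's actual values differ. */
#define SKETCH_PF_FREEZE 0x1u  /* task has been asked to freeze */
#define SKETCH_PF_FROZEN 0x2u  /* task is in the refrigerator   */

struct task_model {
    unsigned int flags;
};

/* Check for a frozen task. */
int frozen(const struct task_model *p)
{
    return (p->flags & SKETCH_PF_FROZEN) != 0;
}

/* Check whether a task is being frozen. */
int freezing(const struct task_model *p)
{
    return (p->flags & SKETCH_PF_FREEZE) != 0;
}

/* Tell a task to freeze. */
void freeze(struct task_model *p)
{
    p->flags |= SKETCH_PF_FREEZE;
}

/* Mark the task frozen: clear FREEZE and set FROZEN in one assignment. */
void frozen_process(struct task_model *p)
{
    p->flags = (p->flags & ~SKETCH_PF_FREEZE) | SKETCH_PF_FROZEN;
}

/* Restart a frozen task; returns 1 if it was frozen, 0 otherwise.
 * (The wake-up a real kernel would perform is omitted here.) */
int thaw_process(struct task_model *p)
{
    if (!frozen(p))
        return 0;
    p->flags &= ~SKETCH_PF_FROZEN;
    return 1;
}
```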
     
  • This patch makes use of ALIGN() to remove duplicate round-up code.

    Signed-off-by: Nick Wilson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Wilson
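
For readers unfamiliar with the idiom, here is a small sketch of what an ALIGN()-style macro replaces. ALIGN_UP and the two function names are hypothetical names chosen for this example. The mask form assumes the alignment is a power of two.

```c
#include <stddef.h>

/* Round x up to the next multiple of a; a must be a power of two. */
#define ALIGN_UP(x, a) (((x) + ((a) - 1)) & ~((a) - 1))

/* The kind of open-coded round-up that such a macro replaces. */
size_t round_up_by_hand(size_t x, size_t a)
{
    return ((x + a - 1) / a) * a;
}

/* The same computation expressed through the macro. */
size_t round_up_with_macro(size_t x, size_t a)
{
    return ALIGN_UP(x, a);
}
```

Both functions agree whenever a is a power of two; only the division form is valid for other alignments.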
     
  • Signed-off-by: Jesper Juhl
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Juhl
     
  • The comment for msleep_interruptible() is wrong, as it will ignore
    wait-queue events, but will wake up early for signals.

    Signed-off-by: Nishanth Aravamudan
    Signed-off-by: Domen Puncer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Domen Puncer
     
  • o The following patch makes purely cosmetic changes and fixes CodingStyle
    guideline violations such as the following in kexec-related files:

    o braces for one-line "if" statements and "for" loops,
    o lines wider than 80 columns,
    o no space after the "while", "for" and "switch" keywords

    o Changes:
    o take-2: Removed the extra tab before "case" keywords.
    o take-3: Put operator at the end of line and space before "*/"

    Signed-off-by: Maneesh Soni
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Maneesh Soni
     
  • Makes kexec_crashdump() take a pt_regs * as an argument. This makes it
    possible to get the exact register state at the point of the crash. If we
    come from a direct panic assertion, NULL is passed and the current
    registers are saved before the crashdump.

    This hooks into two places:
    die(): check the conditions under which we will panic when calling
    do_exit and go there directly with the pt_regs that caused the fatal
    fault.

    die_nmi(): If we receive an NMI lockup while in the kernel use the
    pt_regs and go directly to crash_kexec(). We're probably nested up badly
    at this point so this might be the only chance to escape with proper
    information.

    Signed-off-by: Alexander Nyberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Nyberg
     
  • From: "Vivek Goyal"

    o Support for /proc/vmcore interface. This interface exports elf core image
    either in ELF32 or ELF64 format, depending on the format in which elf headers
    have been stored by crashed kernel.
    o Added support for CONFIG_VMCORE config option.
    o Removed the dependency on /proc/kcore.

    From: "Eric W. Biederman"

    This patch has been refactored to more closely match the prevailing style in
    the affected files. And to clearly indicate the dependency between
    /proc/kcore and proc/vmcore.c

    From: Hariprasad Nellitheertha

    This patch contains the code that provides an ELF format interface to the
    previous kernel's memory post kexec reboot.

    Signed-off-by: Hariprasad Nellitheertha
    Signed-off-by: Eric Biederman
    Signed-off-by: Vivek Goyal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vivek Goyal
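
A reader of a core image such as the one exported here can tell ELF32 from ELF64 by looking at the identification bytes at the start of the file. The following is a userspace sketch using the standard <elf.h> constants; elf_class_bits is a hypothetical helper name, and nothing here is taken from the kernel's vmcore implementation.

```c
#include <elf.h>
#include <stddef.h>
#include <string.h>

/* Inspect the first bytes of an ELF image.
 * Returns 32 or 64 for a valid ELF32/ELF64 header, 0 otherwise. */
int elf_class_bits(const unsigned char *buf, size_t len)
{
    /* Need the full identification array and the "\177ELF" magic. */
    if (len < EI_NIDENT || memcmp(buf, ELFMAG, SELFMAG) != 0)
        return 0;

    switch (buf[EI_CLASS]) {
    case ELFCLASS32:
        return 32;
    case ELFCLASS64:
        return 64;
    default:
        return 0;
    }
}
```

A tool would typically read the first EI_NIDENT bytes of the file into a buffer and pass them to this function before deciding which header struct to parse.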
     
  • This patch adds support for retrieving the address of elf core header if one
    is passed in command line.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vivek Goyal
     
  • This patch provides the interfaces necessary to read the dump contents,
    treating it as a high memory device.

    Signed-off-by: Hariprasad Nellitheertha
    Signed-off-by: Eric Biederman
    Signed-off-by: Vivek Goyal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vivek Goyal
     
  • o Following patch exports kexec global variable "crash_notes" to user space
    through sysfs as kernel attribute in /sys/kernel.

    Signed-off-by: Maneesh Soni
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vivek Goyal
     
  • This is a minor bug fix in kexec to resolve the problem of loading panic
    kernel with initrd.

    o Problem: Loading a capture kernel fails if initrd is also being loaded.
    This has been observed for vmlinux image for kexec on panic case.

    o This patch fixes the problem. In segment location and size verification
    logic, minor correction has been done. Segment memory end (mend) should be
    mstart + memsz - 1. This one byte offset was source of failure for initrd
    loading which was being loaded at hole boundary.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vivek Goyal
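
The off-by-one described above can be illustrated with a small bounds check. This is a sketch, not the kernel's verification code: segment_fits and its parameters are hypothetical, and the reserved region is assumed to be given with an inclusive end address.

```c
#include <stdint.h>

/* Return 1 if the segment [mstart, mstart + memsz) lies inside the
 * region [region_start, region_end], where region_end is inclusive. */
int segment_fits(uint64_t mstart, uint64_t memsz,
                 uint64_t region_start, uint64_t region_end)
{
    uint64_t mend;

    if (memsz == 0)
        return 0;

    /* Address of the last byte of the segment. Using mstart + memsz here
     * would point one past the end and wrongly reject a segment that ends
     * exactly at the region boundary. */
    mend = mstart + memsz - 1;
    if (mend < mstart)          /* address arithmetic wrapped around */
        return 0;

    return mstart >= region_start && mend <= region_end;
}
```

With a region ending at 0x1fff, a 0x1000-byte segment at 0x1000 is accepted, while the same segment with one extra byte is rejected.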
     
  • This patch introduces the architecture independent implementation of the
    sys_kexec_load and compat_sys_kexec_load system calls.

    Kexec on panic support has been integrated into the core patch and is
    relatively clean.

    In addition the hopefully architecture independent option
    crashkernel=size@location has been documented. Its purpose is to reserve
    space for the panic kernel to live in, which no DMA transfer will ever be
    set up to access.

    Signed-off-by: Eric Biederman
    Signed-off-by: Alexander Nyberg
    Signed-off-by: Adrian Bunk
    Signed-off-by: Vivek Goyal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
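
To make the size@location syntax concrete, here is a userspace sketch of a parser for such a value. It is an assumption-laden illustration: parse_size and parse_crashkernel are hypothetical names, and the K/M/G suffix handling is modelled on how kernel command-line sizes are commonly written rather than copied from the kernel.

```c
#include <stdint.h>
#include <stdlib.h>

/* Parse a number with an optional K, M or G suffix.
 * *end is left pointing just past what was consumed. */
static uint64_t parse_size(const char *s, char **end)
{
    uint64_t v = strtoull(s, end, 0);

    if (*end == s)              /* no digits at all */
        return 0;

    switch (**end) {
    case 'G': case 'g':
        v <<= 10;
        /* fall through */
    case 'M': case 'm':
        v <<= 10;
        /* fall through */
    case 'K': case 'k':
        v <<= 10;
        (*end)++;
        break;
    default:
        break;
    }
    return v;
}

/* Parse the value part of "crashkernel=size@location".
 * Returns 0 on success, -1 if the string is malformed. */
int parse_crashkernel(const char *arg, uint64_t *size, uint64_t *base)
{
    char *end;
    const char *loc;

    *size = parse_size(arg, &end);
    if (end == arg || *end != '@')
        return -1;

    loc = end + 1;
    *base = parse_size(loc, &end);
    if (end == loc || *end != '\0')
        return -1;

    return 0;
}
```

For example, the string "64M@16M" asks for 64 MiB reserved starting at the 16 MiB mark.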
     
  • This patch adds a new preemption model: 'Voluntary Kernel Preemption'. The
    3 models can be selected from a new menu:

    (X) No Forced Preemption (Server)
    ( ) Voluntary Kernel Preemption (Desktop)
    ( ) Preemptible Kernel (Low-Latency Desktop)

    we still default to the stock (Server) preemption model.

    Voluntary preemption works by adding a cond_resched()
    (reschedule-if-needed) call to every might_sleep() check. It is lighter
    than CONFIG_PREEMPT - at the cost of not having as tight latencies. It
    represents a different latency/complexity/overhead tradeoff.

    It has no runtime impact at all if disabled. Here are size stats that show
    how the various preemption models impact the kernel's size:

       text    data     bss      dec     hex  filename
    3618774  547184  179896  4345854  424ffe  vmlinux.stock
    3626406  547184  179896  4353486  426dce  vmlinux.voluntary  +0.2%
    3748414  548640  179896  4476950  445016  vmlinux.preempt    +3.5%

    voluntary-preempt is +0.2% of .text, preempt is +3.5%.

    This feature has been tested for many months by lots of people (and it's
    also included in the RHEL4 distribution and earlier variants were in Fedora
    as well), and it's intended for users and distributions who don't want to
    use full-blown CONFIG_PREEMPT for one reason or another.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
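
The mechanism described above, a reschedule-if-needed call attached to every might_sleep() check, can be modelled with a toy state machine. Everything below is hypothetical: struct cpu_model and the *_model functions stand in for kernel internals and only count how often a context switch would occur.

```c
/* Toy per-CPU state: a pending-reschedule flag and a switch counter. */
struct cpu_model {
    int need_resched;
    int switches;
};

/* Stand-in for schedule(): consume the request and count the switch. */
void schedule_model(struct cpu_model *c)
{
    c->need_resched = 0;
    c->switches++;
}

/* Reschedule only if a reschedule is pending; returns 1 if it did. */
int cond_resched_model(struct cpu_model *c)
{
    if (!c->need_resched)
        return 0;
    schedule_model(c);
    return 1;
}

/* With voluntary preemption enabled, each might_sleep() site doubles as
 * a preemption point. When disabled, it does nothing in this model. */
void might_sleep_model(struct cpu_model *c, int voluntary_preempt)
{
    if (voluntary_preempt)
        cond_resched_model(c);
}
```

The model shows why the option is cheap: the extra work at each check is one flag test unless a reschedule is actually pending.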
     
  • The only sane way to clean up the current 3 lock_kernel() variants seems to
    be to remove the spinlock-based BKL implementations altogether, and to keep
    the semaphore-based one only. If we don't want to do that for whatever
    reason then i'm afraid we have to live with the current complexity. (but
    i'm open to other cleanup suggestions as well.)

    To explore this possibility we'll (at a minimum) have to know whether the
    semaphore-based BKL works fine on plain SMP too. The patch below enables
    this.

    The patch may make sense in isolation as well, as it might bring
    performance benefits: code that would formerly spin on the BKL spinlock
    will now schedule away and give up the CPU. It might introduce performance
    regressions as well, if any performance-critical code uses the BKL heavily
    and gets overscheduled due to the semaphore. I very much hope there is no
    such performance-critical codepath left though.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • This patch consolidates the CONFIG_PREEMPT and CONFIG_PREEMPT_BKL
    preemption options into kernel/Kconfig.preempt. This, besides reducing
    source-code, also enables more centralized tweaking of preemption related
    options.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • Adds the core update_cpu_domains code and updated cpusets documentation

    Signed-off-by: Dinakar Guniguntala
    Acked-by: Paul Jackson
    Acked-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dinakar Guniguntala
     
  • The following patches add dynamic sched domains functionality that was
    extensively discussed on lkml and lse-tech. I would like to see this added to
    -mm

    o The main advantage with this feature is that it ensures that the scheduler
    load balancing code only balances against the cpus that are in the sched
    domain as defined by an exclusive cpuset and not all of the cpus in the
    system. This removes any overhead due to load balancing code trying to
    pull tasks outside of the cpu exclusive cpuset only to be prevented by
    the tasks' cpus_allowed mask.
    o cpu exclusive cpusets are useful for servers running orthogonal
    workloads such as RT applications requiring low latency and HPC
    applications that are throughput sensitive

    o It provides a new API partition_sched_domains in sched.c
    that makes dynamic sched domains possible.
    o cpu_exclusive cpusets are now associated with a sched domain,
    which means that users can dynamically modify the sched domains
    through the cpuset file system interface
    o ia64 sched domain code has been updated to support this feature as well
    o Currently, this does not support hotplug. (However some of my tests
    indicate hotplug+preempt is currently broken)
    o I have tested it extensively on x86.
    o This should have very minimal impact on performance as none of
    the fast paths are affected

    Signed-off-by: Dinakar Guniguntala
    Acked-by: Paul Jackson
    Acked-by: Nick Piggin
    Acked-by: Matthew Dobson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dinakar Guniguntala
     
  • Presently, a process without the capability CAP_SYS_NICE can not change
    its own policy, which is OK.

    But it can also not decrease its RT priority (if scheduled with policy
    SCHED_RR or SCHED_FIFO), which is what this patch changes.

    The rationale is the same as for the nice value: a process should be
    able to require less priority for itself. Increasing the priority is
    still not allowed.

    This is for example useful if you give a multithreaded user process a RT
    priority, and the process would like to organize its internal threads
    using priorities also. Then you can give the process the highest
    priority needed N, and the process starts its threads with lower
    priorities: N-1, N-2...

    The POSIX norm says that the permissions are implementation specific, so
    I think we can do that.

    In a sense, it makes the permissions consistent whatever the policy is:
    with this patch, process scheduled by SCHED_FIFO, SCHED_RR and
    SCHED_OTHER can all decrease their priority.

    From: Ingo Molnar

    cleaned up and merged to -mm.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Olivier Croquette
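
The permission rule described above can be written down as a small predicate. This is a sketch of the rule as stated in the commit message, not the kernel's code: may_setscheduler, the POLICY_* enum and the flat parameter list are all hypothetical.

```c
#include <errno.h>

/* Hypothetical policy identifiers for this sketch. */
enum policy_model { POLICY_OTHER, POLICY_FIFO, POLICY_RR };

/* Decide whether a scheduling change is permitted.
 * Returns 0 if allowed, -EPERM if not. */
int may_setscheduler(int has_cap_sys_nice,
                     int cur_policy, int cur_rt_prio,
                     int new_policy, int new_rt_prio)
{
    /* Privileged callers may do anything. */
    if (has_cap_sys_nice)
        return 0;

    /* Unprivileged callers cannot change the policy. */
    if (new_policy != cur_policy)
        return -EPERM;

    /* Under an RT policy they may lower, but not raise, the priority. */
    if (new_policy != POLICY_OTHER && new_rt_prio > cur_rt_prio)
        return -EPERM;

    return 0;
}
```

This matches the use case in the message: a process started at RT priority N can hand out N-1, N-2 and so on to its threads without extra privileges.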
     
  • micro-optimize task requeueing in schedule() & clean up recalc_task_prio().

    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Shang
     
  • The maximum rebalance interval allowed by the multiprocessor balancing
    backoff is often not large enough to handle corner cases where there are
    lots of tasks pinned on a CPU. Suresh reported:

    I see system livelock's if for example I have 7000 processes
    pinned onto one cpu (this is on the fastest 8-way system I
    have access to).

    After this patch, the machine is reported to go well above this number.

    Signed-off-by: Nick Piggin
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Consolidate balance-on-exec with balance-on-fork. This is made easy by the
    sched-domains RCU patches.

    As well as the general goodness of code reduction, this allows the runqueues
    to be unlocked during balance-on-fork.

    schedstats is a problem. Maybe just have balance-on-event instead of
    distinguishing fork and exec?

    Signed-off-by: Nick Piggin
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • One of the problems with the multilevel balance-on-fork/exec is that it needs
    to jump through hoops to satisfy sched-domain's locking semantics (that is,
    you may traverse your own domain when not preemptable, and you may traverse
    others' domains when holding their runqueue lock).

    balance-on-exec had to potentially migrate between more than one CPU before
    finding a final CPU to migrate to, and balance-on-fork needed to potentially
    take multiple runqueue locks.

    So bite the bullet and make sched-domains go completely RCU. This actually
    simplifies the code quite a bit.

    From: Ingo Molnar

    schedstats RCU fix, and a nice comment on for_each_domain, from Ingo.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Nick Piggin
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • The fundamental problem that Suresh has with balance on exec and fork is that
    it only tries to balance the top level domain with the flag set.

    This was worked around by removing degenerate domains, but is still a problem
    if people want to start using more complex sched-domains, especially
    multilevel NUMA that ia64 is already using.

    This patch makes balance on fork and exec try balancing over not just the top
    most domain with the flag set, but all the way down the domain tree.

    Signed-off-by: Nick Piggin
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Remove degenerate scheduler domains during the sched-domain init.

    For example on x86_64, we always have NUMA configured in. On Intel EM64T
    systems, the topmost sched domain will be the NUMA domain, with only one
    sched_group in it.

    With fork/exec balancing (Nick's recent fixes in the -mm tree), we always end up
    making wrong decisions because of this topmost domain (as it contains only one
    group and find_idlest_group always returns NULL). We will end up loading the HT
    package completely first, letting active load balancing kick in and correct it.

    In general, this patch also makes sense without Nick's recent fixes in -mm.

    From: Nick Piggin

    Modified to account for more than just sched_groups when scanning for
    degenerate domains by Nick Piggin. And allow a runqueue's sd to go NULL
    rather than keep a single degenerate domain around (this happens when you run
    with maxcpus=1).

    Signed-off-by: Suresh Siddha
    Signed-off-by: Nick Piggin
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Suresh Siddha
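
As a simplified illustration of the idea, the sketch below prunes degenerate levels from a chain of domains linked by parent pointers. It assumes, for the example only, that a domain is degenerate exactly when it has fewer than two groups; the commit notes that the real check accounts for more than that. struct domain_model and attach_domains are hypothetical names.

```c
#include <stddef.h>

/* Toy domain: number of groups plus a link to the enclosing domain. */
struct domain_model {
    int ngroups;
    struct domain_model *parent;
};

/* Simplifying assumption: nothing to balance between fewer than 2 groups. */
static int degenerate(const struct domain_model *sd)
{
    return sd->ngroups < 2;
}

/* Unlink degenerate ancestors, then drop the base domain itself if it is
 * degenerate. Returns the domain to attach, or NULL if none is useful. */
struct domain_model *attach_domains(struct domain_model *sd)
{
    struct domain_model *tmp = sd;

    while (tmp && tmp->parent) {
        if (degenerate(tmp->parent))
            tmp->parent = tmp->parent->parent;  /* splice the parent out */
        else
            tmp = tmp->parent;
    }

    if (sd && degenerate(sd))
        sd = sd->parent;

    return sd;
}
```

Returning NULL corresponds to the case mentioned above where a runqueue ends up with no domain at all, for instance on a single-CPU boot.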
     
  • Fix the last 2 places that directly access a runqueue's sched-domain and
    assume it cannot be NULL.

    That allows the use of NULL for domain, instead of a dummy domain, to signify
    no balancing is to happen. No functional changes.

    Signed-off-by: Nick Piggin
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Instead of requiring architecture code to interact with the scheduler's
    locking implementation, provide a couple of defines that can be used by the
    architecture to request runqueue unlocked context switches, and ask for
    interrupts to be enabled over the context switch.

    Also replaces the "switch_lock" used by these architectures with an oncpu
    flag (note, not a potentially slow bitflag). This eliminates one bus
    locked memory operation when context switching, and simplifies the
    task_running function.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • "Chen, Kenneth W"

    uninline task_timeslice() - reduces code footprint noticeably, and it's
    slowpath code.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • Add SCHEDSTAT statistics for sched-balance-fork.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Reimplement the balance on exec balancing to be sched-domains aware. Use this
    to also do balance on fork balancing. Make x86_64 do balance on fork over the
    NUMA domain.

    The problem with the non-sched-domains-aware balancing became apparent on
    dual core, multi socket opterons. What we want is for the new tasks to be sent to
    a different socket, but more often than not, we would first load up our
    sibling core, or fill two cores of a single remote socket before selecting a
    new one.

    This gives large improvements to STREAM on such systems.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Remove the very aggressive idle stuff that has recently gone into 2.6 - it is
    going against the direction we are trying to go. Hopefully we can regain
    performance through other methods.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Do fewer affine wakeups. We're trying to reduce dbt2-pgsql idle time
    regressions here... make sure we don't move tasks the wrong way in an
    imbalance condition. Also, remove the cache coldness requirement from the
    calculation - this seems to induce sharp cutoff points where behaviour will
    suddenly change on some workloads if the load creeps slightly over or under
    some point. It is good for periodic balancing because in that case we
    otherwise have no other context to determine what task to move.

    But also make a minor tweak to "wake balancing" - the imbalance tolerance is
    now set at half the domain's imbalance, so we get the opportunity to do wake
    balancing before the more random periodic rebalancing gets performed.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Do CPU load averaging over a number of different intervals. Allow each
    interval to be chosen by sending a parameter to source_load and target_load.
    0 is instantaneous, idx > 0 returns a decaying average with the most recent
    sample weighted at 2^(idx-1). idx can be at most 3 (this could easily be
    increased).

    So generally a higher number will result in more conservative balancing.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
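
One way to read the description above is that each index keeps an exponentially decaying average in which the newest sample contributes a fraction 1/2^(idx-1). The sketch below implements that reading in userspace. It is an interpretation, not the kernel's code: struct rq_model, update_cpu_load_model and load_at are hypothetical, and any rounding the kernel applies is left out.

```c
#define LOAD_IDX_MAX 3

/* Toy runqueue: one decaying average per index 1..LOAD_IDX_MAX. */
struct rq_model {
    unsigned long cpu_load[LOAD_IDX_MAX];
};

/* Fold the current load into every average. Slot i uses scale 2^i, so the
 * new sample weighs 1/scale and the history weighs (scale - 1)/scale. */
void update_cpu_load_model(struct rq_model *rq, unsigned long now)
{
    for (int i = 0; i < LOAD_IDX_MAX; i++) {
        unsigned long scale = 1UL << i;

        rq->cpu_load[i] = (rq->cpu_load[i] * (scale - 1) + now) / scale;
    }
}

/* idx 0 returns the instantaneous load, idx 1..3 the stored averages. */
unsigned long load_at(const struct rq_model *rq, unsigned long now, int idx)
{
    if (idx <= 0)
        return now;
    if (idx > LOAD_IDX_MAX)
        idx = LOAD_IDX_MAX;
    return rq->cpu_load[idx - 1];
}
```

Starting from an average of 100 everywhere and feeding in a sample of 200, the higher indices move less, which is the sense in which a higher number gives more conservative balancing.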
     
  • Remove the special casing for idle CPU balancing. Things like this are
    hurting for example on SMT, where a single sibling being idle doesn't really
    warrant a really aggressive pull over the NUMA domain, for example.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • These conditions should now be impossible, and we need to fix them if they
    happen.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • SMT balancing has a couple of problems. Firstly, active_load_balance is too
    complex - basically it should be a dumb helper for when the periodic balancer
    has determined there is an imbalance, but gets stuck because the task is
    running.

    So rip out all its "smarts", and just make it move one task to the target CPU.

    Second, the busy CPU's sched-domain tree was being used for active balancing.
    This means that it may not see that nr_balance_failed has reached a critical
    level. So use the target CPU's sched-domain tree for this. We can do this
    because we hold its runqueue lock.

    Lastly, reset nr_balance_failed to a point where we allow cache hot migration.
    This will help ensure active load balancing is successful.

    Thanks to Suresh Siddha for pointing out these issues.

    Signed-off-by: Nick Piggin
    Signed-off-by: Suresh Siddha
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Fix up active load balancing a bit so it doesn't get called when it shouldn't.
    Reset the nr_balance_failed counter at more points where we have found
    conditions to be balanced. This reduces too aggressive active balancing seen
    on some workloads.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • John Hawkes explained the problem best:

    A large number of processes that are pinned to a single CPU results
    in every other CPU's load_balance() seeing this overloaded CPU as
    "busiest", yet move_tasks() never finds a task to pull-migrate. This
    condition occurs during module unload, but can also occur as a
    denial-of-service using sys_sched_setaffinity(). Several hundred
    CPUs performing this fruitless load_balance() will livelock on the
    busiest CPU's runqueue lock. A smaller number of CPUs will livelock
    if the pinned task count gets high.

    Expanding slightly on John's patch, this one attempts to work out whether the
    balancing failure has been due to too many tasks pinned on the runqueue. This
    allows it to be basically invisible to the regular balancing paths (i.e. when
    there are no pinned tasks). We can use this extra knowledge to shut down the
    balancing faster, and ensure the migration threads don't start running, which
    is another problem observed in the wild.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
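
The detection idea can be sketched as a scan that reports, alongside the number of tasks moved, whether every candidate was rejected because of its CPU affinity. This is a toy model with hypothetical names (struct task_model, move_tasks_model). The real balancer applies further criteria, such as whether a task is currently running or cache hot, that are ignored here.

```c
/* Toy task: a bitmask of CPUs it is allowed to run on. */
struct task_model {
    unsigned long cpus_allowed;
};

/* Try to pull up to max_move tasks onto this_cpu.
 * Sets *all_pinned to 1 if no task on the queue may run on this_cpu
 * (also the case for an empty queue), 0 otherwise.
 * Returns the number of tasks that would be moved. */
int move_tasks_model(const struct task_model *tasks, int ntasks,
                     int this_cpu, int max_move, int *all_pinned)
{
    int moved = 0;

    *all_pinned = 1;
    for (int i = 0; i < ntasks && moved < max_move; i++) {
        if (!(tasks[i].cpus_allowed & (1UL << this_cpu)))
            continue;           /* pinned elsewhere, cannot migrate */
        *all_pinned = 0;
        moved++;
    }
    return moved;
}
```

A caller that sees zero tasks moved with all_pinned set knows the failure is structural and can back off instead of retrying on the same busy runqueue.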
     
  • New sched-domains code means we don't get spans with offline CPUs in
    them.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin