02 Feb, 2006

9 commits


01 Feb, 2006

5 commits


19 Jan, 2006

4 commits

  • EDAC requires a way to scrub memory if an ECC error is found and the chipset
    does not do the work automatically. That means rewriting memory locations
    atomically with respect to all CPUs _and_ bus masters, which means we can't
    use atomic_add(foo, 0), as it gets optimised for non-SMP builds.

    This adds a function to include/asm-foo/atomic.h for the currently
    supported platforms which implements a scrub of a mapped block.

    It also adjusts the include order in a few other files where atomic.h was
    included before types.h, as this now causes an error because atomic_scrub
    uses u32.

    Signed-off-by: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alan Cox
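
    A rough sketch of what such a scrub primitive might look like on x86 (the
    function lives in include/asm-foo/atomic.h per the commit; the body below
    is an illustrative assumption, not the verbatim patch):

        /* Sketch: scrub a mapped block by atomically rewriting each
         * 32-bit word.  "lock; addl $0" changes no data, but its atomic
         * read-modify-write forces the (corrected) value to be written
         * back safely w.r.t. other CPUs and bus masters. */
        static inline void atomic_scrub(void *va, u32 size)
        {
                u32 i, *p = va;

                for (i = 0; i < size / 4; i++, p++)
                        asm volatile("lock; addl $0,%0" : "+m" (*p));
        }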
     
  • The TIF_RESTORE_SIGMASK flag allows us to have a generic implementation of
    sys_rt_sigsuspend() instead of duplicating it for each architecture. This
    provides such an implementation and makes arch/powerpc use it.

    It also tidies up the ppc32 sys_sigsuspend() to use TIF_RESTORE_SIGMASK.

    Signed-off-by: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Woodhouse
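
    A condensed sketch of the generic implementation's shape (not the verbatim
    patch): save the old mask, install the new one, sleep, and set
    TIF_RESTORE_SIGMASK so the arch signal-delivery code restores
    saved_sigmask on the way back to userspace:

        asmlinkage long sys_rt_sigsuspend(sigset_t __user *unewset,
                                          size_t sigsetsize)
        {
                sigset_t newset;

                if (sigsetsize != sizeof(sigset_t))
                        return -EINVAL;
                if (copy_from_user(&newset, unewset, sizeof(newset)))
                        return -EFAULT;
                sigdelsetmask(&newset, sigmask(SIGKILL) | sigmask(SIGSTOP));

                spin_lock_irq(&current->sighand->siglock);
                current->saved_sigmask = current->blocked; /* keep old mask */
                current->blocked = newset;                 /* block with new */
                recalc_sigpending();
                spin_unlock_irq(&current->sighand->siglock);

                current->state = TASK_INTERRUPTIBLE;
                schedule();                        /* wait for a signal */
                set_thread_flag(TIF_RESTORE_SIGMASK);
                return -ERESTARTNOHAND;
        }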
     
  • Currently, a negative policy argument passed to the 'sys_sched_setscheduler()'
    system call returns success. However, the manpage for 'sched_setscheduler'
    says:

    EINVAL The scheduling policy is not one of the recognized policies, or the
    parameter p does not make sense for the policy.

    Signed-off-by: Jason Baron
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Baron
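
    The fix amounts to rejecting negative policies at the syscall boundary,
    before any internal helper sees them (a sketch; do_sched_setscheduler()
    as the internal helper is an assumption):

        asmlinkage long sys_sched_setscheduler(pid_t pid, int policy,
                                               struct sched_param __user *param)
        {
                /* negative values for policy are not valid */
                if (policy < 0)
                        return -EINVAL;

                /* do_sched_setscheduler() is assumed for illustration */
                return do_sched_setscheduler(pid, policy, param);
        }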
     
  • proc support for zone reclaim

    This patch creates a proc entry, /proc/sys/vm/zone_reclaim_mode, that may
    be used to override the automatic determination of the zone reclaim mode
    made at bootup.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
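
    Wiring up such a knob is a standard sysctl-table entry; a plausible sketch
    (the VM_ZONE_RECLAIM_MODE ctl_name value and placement are assumptions):

        /* mm/vmscan.c: the mode variable the proc entry overrides */
        int zone_reclaim_mode __read_mostly;

        /* kernel/sysctl.c: sketch of an entry in vm_table[] */
        {
                .ctl_name       = VM_ZONE_RECLAIM_MODE, /* assumed enum */
                .procname       = "zone_reclaim_mode",
                .data           = &zone_reclaim_mode,
                .maxlen         = sizeof(zone_reclaim_mode),
                .mode           = 0644,
                .proc_handler   = &proc_dointvec,
        },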
     

17 Jan, 2006

3 commits


16 Jan, 2006

1 commit


15 Jan, 2006

6 commits

  • The problem, reported in:

    http://bugzilla.kernel.org/show_bug.cgi?id=5859

    and by various other email messages and lkml posts is that the cpuset hook
    in the oom (out of memory) code can try to take a cpuset semaphore while
    holding the tasklist_lock (a spinlock).

    One must not sleep while holding a spinlock.

    The fix seems easy enough - move the cpuset semaphore region outside the
    tasklist_lock region.

    This required a few lines of new mechanism. The oom code whose locking
    needs to be changed does not have access to the cpuset locks, which are
    internal to kernel/cpuset.c only. So I provided a couple more cpuset
    interface routines, available to the rest of the kernel, which simply
    take and drop the lock needed here (the cpuset callback_sem).

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
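
    The two interface routines can be trivial wrappers around the internal
    lock (a sketch; the names cpuset_lock()/cpuset_unlock() are assumed from
    the description above):

        /* kernel/cpuset.c sketch: let the oom code take callback_sem
         * *before* it acquires tasklist_lock (a spinlock), so the
         * sleeping lock is never taken inside the spinlock. */
        void cpuset_lock(void)
        {
                down(&callback_sem);
        }

        void cpuset_unlock(void)
        {
                up(&callback_sem);
        }

    The oom killer then brackets its whole tasklist_lock region with
    cpuset_lock()/cpuset_unlock().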
     
  • Remove useless spin_retry_counter and fix compilation for UP kernels.

    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Schwidefsky
     
  • Remove the "inline" keyword from a bunch of big functions in the kernel,
    with the goal of shrinking it by 30kb to 40kb.

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Ingo Molnar
    Acked-by: Jeff Garzik
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arjan van de Ven
     
  • Add a new SCHED_BATCH (3) scheduling policy: such tasks are presumed
    CPU-intensive, and will acquire a constant +5 priority level penalty. Such
    policy is nice for workloads that are non-interactive, but which do not
    want to give up their nice levels. The policy is also useful for workloads
    that want a deterministic scheduling policy without interactivity causing
    extra preemptions (between that workload's tasks).

    Signed-off-by: Ingo Molnar
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
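
    From userspace the policy is selected with sched_setscheduler(); a minimal
    usage sketch (the fallback define covers headers that predate the policy):

        #include <sched.h>
        #include <stdio.h>

        #ifndef SCHED_BATCH
        #define SCHED_BATCH 3           /* value per the commit */
        #endif

        int main(void)
        {
                /* non-RT policies require sched_priority == 0 */
                struct sched_param sp = { .sched_priority = 0 };

                /* pid 0 means the calling process */
                if (sched_setscheduler(0, SCHED_BATCH, &sp))
                        perror("sched_setscheduler");
                return 0;
        }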
     
  • I tried to send the forcedeth maintainer an email, but it came back with:

    "The mail address manfreds@colorfullife.com is not read anymore.
    Please resent your mail to manfred@ instead of manfreds@."

    This patch updates the maintainer address accordingly.

    Signed-off-by: Adrian Bunk

    Christian Kujau
     
  • This patch fixes a typo in the dependencies of SOFTWARE_SUSPEND.

    This patch is based on a report by Jean-Luc Leger.

    Signed-off-by: Adrian Bunk
    Acked-by: Pavel Machek

    Adrian Bunk
     

13 Jan, 2006

3 commits

  • Linus Torvalds
     
  • From: Nick Piggin

    Track the last waker CPU, and only consider wakeup-balancing if there's a
    match between current waker CPU and the previous waker CPU. This ensures
    that there is some correlation between two subsequent wakeup events before
    we move the task. Should help random-wakeup workloads on large SMP
    systems, by reducing the migration attempts by a factor of nr_cpus.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    akpm@osdl.org
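
    The idea in sketch form (last_waker_cpu is an assumed task_struct field;
    this is illustrative, not the verbatim patch):

        /* sketch, inside try_to_wake_up(); this_cpu is the CPU
         * performing the wakeup, p the task being woken */
        if (this_cpu != p->last_waker_cpu) {
                /* no correlation with the previous wakeup yet:
                 * record the waker, skip the wakeup-balancing pass */
                p->last_waker_cpu = this_cpu;
                goto out_set_cpu;       /* illustrative label */
        }
        /* same waker twice in a row: balancing may move the task */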
     
  • From: Ingo Molnar

    This is the latest version of the scheduler cache-hot-auto-tune patch.

    The first problem was that detection time scaled with O(N^2), which is
    unacceptable on larger SMP and NUMA systems. To solve this:

    - I've added a 'domain distance' function, which is used to cache
    measurement results. Each distance is only measured once. This means
    that e.g. on NUMA distances of 0, 1 and 2 might be measured, on HT
    distances 0 and 1, and on SMP distance 0 is measured. The code walks
    the domain tree to determine the distance, so it automatically follows
    whatever hierarchy an architecture sets up. This cuts down on the boot
    time significantly and removes the O(N^2) limit. The only assumption
    is that migration costs can be expressed as a function of domain
    distance - this covers the overwhelming majority of existing systems,
    and is a good guess even for more asymmetric systems.

    [ People hacking systems that have asymmetries that break this
    assumption (e.g. different CPU speeds) should experiment a bit with
    the cpu_distance() function. Adding a ->migration_distance factor to
    the domain structure would be one possible solution - but let's first
    see the problem systems, if they exist at all. Let's not overdesign. ]

    Another problem was that only a single cache-size was used for measuring
    the cost of migration, and most architectures didn't set that variable
    up. Furthermore, a single cache-size does not fit NUMA hierarchies with
    L3 caches and does not fit HT setups, where different CPUs will often
    have different 'effective cache sizes'. To solve this problem:

    - Instead of relying on a single cache-size provided by the platform and
    sticking to it, the code now auto-detects the 'effective migration
    cost' between two measured CPUs, via iterating through a wide range of
    cachesizes. The code searches for the maximum migration cost, which
    occurs when the working set of the test-workload falls just below the
    'effective cache size'. I.e. a real-life optimized search is done for
    the maximum migration cost, between two real CPUs.

    This, amongst other things, has the positive effect that if e.g. two
    CPUs share an L2/L3 cache, a different (and accurate) migration cost
    will be found than between two CPUs on the same system that don't share
    any caches.

    (The reliable measurement of migration costs is tricky - see the source
    for details.)

    Furthermore, I've added various boot-time options to override/tune
    migration behavior.

    Firstly, there's a blanket override for autodetection:

    migration_cost=1000,2000,3000

    will override the depth 0/1/2 values with 1msec/2msec/3msec values.

    Secondly, there's a global factor that can be used to increase (or
    decrease) the autodetected values:

    migration_factor=120

    will increase the autodetected values by 20%. This option is useful to
    tune things in a workload-dependent way - e.g. if a workload is
    cache-insensitive then CPU utilization can be maximized by specifying
    migration_factor=0.

    I've tested the autodetection code quite extensively on x86, on the three
    systems shown below, and the autodetected values look pretty good:

    Dual Celeron (128K L2 cache):

    ---------------------
    migration cost matrix (max_cache_size: 131072, cpu: 467 MHz):
    ---------------------
    [00] [01]
    [00]: - 1.7(1)
    [01]: 1.7(1) -
    ---------------------
    cacheflush times [2]: 0.0 (0) 1.7 (1784008)
    ---------------------

    Here the slow memory subsystem dominates system performance, and even
    though caches are small, the migration cost is 1.7 msecs.

    Dual HT P4 (512K L2 cache):

    ---------------------
    migration cost matrix (max_cache_size: 524288, cpu: 2379 MHz):
    ---------------------
    [00] [01] [02] [03]
    [00]: - 0.4(1) 0.0(0) 0.4(1)
    [01]: 0.4(1) - 0.4(1) 0.0(0)
    [02]: 0.0(0) 0.4(1) - 0.4(1)
    [03]: 0.4(1) 0.0(0) 0.4(1) -
    ---------------------
    cacheflush times [2]: 0.0 (33900) 0.4 (448514)
    ---------------------

    Here it can be seen that there is no migration cost between two HT
    siblings (CPU#0/2 and CPU#1/3 are separate physical CPUs). A fast memory
    system makes inter-physical-CPU migration pretty cheap: 0.4 msecs.

    8-way P3/Xeon [2MB L2 cache]:

    ---------------------
    migration cost matrix (max_cache_size: 2097152, cpu: 700 MHz):
    ---------------------
    [00] [01] [02] [03] [04] [05] [06] [07]
    [00]: - 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1)
    [01]: 19.2(1) - 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1)
    [02]: 19.2(1) 19.2(1) - 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1)
    [03]: 19.2(1) 19.2(1) 19.2(1) - 19.2(1) 19.2(1) 19.2(1) 19.2(1)
    [04]: 19.2(1) 19.2(1) 19.2(1) 19.2(1) - 19.2(1) 19.2(1) 19.2(1)
    [05]: 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) - 19.2(1) 19.2(1)
    [06]: 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) - 19.2(1)
    [07]: 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) -
    ---------------------
    cacheflush times [2]: 0.0 (0) 19.2 (19281756)
    ---------------------

    This one has huge caches and a relatively slow memory subsystem - so the
    migration cost is 19 msecs.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Ashok Raj
    Signed-off-by: Ken Chen
    Cc:
    Signed-off-by: John Hawkes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    akpm@osdl.org
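
    The 'domain distance' function mentioned above can be sketched as a walk
    up one CPU's scheduler-domain tree until a domain's span also covers the
    other CPU (illustrative, not the verbatim patch):

        /* sketch: distance = number of domain levels climbed from cpu1
         * before a domain also spans cpu2; 0 on plain SMP, up to 2 in
         * the NUMA example above */
        static unsigned long cpu_distance(int cpu1, int cpu2)
        {
                unsigned long distance = 0;
                struct sched_domain *sd;

                for_each_domain(cpu1, sd) {
                        if (cpu_isset(cpu2, sd->span))
                                break;
                        distance++;
                }
                return distance;
        }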
     

12 Jan, 2006

9 commits

  • Roman Zippel pointed out that the missing lower limit on intervals
    leads to an accounting error in the overrun count. Enforce the base
    resolution as the lower limit on intervals in the timer forwarding
    code.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
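
    The enforcement itself is a small clamp in the forwarding path (a sketch
    of the relevant lines, using the ktime_t .tv64 scalar form):

        /* sketch, in hrtimer_forward(): never forward by less than the
         * clock base's resolution, or the overrun count comes out wrong */
        if (interval.tv64 < timer->base->resolution.tv64)
                interval.tv64 = timer->base->resolution.tv64;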
     
  • Change the storage format of the per-base resolution to ktime_t to
    make it more easily accessible in the hrtimer code.

    Change the resolution from (NSEC_PER_SEC/HZ) to TICK_NSEC as Roman
    pointed out. TICK_NSEC is closer to the real resolution.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • The list_head in the hrtimer structure was introduced for easy access
    to the first timer, with the further extension to real high-resolution
    timers in mind, but it turned out in the course of development that
    it is not necessary for the standard use case. Remove the list head
    and access the first expiring timer via a data field in the timer base.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
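
    An illustrative fragment of the resulting layout (field names assumed):

        /* sketch: the per-clock timer base caches the earliest-expiring
         * timer directly, instead of finding it via a per-timer
         * list_head */
        struct hrtimer_base {
                struct rb_root  active;  /* rbtree of pending timers */
                struct rb_node  *first;  /* leftmost = next to expire */
                ktime_t         resolution;
        };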
     
  • vSMP specific alignment patch to
    1. Define INTERNODE_CACHE_SHIFT for vSMP
    2. Use this for alignment of critical structures
    3. Use INTERNODE_CACHE_SHIFT for ARCH_MIN_TASKALIGN,
    and let the slab align task_struct allocations to the internode cacheline size
    4. Introduce and use ARCH_MIN_MMSTRUCT_ALIGN for mm_struct slab allocations.

    Signed-off-by: Ravikiran Thirumalai
    Signed-off-by: Shai Fultheim
    Signed-off-by: Andi Kleen
    Signed-off-by: Linus Torvalds

    Ravikiran G Thirumalai
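
    A sketch of the new defines (vSMP's internode coherence unit is a 4K
    page, hence shift 12; the exact header layout is assumed):

        /* include/asm-x86_64/cache.h (sketch) */
        #ifdef CONFIG_X86_VSMP
        /* vSMP's internode coherence unit is a 4K page */
        #define INTERNODE_CACHE_SHIFT   12
        #define INTERNODE_CACHE_BYTES   (1 << INTERNODE_CACHE_SHIFT)

        /* align task_struct/mm_struct slabs to the internode line */
        #define ARCH_MIN_TASKALIGN      INTERNODE_CACHE_BYTES
        #define ARCH_MIN_MMSTRUCT_ALIGN INTERNODE_CACHE_BYTES
        #endif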
     
  • They are referred to often, so avoid potential false sharing for them.

    Signed-off-by: Andi Kleen
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • - Move capable() from sched.h to capability.h;

    - Use <linux/capability.h> where capable() is used
    (in include/, block/, ipc/, kernel/, a few drivers/,
    mm/, security/, & sound/;
    many more drivers/ to go)

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy.Dunlap
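
    Call sites are unchanged apart from the include; a usage sketch
    (frob_device() is illustrative):

        #include <linux/capability.h>   /* was picked up via sched.h */

        static int frob_device(void)    /* illustrative caller */
        {
                if (!capable(CAP_SYS_ADMIN))
                        return -EPERM;
                /* ... privileged work ... */
                return 0;
        }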
     
  • Uninline capable(). Saves 2K of kernel text on a generic .config, and 1K on a
    tiny config. In addition, it makes the use of capable() more consistent
    between CONFIG_SECURITY and !CONFIG_SECURITY.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • When a kprobes module is written in such a way that probes are inserted on
    itself, unloading that module was not possible due to reference counting
    on the same module.

    The patch below adds a check and increments the module refcount only if
    it is not a self-probed module.

    We need to allow modules to probe themselves for kprobes performance
    measurements.

    This patch has been tested on x86_64, ppc64 and IA64 architectures.

    Signed-off-by: Anil S Keshavamurthy
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Keshavamurthy Anil S
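
    The check in sketch form (module_text_address() and try_module_get()
    are helpers of that era; the surrounding logic is illustrative, not the
    verbatim patch):

        /* sketch, in register_kprobe(): take a reference on the probed
         * module only when the probe is registered from *another*
         * module, so a self-probing module can still be unloaded */
        struct module *probed_mod, *calling_mod;

        probed_mod = module_text_address((unsigned long)p->addr);
        if (probed_mod) {
                calling_mod = module_text_address(called_from);
                if (calling_mod && calling_mod != probed_mod)
                        if (!try_module_get(probed_mod))
                                return -EINVAL;
        }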
     
  • Let's switch mutex_debug_check_no_locks_freed() to take (addr, len) as
    arguments instead, since all its callers were just calculating the 'to'
    address for themselves anyway... (and sometimes doing so badly).

    Signed-off-by: David Woodhouse
    Acked-by: Ingo Molnar
    Signed-off-by: Linus Torvalds

    David Woodhouse
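
    In caller terms the change looks like this (objp and obj_size are
    illustrative names for a freed object and its size):

        /* old interface: every caller computed the 'to' address itself */
        mutex_debug_check_no_locks_freed(objp, objp + obj_size);

        /* new interface: pass the start address and the length */
        mutex_debug_check_no_locks_freed(objp, obj_size);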