09 Jan, 2006
40 commits
-
The important code paths through alloc_pages_current() and alloc_page_vma(),
by which most kernel page allocations go, both called
cpuset_update_current_mems_allowed(), which in turn called refresh_mems().
-Both- of these latter two routines did a tasklock, got the tasks cpuset
pointer, and checked for out of date cpuset->mems_generation.That was a silly duplication of code and waste of CPU cycles on an important
code path.Consolidated those two routines into a single routine, called
cpuset_update_task_memory_state(), since it updates more than just
mems_allowed.Changed all callers of either routine to call the new consolidated routine.
Signed-off-by: Paul Jackson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Fix obscure, never seen in real life, cpuset fork race. The cpuset_fork()
call in fork.c was setting up the correct task->cpuset pointer after the
tasklist_lock was dropped, which briefly exposed the newly forked process with
an unsafe (copied from parent without locks or usage counter increment) cpuset
pointer.In theory, that exposed cpuset pointer could have been pointing at a cpuset
that was already freed and removed, and in theory another task that had been
sitting on the tasklist_lock waiting to scan the task list could have raced
down the entire tasklist, found our new child at the far end, and dereferenced
that bogus cpuset pointer.To fix, setup up the correct cpuset pointer in the new child by calling
cpuset_fork() before the new task is linked into the tasklist, and with that,
add a fork failure case, to dereference that cpuset, if the fork fails along
the way, after cpuset_fork() was called.Had to remove a BUG_ON() from cpuset_exit(), because it was no longer valid -
the call to cpuset_exit() from a failed fork would not have PF_EXITING set.Signed-off-by: Paul Jackson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Restructure code layout of the kernel/cpuset.c update_nodemask() routine,
removing embedded returns and nested if's in favor of goto completion labels.
This is being done in anticipation of adding more logic to this routine, which
will favor the goto style structure.Signed-off-by: Paul Jackson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Four trivial cpuset fixes: remove extra spaces, remove useless initializers,
mark one __read_mostly.Signed-off-by: Paul Jackson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Remove documentation for the cpuset 'marker_pid' feature, that was in the
patch "cpuset: change marker for relative numbering" That patch was previously
pulled from *-mm at my (pj) request.Signed-off-by: Paul Jackson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Document the additional cpuset features:
notify_on_release
marker_pid
memory_pressure
memory_pressure_enabledRearrange and improve formatting of existing documentation for
cpu_exclusive and mem_exclusive features.Signed-off-by: Paul Jackson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Provide a simple per-cpuset metric of memory pressure, tracking the -rate-
that the tasks in a cpuset call try_to_free_pages(), the synchronous
(direct) memory reclaim code.This enables batch managers monitoring jobs running in dedicated cpusets to
efficiently detect what level of memory pressure that job is causing.This is useful both on tightly managed systems running a wide mix of
submitted jobs, which may choose to terminate or reprioritize jobs that are
trying to use more memory than allowed on the nodes assigned them, and with
tightly coupled, long running, massively parallel scientific computing jobs
that will dramatically fail to meet required performance goals if they
start to use more memory than allowed to them.This patch just provides a very economical way for the batch manager to
monitor a cpuset for signs of memory pressure. It's up to the batch
manager or other user code to decide what to do about it and take action.==> Unless this feature is enabled by writing "1" to the special file
/dev/cpuset/memory_pressure_enabled, the hook in the rebalance
code of __alloc_pages() for this metric reduces to simply noticing
that the cpuset_memory_pressure_enabled flag is zero. So only
systems that enable this feature will compute the metric.Why a per-cpuset, running average:
Because this meter is per-cpuset, rather than per-task or mm, the
system load imposed by a batch scheduler monitoring this metric is
sharply reduced on large systems, because a scan of the tasklist can be
avoided on each set of queries.Because this meter is a running average, instead of an accumulating
counter, a batch scheduler can detect memory pressure with a single
read, instead of having to read and accumulate results for a period of
time.Because this meter is per-cpuset rather than per-task or mm, the
batch scheduler can obtain the key information, memory pressure in a
cpuset, with a single read, rather than having to query and accumulate
results over all the (dynamically changing) set of tasks in the cpuset.A per-cpuset simple digital filter (requires a spinlock and 3 words of data
per-cpuset) is kept, and updated by any task attached to that cpuset, if it
enters the synchronous (direct) page reclaim code.A per-cpuset file provides an integer number representing the recent
(half-life of 10 seconds) rate of direct page reclaims caused by the tasks
in the cpuset, in units of reclaims attempted per second, times 1000.Signed-off-by: Paul Jackson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Finish converting mm/mempolicy.c from bitmaps to nodemasks. The previous
conversion had left one routine using bitmaps, since it involved a
corresponding change to kernel/cpuset.cFix that interface by replacing with a simple macro that calls nodes_subset(),
or if !CONFIG_CPUSET, returns (1).Signed-off-by: Paul Jackson
Cc: Christoph Lameter
Cc: Andi Kleen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Fix the default behaviour for the remap operators in bitmap, cpumask and
nodemask.As previously submitted, the pair of masks defined a map of the
positions of the set bits in A to the corresponding bits in B. This is still
true.The issue is how to map the other positions, corresponding to the unset (0)
bits in A. As previously submitted, they were all mapped to the first set bit
position in B, a constant map.When I tried to code per-vma mempolicy rebinding using these remap operators,
I realized this was wrong.This patch changes the default to map all the unset bit positions in A to the
same positions in B, the identity map.For example, if A has bits 4-7 set, and B has bits 9-12 set, then the map
defined by the pair maps each bit position in the first 32 bits as
follows:0 ==> 0
...
3 ==> 3
4 ==> 9
...
7 ==> 12
8 ==> 8
9 ==> 9
...
31 ==> 31This now corresponds to the typical behaviour desired when migrating pages and
policies from one cpuset to another.The pages on nodes within the original cpuset, and the references in memory
policies to nodes within the original cpuset, are migrated to the
corresponding cpuset-relative nodes in the destination cpuset. Other pages
and node references are left untouched.Signed-off-by: Paul Jackson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
configurable replacement for slab allocator
This adds a CONFIG_SLAB option under CONFIG_EMBEDDED. When CONFIG_SLAB is
disabled, the kernel falls back to using the 'SLOB' allocator.SLOB is a traditional K&R/UNIX allocator with a SLAB emulation layer,
similar to the original Linux kmalloc allocator that SLAB replaced. It's
signicantly smaller code and is more memory efficient. But like all
similar allocators, it scales poorly and suffers from fragmentation more
than SLAB, so it's only appropriate for small systems.It's been tested extensively in the Linux-tiny tree. I've also
stress-tested it with make -j 8 compiles on a 3G SMP+PREEMPT box (not
recommended).Here's a comparison for otherwise identical builds, showing SLOB saving
nearly half a megabyte of RAM:$ size vmlinux*
text data bss dec hex filename
3336372 529360 190812 4056544 3de5e0 vmlinux-slab
3323208 527948 190684 4041840 3dac70 vmlinux-slob$ size mm/{slab,slob}.o
text data bss dec hex filename
13221 752 48 14021 36c5 mm/slab.o
1896 52 8 1956 7a4 mm/slob.o/proc/meminfo:
SLAB SLOB delta
MemTotal: 27964 kB 27980 kB +16 kB
MemFree: 24596 kB 25092 kB +496 kB
Buffers: 36 kB 36 kB 0 kB
Cached: 1188 kB 1188 kB 0 kB
SwapCached: 0 kB 0 kB 0 kB
Active: 608 kB 600 kB -8 kB
Inactive: 808 kB 812 kB +4 kB
HighTotal: 0 kB 0 kB 0 kB
HighFree: 0 kB 0 kB 0 kB
LowTotal: 27964 kB 27980 kB +16 kB
LowFree: 24596 kB 25092 kB +496 kB
SwapTotal: 0 kB 0 kB 0 kB
SwapFree: 0 kB 0 kB 0 kB
Dirty: 4 kB 12 kB +8 kB
Writeback: 0 kB 0 kB 0 kB
Mapped: 560 kB 556 kB -4 kB
Slab: 1756 kB 0 kB -1756 kB
CommitLimit: 13980 kB 13988 kB +8 kB
Committed_AS: 4208 kB 4208 kB 0 kB
PageTables: 28 kB 28 kB 0 kB
VmallocTotal: 1007312 kB 1007312 kB 0 kB
VmallocUsed: 48 kB 48 kB 0 kB
VmallocChunk: 1007264 kB 1007264 kB 0 kB(this work has been sponsored in part by CELF)
From: Ingo Molnar
Fix 32-bitness bugs in mm/slob.c.
Signed-off-by: Matt Mackall
Signed-off-by: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Add mm/util.c for functions common between SLAB and SLOB.
Signed-off-by: Matt Mackall
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Make DEBUG_SLAB depend on SLAB.
Signed-off-by: Ingo Molnar
Cc: Matt Mackall
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Shrink the height of a radix tree when it is partially truncated - we only do
shrinkage of full truncation at present.Signed-off-by: Nick Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Correctly determine the tags to be cleared in radix_tree_delete() so we
don't keep moving up the tree clearing tags that we don't need to. For
example, if a tag is simply not set in the deleted item, nor anywhere up
the tree, radix_tree_delete() would attempt to clear it up the entire
height of the tree.Also, tag_set() was made conditional so as not to dirty too many cachelines
high up in the radix tree. Instead, put this logic into
radix_tree_tag_set().Signed-off-by: Nick Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Introduce helper any_tag_set() rather than repeat the same code sequence 4
times.Signed-off-by: Nick Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The latest set of signal-RCU patches does not use get_task_struct_rcu().
Attached is a patch that removes it.Signed-off-by: "Paul E. McKenney"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Some simplification in checking signal delivery against concurrent exit.
Instead of using get_task_struct_rcu(), which increments the task_struct
reference count, check the reference count after acquiring sighand lock.Signed-off-by: "Paul E. McKenney"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
RCU tasklist_lock and RCU signal handling: send signals RCU-read-locked
instead of tasklist_lock read-locked. This is a scalability improvement on
SMP and a preemption-latency improvement under PREEMPT_RCU.Signed-off-by: Paul E. McKenney
Signed-off-by: Ingo Molnar
Acked-by: William Irwin
Cc: Roland McGrath
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Applying RCU to the task structure broke oprofile, because
free_task_notify() can now be called from softirq. This means that the
task_mortuary lock must be acquired with irq disabled in order to avoid
intermittent self-deadlock. Since irq is now disabled, the critical
section within process_task_mortuary() has been restructured to be O(1) in
order to maximize scalability and minimize realtime latency degradation.Kudos to Wu Fengguang for finding this problem!
CC: Wu Fengguang
Cc: Philippe Elie
Cc: John Levon
Signed-off-by: "Paul E. McKenney"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
If MODE_TT=n, MODE_SKAS must be y.
Signed-off-by: Adrian Bunk
Acked-by: Jeff Dike
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This fixes some mangled whitespace added by the earlier trap_user.c patch.
Signed-off-by: Jeff Dike
Cc: Paolo 'Blaisorblade' Giarrusso
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Most of the architectures have the same asm/futex.h. This consolidates them
into asm-generic, with the arches including it from their own asm/futex.h.In the case of UML, this reverts the old broken futex.h and goes back to using
the same one as almost everyone else.Signed-off-by: Jeff Dike
Cc: Paolo 'Blaisorblade' Giarrusso
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The serial UML OS-abstraction layer patch (um/kernel dir).
This joins trap_user.c and trap_kernel.c files.
Signed-off-by: Gennady Sharapov
Signed-off-by: Jeff Dike
Cc: Paolo 'Blaisorblade' Giarrusso
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The serial UML OS-abstraction layer patch (um/kernel dir).
This moves all systemcalls from trap_user.c file under os-Linux dir
Signed-off-by: Gennady Sharapov
Signed-off-by: Jeff Dike
Cc: Paolo 'Blaisorblade' Giarrusso
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The serial UML OS-abstraction layer patch (um/kernel dir).
This moves all systemcalls from signal_user.c file under os-Linux dir
Signed-off-by: Gennady Sharapov
Signed-off-by: Jeff Dike
Cc: Paolo 'Blaisorblade' Giarrusso
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
ds1620 module is using gpio_read symbol, so works only if "built-in" symbol
needs to be exported from the kernel imageSigned-off-by: Woody Suwalski
Cc: Russell King
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Kill L1_CACHE_SHIFT from all arches. Since L1_CACHE_SHIFT_MAX is not used
anymore with the introduction of INTERNODE_CACHE, kill L1_CACHE_SHIFT_MAX.Signed-off-by: Ravikiran Thirumalai
Signed-off-by: Shai Fultheim
Signed-off-by: Andi Kleen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
____cacheline_maxaligned_in_smp is currently used to align critical structures
and avoid false sharing. It uses per-arch L1_CACHE_SHIFT_MAX and people find
L1_CACHE_SHIFT_MAX useless.However, we have been using ____cacheline_maxaligned_in_smp to align
structures on the internode cacheline size. As per Andi's suggestion,
following patch kills ____cacheline_maxaligned_in_smp and introduces
INTERNODE_CACHE_SHIFT, which defaults to L1_CACHE_SHIFT for all arches.
Arches needing L3/Internode cacheline alignment can define
INTERNODE_CACHE_SHIFT in the arch asm/cache.h. Patch replaces
____cacheline_maxaligned_in_smp with ____cacheline_internodealigned_in_smpWith this patch, L1_CACHE_SHIFT_MAX can be killed
Signed-off-by: Ravikiran Thirumalai
Signed-off-by: Shai Fultheim
Signed-off-by: Andi Kleen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Fix an uninitialised variable warning in the serverworks driver.
Signed-off-by: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Fix an uninitialised variable warning in the atm nicstar driver.
Signed-off-by: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Fix a number of miscellanous items:
(1) Declare lock sections in the linker script.
(2) Recurse in the correct manner in the arch makefile.
(3) asm/bug.h requires asm/linkage.h to be included first. One C file puts
asm/bug.h first.(4) Add an empty RTC header file to avoid missing header file errors.
(5) sg_dma_address() should use the dma_address member of a scatter list.
(6) Add trivial pci_unmap support.
(7) Add pgprot_noncached()
(8) Discard u_quad_t.
(9) Use ~0UL rather than ULONG_MAX in unistd.h in case the latter isn't
declared.(10) Add an empty VGA header file to avoid missing header file errors.
(11) Add an XOR header file to use the generic XOR stuff.
Signed-off-by: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Make the get_user macro cast the source pointer to an appropriate type for the
specified size.Signed-off-by: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Force the 8230 serial driver to be built in if the on-CPU UARTs are to be
used. It can't be used as a module because the arch setup needs to call into
it.Signed-off-by: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Fix PCMCIA configuration for FRV by including the stock PCMCIA configuration
description file.Signed-off-by: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Implement pci_iomap() for FRV.
Signed-off-by: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Add stubs for FRV module support.
Signed-off-by: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Supply various I/O access primitives that are missing for the FRV arch:
(*) mmiowb()
(*) read*_relaxed()
(*) ioport_*map()
(*) ioread*(), iowrite*(), ioread*_rep() and iowrite*_rep()
(*) pci_io*map()
(*) check_signature()
The patch also makes __is_PCI_addr() more efficient.
Signed-off-by: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Fix the exception table handling so that modules exceptions are dealt with.
Signed-off-by: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Export a number of features required to build all the modules. It also
implements the following simple features:(*) csum_partial_copy_from_user() for MMU as well as no-MMU.
(*) __ucmpdi2().
so that they can be exported too.
Signed-off-by: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Drop support for debugging features that aren't supported on FRV:
(*) EARLY_PRINTK
The on-chip UARTs are set up early enough that this isn't required,
and VGA support isn't available. There's also a gdbstub available.(*) DEBUG_PAGEALLOC
This can't be easily be done since we use huge static mappings to
cover the kernel, not pages.Signed-off-by: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds