17 Jul, 2007
1 commit
-
Make zonelist creation policy selectable from sysctl/boot option v6.
This patch makes NUMA's zonelist (of pgdat) order selectable.
Available order are Default(automatic)/ Node-based / Zone-based.[Default Order]
The kernel selects Node-based or Zone-based order automatically.[Node-based Order]
This policy treats the locality of memory as the most important parameter.
Zonelist order is created by each zone's locality. This means lower zones
(ex. ZONE_DMA) can be used before higher zone (ex. ZONE_NORMAL) exhausion.
IOW. ZONE_DMA will be in the middle of zonelist.
current 2.6.21 kernel uses this.Pros.
* A user can expect local memory as much as possible.
Cons.
* lower zone will be exhansted before higher zone. This may cause OOM_KILL.Maybe suitable if ZONE_DMA is relatively big and you never see OOM_KILL
because of ZONE_DMA exhaution and you need the best locality.(example)
assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL.*node(0)'s memory allocation order:
node(0)'s NORMAL -> node(0)'s DMA -> node(1)'s NORMAL.
*node(1)'s memory allocation order:
node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA.
[Zone-based order]
This policy treats the zone type as the most important parameter.
Zonelist order is created by zone-type order. This means lower zone
never be used bofere higher zone exhaustion.
IOW. ZONE_DMA will be always at the tail of zonelist.Pros.
* OOM_KILL(bacause of lower zone) occurs only if the whole zones are exhausted.
Cons.
* memory locality may not be best.(example)
assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL.*node(0)'s memory allocation order:
node(0)'s NORMAL -> node(1)'s NORMAL -> node(0)'s DMA.
*node(1)'s memory allocation order:
node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA.
bootoption "numa_zonelist_order=" and proc/sysctl is supporetd.
command:
%echo N > /proc/sys/vm/numa_zonelist_orderWill rebuild zonelist in Node-based order.
command:
%echo Z > /proc/sys/vm/numa_zonelist_orderWill rebuild zonelist in Zone-based order.
Thanks to Lee Schermerhorn, he gives me much help and codes.
[Lee.Schermerhorn@hp.com: add check_highest_zone to build_zonelists_in_zone_order]
[akpm@linux-foundation.org: build fix]
Signed-off-by: KAMEZAWA Hiroyuki
Cc: Lee Schermerhorn
Cc: Christoph Lameter
Cc: Andi Kleen
Cc: "jesse.barnes@intel.com"
Signed-off-by: Lee Schermerhorn
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
12 Jul, 2007
1 commit
-
Add a new security check on mmap operations to see if the user is attempting
to mmap to low area of the address space. The amount of space protected is
indicated by the new proc tunable /proc/sys/vm/mmap_min_addr and defaults to
0, preserving existing behavior.This patch uses a new SELinux security class "memprotect." Policy already
contains a number of allow rules like a_t self:process * (unconfined_t being
one of them) which mean that putting this check in the process class (its
best current fit) would make it useless as all user processes, which we also
want to protect against, would be allowed. By taking the memprotect name of
the new class it will also make it possible for us to move some of the other
memory protect permissions out of 'process' and into the new class next time
we bump the policy version number (which I also think is a good future idea)Acked-by: Stephen Smalley
Acked-by: Chris Wright
Signed-off-by: Eric Paris
Signed-off-by: James Morris
10 Jul, 2007
3 commits
-
This patch removes xip_file_sendfile, the sendfile implementation for
xip without replacement. Those customers that use xip on s390 are not
using sendfile() as far as we know, and so far s390 is the only platform
this could potentially be used on so far.
Having sendfile is not a popular feature for execute in place file
systems, however we have a working implementation of splice_read() based
on fs/splice.c if anyone asks for it.
At this point in time, it does not seem preferable to merge
splice_read() for xip because it causes extra maintenence effort due to
code duplication and it requires struct page behind the xip memory
segment. We'd like to get rid of that in favor of supporting flash based
embedded platforms (Monta Vista work) soon.Signed-off-by: Carsten Otte
Signed-off-by: Jens Axboe -
Remove shmem_file_sendfile and resurrect shmem_readpage, as used by tmpfs
to support loop and sendfile in 2.4 and 2.5. Now tmpfs can support splice,
loop and sendfile in the simplest way, using generic_file_splice_read and
generic_file_splice_write (with the aid of shmem_prepare_write).We could make some efficiency tweaks later, if there's a real need;
but this is stable and works well as is.Signed-off-by: Hugh Dickins
Signed-off-by: Jens Axboe -
It's no longer used.
Signed-off-by: Jens Axboe
09 Jul, 2007
1 commit
-
Fix a post-2.6.21 regression.
read_cache_page_async() has two invocations of mark_page_accessed() which will
launch pages right onto the active list.Remove the first one, keeping the latter one. This avoids marking unwanted
pages active (in the retry loop).Signed-off-by: Peter Zijlstra
Acked-by: Nick Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
07 Jul, 2007
2 commits
-
kmem_cache_open is static. EXPORT_SYMBOL was leftover from some earlier
time period where kmem_cache_open was usable outside of slub.(Fixes powerpc build error)
Signed-off-by: Chrsitoph Lameter
Cc: Johannes Berg
Cc: Benjamin Herrenschmidt
Cc: Paul Mackerras
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Line up the vmstat_text with zone_stat_item
enum zone_stat_item {
/* First 128 byte cacheline (assuming 64 bit words) */
NR_FREE_PAGES,
NR_INACTIVE,
NR_ACTIVE,We current have nr_active and nr_inactive reversed.
[ "OK with patch, though using initializers canbe handy to prevent such
things in future:static const char * const vmstat_text[] = {
[NR_FREE_PAGES] = "nr_free_pages",
..."
- Alexey ]Signed-off-by: Peter Zijlstra
Acked-by: Alexey Dobriyan
Signed-off-by: Linus Torvalds
06 Jul, 2007
1 commit
-
Commit b46b8f19c9cd435ecac4d9d12b39d78c137ecd66 fixed a couple of bugs
by switching the redzone to 64 bits. Unfortunately, it neglected to
ensure that the _second_ redzone, after the slab object, is aligned
correctly. This caused illegal instruction faults on sparc32, which for
some reason not entirely clear to me are not trapped and fixed up.Two things need to be done to fix this:
- increase the object size, rounding up to alignof(long long) so
that the second redzone can be aligned correctly.
- If SLAB_STORE_USER is set but alignof(long long)==8, allow a
full 64 bits of space for the user word at the end of the buffer,
even though we may not _use_ the whole 64 bits.This patch should be a no-op on any 64-bit architecture or any 32-bit
architecture where alignof(long long) == 4. Of the others, it's tested
on ppc32 by myself and a very similar patch was tested on sparc32 by
Mark Fortescue, who reported the new problem.Also, fix the conditions for FORCED_DEBUG, which hadn't been adjusted to
the new sizes. Again noticed by Mark.Signed-off-by: David Woodhouse
Signed-off-by: Linus Torvalds
04 Jul, 2007
1 commit
-
If we move the local_irq_enable() to the end of the function then
add_partial() in early_kmem_cache_node_alloc() will be called
with interrupts disabled like during regular operations.This makes lockdep happy.
Signed-off-by: Christoph Lameter
Tested-by: Andre Noll
Acked-by: Ingo Molnar
Signed-off-by: Linus Torvalds
02 Jul, 2007
1 commit
-
We agreed to remove the WARN_ON_ONCE before 2.6.22 is released.
Signed-off-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
29 Jun, 2007
1 commit
-
validate_anon_vma gave a useful check on the integrity of the anon_vma list
when Andrea was developing obj rmap; but it was not enabled in SLES9
itself, nor in mainline, until Nick changed commented-out RMAP_DEBUG to
configurable CONFIG_DEBUG_VM in 2.6.17. Now Petr Vandrovec reports that
its BUG_ON(mapcount > 100000) can easily crash a CONFIG_DEBUG_VM=y system.That limit was just an arbitrary number to protect against an infinite
loop. We could raise it to something enormous (depending on sizeof struct
vma and size of memory?); but I rather think validate_anon_vma has outlived
its usefulness, and is better just removed - which gives a magnificent
performance boost to anything like Petr's test program ;)Of course, a very long anon_vma list is bad news for preemption latency,
and I believe there has been one recent report of such: let's not forget
that, but validate_anon_vma only makes it worse not better.Signed-off-by: Hugh Dickins
Cc: Petr Vandrovec
Acked-by: Nick Piggin
Cc: Andrea Arcangeli
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
24 Jun, 2007
1 commit
-
If slabs are allocated or freed from a large set of call sites (typical for
the kmalloc area) then we may create more output than fits into a single
PAGE and sysfs only gives us one page. The output should be truncated.
This patch fixes the checks to do the truncation properly.Signed-off-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
22 Jun, 2007
1 commit
-
Function expand_upwards() did not guarded against wrapping
around to address 0. This fixes the adjtimex02 testcase from
the Linux Test Project on a 32bit PARISC kernel.[expand_upwards is only used on parisc and ia64; it looks like it does
the right thing on both. --kyle]Signed-off-by: Helge Deller
Cc: Tony Luck
Signed-off-by: Kyle McMartin
17 Jun, 2007
3 commits
-
If ARCH_KMALLOC_MINALIGN is set to a value greater than 8 (SLUBs smallest
kmalloc cache) then SLUB may generate duplicate slabs in sysfs (yes again)
because the object size is padded to reach ARCH_KMALLOC_MINALIGN. Thus the
size of the small slabs is all the same.No arch sets ARCH_KMALLOC_MINALIGN larger than 8 though except mips which
for some reason wants a 128 byte alignment.This patch increases the size of the smallest cache if
ARCH_KMALLOC_MINALIGN is greater than 8. In that case more and more of the
smallest caches are disabled.If we do that then the count of the active general caches that is displayed
on boot is not correct anymore since we may skip elements of the kmalloc
array. So count them separately.This approach was tested by Havard yesterday.
Signed-off-by: Christoph Lameter
Cc: Haavard Skinnemoen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Some changes done a while ago to avoid pounding on ptep_set_access_flags and
update_mmu_cache in some race situations break sun4c which requires
update_mmu_cache() to always be called on minor faults.This patch reworks ptep_set_access_flags() semantics, implementations and
callers so that it's now responsible for returning whether an update is
necessary or not (basically whether the PTE actually changed). This allow
fixing the sparc implementation to always return 1 on sun4c.[akpm@linux-foundation.org: fixes, cleanups]
Signed-off-by: Benjamin Herrenschmidt
Cc: Hugh Dickins
Cc: David Miller
Cc: Mark Fortescue
Acked-by: William Lee Irwin III
Cc: "Luck, Tony"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The data structure to manage the information gathered about functions
allocating and freeing objects is allocated when the list_lock has already
been taken. We need to allocate with GFP_ATOMIC instead of GFP_KERNEL.Signed-off-by: Christoph Lameter
Cc: Mel Gorman
Cc: Andy Whitcroft
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
16 Jun, 2007
1 commit
-
When building with memory hotplug enabled and cpu hotplug disabled, we
end up with the following section mismatch:WARNING: mm/built-in.o(.text+0x4e58): Section mismatch: reference to
.init.text: (between 'free_area_init_node' and '__build_all_zonelists')This happens as a result of:
-> free_area_init_node()
-> free_area_init_core()
-> zone_pcp_init() zone_batchsize()
Acked-by: Yasunori Goto
Signed-off-by: Linus Torvalds--
mm/page_alloc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
09 Jun, 2007
4 commits
-
into the appropriate #ifdef.
Signed-off-by: Stephen Rothwell
Cc: Yasunori Goto
Cc: Andy Whitcroft
Cc: Badari Pulavarty
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Instead of returning the smallest available object return ZERO_SIZE_PTR.
A ZERO_SIZE_PTR can be legitimately used as an object pointer as long as it
is not deferenced. The dereference of ZERO_SIZE_PTR causes a distinctive
fault. kfree can handle a ZERO_SIZE_PTR in the same way as NULL.This enables functions to use zero sized object. e.g. n = number of objects.
objects = kmalloc(n * sizeof(object));
for (i = 0; i < n; i++)
objects[i].x = y;kfree(objects);
Signed-off-by: Christoph Lameter
Acked-by: Pekka Enberg
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
cache_free_alien must be called regardless if we use alien caches or not.
cache_free_alien() will do the right thing if there are no alien caches
available.Signed-off-by: Christoph Lameter
Cc: Paul Mundt
Acked-by: Pekka J Enberg
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Randy Dunlap reports that a tmpfs, mounted with NUMA mpol= specifying an
offline node, crashes as soon as data is allocated upon it. Now restrict it
to online nodes, where before it restricted to MAX_NUMNODES.Signed-off-by: Hugh Dickins
Cc: Robin Holt
Cc: Christoph Lameter
Cc: Andi Kleen
Tested-and-acked-by: Randy Dunlap
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
01 Jun, 2007
3 commits
-
Hotplug callbacks are performed with interrupts enabled. Slub requires
interrupts to be disabled for flushing caches.Signed-off-by: Christoph Lameter
Cc: Michal Piotrowski
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
zone->present_pages is updated in online_pages(). But, __add_zone() can be
called twice or more before calling online_pages(). So,
init_currenty_empty_zone() can be called unnecessary times. It is cause of
memory leak of zone's wait_table.Signed-off-by: Yasunori Goto
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
On systems with huge amount of physical memory, VFS cache and memory memmap
may eat all available system memory under 4G, then the system may fail to
allocate swiotlb bounce buffer.There was a fix for this issue in arch/x86_64/mm/numa.c, but that fix dose
not cover sparsemem model.This patch add fix to sparsemem model by first try to allocate memmap above
4G.Signed-off-by: Zou Nan hai
Acked-by: Suresh Siddha
Cc: Andi Kleen
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
31 May, 2007
2 commits
-
Fix support for discontinuous memory
Signed-off-by: Roman Zippel
Signed-off-by: Geert Uytterhoeven
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
We need this patch in ASAP. Patch fixes the mysterious hang that remained
on some particular configurations with lockdep on after the first fix that
moved the #idef CONFIG_SLUB_DEBUG to the right location. See
http://marc.info/?t=117963072300001&r=1&w=2The kmem_cache_node cache is very special because it is needed for NUMA
bootstrap. Under certain conditions (like for example if lockdep is
enabled and significantly increases the size of spinlock_t) the structure
may become exactly the size as one of the larger caches in the kmalloc
array.That early during bootstrap we cannot perform merging properly. The unique
id for the kmem_cache_node cache will match one of the kmalloc array.
Sysfs will complain about a duplicate directory entry. All of this occurs
while the console is not yet fully operational. Thus boot may appear to be
silently failing.The kmem_cache_node cache is very special. During early boostrap the main
allocation function is not operational yet and so we have to run our own
small special alloc function during early boot. It is also special in that
it is never freed.We really do not want any merging on that cache. Set the refcount -1 and
forbid merging of slabs that have a negative refcount.Signed-off-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
24 May, 2007
3 commits
-
The check for super sized slabs where we can no longer move the free
pointer behind the object for debugging purposes etc is accessing a
field that is not setup yet. We must use objsize here since the size of
the slab has not been determined yet.The effect of this is that a global slab shrink via "slabinfo -s" will
show errors about offsets being wrong if booted with slub_debug.
Potentially there are other troubles with huge slabs under slub_debug
because the calculated free pointer offset is truncated.Signed-off-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
mm/page_alloc.c:931: warning: 'setup_nr_node_ids' defined but not used
This is now the only (!) compiler warning I get in my UML build :)
Signed-off-by: Miklos Szeredi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The object size calculation is wrong if !CONFIG_SLUB_DEBUG because the
#ifdef CONFIG_SLUB_DEBUG is now switching off the size adjustments for
DESTROY_BY_RCU and ctor.Signed-off-by: Christoph Lameter
Acked-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
22 May, 2007
2 commits
-
* git://git.kernel.org/pub/scm/linux/kernel/git/sam/kbuild-fix:
mm/slab: fix section mismatch warning
mm: fix section mismatch warnings
init/main: use __init_refok to fix section mismatch
kbuild: introduce __init_refok/__initdata_refok to supress section mismatch warnings
all-archs: consolidate .data section definition in asm-generic
all-archs: consolidate .text section definition in asm-generic
kbuild: add "Section mismatch" warning whitelist for powerpc
kbuild: make better section mismatch reports on i386, arm and mips
kbuild: make modpost section warnings clearer
kconfig: search harder for curses library in check-lxdialog.sh
kbuild: include limits.h in sumversion.c for PATH_MAX
powerpc: Fix the MODALIAS generation in modpost for of devices -
First thing mm.h does is including sched.h solely for can_do_mlock() inline
function which has "current" dereference inside. By dealing with can_do_mlock()
mm.h can be detached from sched.h which is good. See below, why.This patch
a) removes unconditional inclusion of sched.h from mm.h
b) makes can_do_mlock() normal function in mm/mlock.c
c) exports can_do_mlock() to not break compilation
d) adds sched.h inclusions back to files that were getting it indirectly.
e) adds less bloated headers to some files (asm/signal.h, jiffies.h) that were
getting them indirectlyNet result is:
a) mm.h users would get less code to open, read, preprocess, parse, ... if
they don't need sched.h
b) sched.h stops being dependency for significant number of files:
on x86_64 allmodconfig touching sched.h results in recompile of 4083 files,
after patch it's only 3744 (-8.3%).Cross-compile tested on
all arm defconfigs, all mips defconfigs, all powerpc defconfigs,
alpha alpha-up
arm
i386 i386-up i386-defconfig i386-allnoconfig
ia64 ia64-up
m68k
mips
parisc parisc-up
powerpc powerpc-up
s390 s390-up
sparc sparc-up
sparc64 sparc64-up
um-x86_64
x86_64 x86_64-up x86_64-defconfig x86_64-allnoconfigas well as my two usual configs.
Signed-off-by: Alexey Dobriyan
Signed-off-by: Linus Torvalds
19 May, 2007
2 commits
-
Use the new __init_refok marker to avoid the
section mismatch warning from slab.cSigned-off-by: Sam Ravnborg
-
modpost had two cases hardcoded for mm/
Shift over to __init_refok and kill the
hardcoded function names in modpost.This has the drawback that the functions
will always be kept no matter configuration.
With previous code the function were placed in
init section if configuration allowed it.Signed-off-by: Sam Ravnborg
17 May, 2007
6 commits
-
Re-introduce rmap verification patches that Hugh removed when he removed
PG_map_lock. PG_map_lock actually isn't needed to synchronise access to
anonymous pages, because PG_locked and PTL together already do.These checks were important in discovering and fixing a rare rmap corruption
in SLES9.Signed-off-by: Nick Piggin
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
__vunmap doesn't seem to be used outside of mm/vmalloc.c, and has
no prototype in any header so let's make it staticSigned-off-by: Benjamin Herrenschmidt
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Currently we have a maze of configuration variables that determine the
maximum slab size. Worst of all it seems to vary between SLAB and SLUB.So define a common maximum size for kmalloc. For conveniences sake we use
the maximum size ever supported which is 32 MB. We limit the maximum size
to a lower limit if MAX_ORDER does not allow such large allocations.For many architectures this patch will have the effect of adding large
kmalloc sizes. x86_64 adds 5 new kmalloc sizes. So a small amount of
memory will be needed for these caches (contemporary SLAB has dynamically
sizeable node and cpu structure so the waste is less than in the past)Most architectures will then be able to allocate object with sizes up to
MAX_ORDER. We have had repeated breakage (in fact whenever we doubled the
number of supported processors) on IA64 because one or the other struct
grew beyond what the slab allocators supported. This will avoid future
issues and f.e. avoid fixes for 2k and 4k cpu support.CONFIG_LARGE_ALLOCS is no longer necessary so drop it.
It fixes sparc64 with SLAB.
Signed-off-by: Christoph Lameter
Signed-off-by: "David S. Miller"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Consolidate functionality into the #ifdef section.
Extract tracing into one subroutine.
Move object debug processing into the #ifdef section so that the
code in __slab_alloc and __slab_free becomes minimal.Reduce number of functions we need to provide stubs for in the !SLUB_DEBUG case.
Signed-off-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
SLAB_CTOR_CONSTRUCTOR is always specified. No point in checking it.
Signed-off-by: Christoph Lameter
Cc: David Howells
Cc: Jens Axboe
Cc: Steven French
Cc: Michael Halcrow
Cc: OGAWA Hirofumi
Cc: Miklos Szeredi
Cc: Steven Whitehouse
Cc: Roman Zippel
Cc: David Woodhouse
Cc: Dave Kleikamp
Cc: Trond Myklebust
Cc: "J. Bruce Fields"
Cc: Anton Altaparmakov
Cc: Mark Fasheh
Cc: Paul Mackerras
Cc: Christoph Hellwig
Cc: Jan Kara
Cc: David Chinner
Cc: "David S. Miller"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The atomicity when handling flags in SLUB is not necessary since both flags
used by SLUB are not updated in a racy way. Flag updates are either done
during slab creation or destruction or under slab_lock. Some of these flags
do not have the non atomic variants that we need. So define our own.Signed-off-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds