27 Sep, 2006

27 commits

  • Make futexes work under NOMMU conditions.

    This can be tested by running this in one shell:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <sys/syscall.h>
    #include <linux/futex.h>

    #define SYSERROR(X, Y) \
    do { if ((long)(X) == -1L) { perror(Y); exit(1); }} while(0)

    /* glibc provides no futex() wrapper; go through syscall() directly */
    static int futex(int *uaddr, int op, int val, void *timeout,
                     int *uaddr2, int val3)
    {
        return syscall(SYS_futex, uaddr, op, val, timeout, uaddr2, val3);
    }

    int main(void)
    {
        int shmid, tmp, *f, n;

        shmid = shmget(23, 4, IPC_CREAT|0666);
        SYSERROR(shmid, "shmget");

        f = shmat(shmid, NULL, 0);
        SYSERROR(f, "shmat");

        n = *f;
        printf("WAIT: %p{%x}\n", f, n);
        tmp = futex(f, FUTEX_WAIT, n, NULL, NULL, 0);
        SYSERROR(tmp, "futex");
        printf("WAITED: %d\n", tmp);

        tmp = shmdt(f);
        SYSERROR(tmp, "shmdt");

        exit(0);
    }

    And then this in the other shell:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <sys/syscall.h>
    #include <linux/futex.h>

    #define SYSERROR(X, Y) \
    do { if ((long)(X) == -1L) { perror(Y); exit(1); }} while(0)

    /* same helper as in the waiter: glibc provides no futex() wrapper */
    static int futex(int *uaddr, int op, int val, void *timeout,
                     int *uaddr2, int val3)
    {
        return syscall(SYS_futex, uaddr, op, val, timeout, uaddr2, val3);
    }

    int main(void)
    {
        int shmid, tmp, *f;

        shmid = shmget(23, 4, IPC_CREAT|0666);
        SYSERROR(shmid, "shmget");

        f = shmat(shmid, NULL, 0);
        SYSERROR(f, "shmat");

        (*f)++;
        printf("WAKE: %p{%x}\n", f, *f);
        tmp = futex(f, FUTEX_WAKE, 1, NULL, NULL, 0);
        SYSERROR(tmp, "futex");
        printf("WOKE: %d\n", tmp);

        tmp = shmdt(f);
        SYSERROR(tmp, "shmdt");

        exit(0);
    }

    The first program will set up a SYSV IPC SHM segment and wait on a futex in it
    for the number at the start to change. The second program will increment that
    number and wake the first program up. This leads to output of the form:

    SHELL 1                 SHELL 2
    ======================= =======================
    # /dowait
    WAIT: 0xc32ac000{0}
                            # /dowake
                            WAKE: 0xc32ac000{1}
    WAITED: 0               WOKE: 1

    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Make mremap() partially work for NOMMU kernels. It may resize a VMA in place
    provided that the new size does not exceed the size of the slab object in
    which the storage the VMA refers to is allocated. Shareable VMAs may not be
    resized.

    Moving VMAs (as permitted by MREMAP_MAYMOVE) is not currently supported.

    This patch also makes use of the fact that the VMA list is now ordered to cut
    the search of it short when possible.
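
    Below is a minimal userspace sketch of the in-place resize this permits
    (the sizes are illustrative, and whether the grow succeeds depends on the
    slab object backing the region):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    int main(void)
    {
        void *p, *q;

        p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); exit(1); }

        /* try to grow in place; MREMAP_MAYMOVE is left out since moving
         * is not supported */
        q = mremap(p, 4096, 8192, 0);
        if (q == MAP_FAILED)
            perror("mremap");   /* e.g. new size exceeds the backing object */
        else
            printf("resized at %p\n", q);

        exit(0);
    }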

    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Order the per-mm_struct VMA list by address so that searching it can be cut
    short when the appropriate address has been exceeded.
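
    The idea can be sketched with an ordinary singly-linked list kept sorted
    by start address (the structure and names below are illustrative, not the
    kernel's):

    struct region {
        unsigned long start, end;       /* covers [start, end) */
        struct region *next;            /* list kept sorted by start */
    };

    /* find the region containing addr; the ordering lets us stop early */
    static struct region *lookup(struct region *head, unsigned long addr)
    {
        struct region *r;

        for (r = head; r; r = r->next) {
            if (r->start > addr)
                break;                  /* later entries start even higher */
            if (addr < r->end)
                return r;
        }
        return NULL;
    }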

    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Permit ptrace to modify a section that's non-shared but is marked
    unwritable, such as is obtained by mapping the text segment of an ELF-FDPIC
    executable binary into a process that's being ptraced[*].

    [*] Under NOMMU conditions ptrace causes read-only MAP_PRIVATE mmaps to become
    totally private copies, because if a private mapping were actually shared
    then a debugger setting breakpoints in it could potentially crash
    other processes.

    This is done by using the VM_MAYWRITE flag rather than the VM_WRITE flag
    when deciding whether to permit a write.

    Without this patch a debugger can't set breakpoints in the mapped text
    sections of executables that are mapped read-only private, even if the
    mmap() syscall has taken a private copy because PT_PTRACED is set.

    In addition, VM_MAYREAD is used instead of VM_READ for similar reasons.
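
    The distinction can be illustrated with a small standalone mock (the flag
    values mirror the kernel's VM_* definitions, but the helper and the test
    are made up for illustration):

    #include <stdio.h>

    #define VM_READ     0x00000001
    #define VM_WRITE    0x00000002
    #define VM_MAYREAD  0x00000010
    #define VM_MAYWRITE 0x00000020

    /* permit a ptrace write if the mapping may become writable, not only
     * if it currently is -- i.e. test VM_MAYWRITE rather than VM_WRITE */
    static int may_ptrace_write(unsigned long vm_flags)
    {
        return (vm_flags & VM_MAYWRITE) != 0;
    }

    int main(void)
    {
        /* a debugger's private copy of a read-only text segment: readable,
         * not currently writable, but allowed to become writable */
        unsigned long text_copy = VM_READ | VM_MAYREAD | VM_MAYWRITE;

        printf("breakpoint write permitted: %d\n",
               may_ptrace_write(text_copy));
        return 0;
    }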

    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Check the VMA protections in get_user_pages() against what's being asked.

    This checks that we don't accidentally write to a non-writable VMA or
    permit an I/O mapping VMA to be accessed (which may lack page structs).

    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • On a NOMMU arch, running "cat /proc/self/mem" reads data from physical
    address 0. This behavior differs from MMU arches: on IA32 the message
    "cat: /proc/self/mem: Input/output error" is reported instead.

    The root cause is that the NOMMU version of get_user_pages() does not
    validate the start address. The following patch solves this issue.

    Signed-off-by: Sonic Zhang
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sonic Zhang
     
  • Use find_vma() in the NOMMU version of access_process_vm() rather than
    reimplementing it.

    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Check that access_process_vm() is accessing a valid mapping in the target
    process.

    This limits ptrace() accesses and accesses through /proc/<pid>/maps to only
    those regions actually mapped by a program.

    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • The function is exported but not used from anywhere else. It's also marked as
    "not for driver use" so no one out there should really care.

    Signed-off-by: Rolf Eike Beer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rolf Eike Beer
     
  • The empty line between the short description and the first argument
    description causes a section to appear twice in the generated manpage.
    Also the short description should really be short: the script can't handle
    multiple lines.

    Signed-off-by: Rolf Eike Beer
    Acked-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rolf Eike Beer
     
  • Use NULL instead of 0 for pointer value, eliminate sparse warnings.

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Implement do_no_pfn() for handling mapping of memory without a struct page
    backing it. This avoids creating fake page table entries for regions which
    are not backed by real memory.

    This feature is used by the MSPEC driver and other users, where it is
    highly undesirable to have a struct page sitting behind the page (for
    instance if the page is accessed in cached mode via the struct page in
    parallel to the driver accessing it uncached, which can result in data
    corruption on some architectures, such as ia64).

    This version uses specific NOPFN_{SIGBUS,OOM} return values, rather than
    expecting all negative pfn values to be errors. It also BUGs on COW
    mappings, as those would not work with the VM.
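
    A hedged sketch of what a driver-side handler might look like under this
    scheme (the "mydrv" names and aperture variables are hypothetical, not
    the MSPEC driver's code):

    #include <linux/mm.h>

    static unsigned long mydrv_aperture_base;  /* physical base, set at probe */
    static unsigned long mydrv_aperture_size;  /* size in bytes, set at probe */

    /* fault handler for a VM_PFNMAP mapping with no struct page behind it */
    static unsigned long mydrv_nopfn(struct vm_area_struct *vma,
                                     unsigned long address)
    {
        unsigned long offset = address - vma->vm_start;

        if (offset >= mydrv_aperture_size)
            return NOPFN_SIGBUS;        /* fault outside the device window */

        /* hand back a raw pfn; do_no_pfn() builds the pte from it */
        return (mydrv_aperture_base + offset) >> PAGE_SHIFT;
    }

    static struct vm_operations_struct mydrv_vm_ops = {
        .nopfn = mydrv_nopfn,
    };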

    [akpm@osdl.org: micro-optimise]
    Signed-off-by: Jes Sorensen
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jes Sorensen
     
  • Now that we have the node in the hot zone of struct zone we can avoid
    accessing zone_pgdat in zone_statistics.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • We do not need to allocate pagesets for unpopulated zones.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Add the node in order to optimize zone_to_nid.

    Signed-off-by: Christoph Lameter
    Acked-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This patch ensures that the slab node lists in the NUMA case only contain
    slabs that belong to that specific node. All slab allocations use
    GFP_THISNODE when calling into the page allocator. If an allocation fails
    then we fall back in the slab allocator according to the zonelists appropriate
    for a certain context.

    This allows a replication of the behavior of alloc_pages() and
    alloc_pages_node() in the slab layer.

    Currently allocations requested from the page allocator may be redirected via
    cpusets to other nodes. This results in remote pages on nodelists and that in
    turn results in interrupt latency issues during cache draining. Plus the slab
    is handing out memory as local when it is really remote.

    Fallback for slab memory allocations will occur within the slab allocator and
    not in the page allocator. This is necessary in order to be able to use the
    existing pools of objects on the nodes that we fall back to before adding more
    pages to a slab.

    The fallback function ensures that the nodes we fall back to obey the cpuset
    restrictions of the current context. We do not allocate objects from outside
    of the current cpuset context as was done before.

    Note that the implementation of locality constraints within the slab allocator
    requires importing logic from the page allocator. This is a mishmash that is
    not that great. Other allocators (uncached allocator, vmalloc, huge pages)
    face similar problems and have similar minimal reimplementations of the basic
    fallback logic of the page allocator. There is another way of implementing a
    slab by avoiding per-node lists (see modular slab) but this won't work within
    the existing slab.

    V1->V2:
    - Use NUMA_BUILD to avoid #ifdef CONFIG_NUMA
    - Exploit GFP_THISNODE being 0 in the NON_NUMA case to avoid another
    #ifdef

    [akpm@osdl.org: build fix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The NUMA_BUILD constant is always available and will be set to 1 on NUMA
    builds. That way, checks that are valid only under CONFIG_NUMA can easily be
    done without #ifdef CONFIG_NUMA.

    F.e.

    if (NUMA_BUILD && <numa specific condition>) {
    ...
    }
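
    As a toy userspace illustration of why this works (NUMA_BUILD here is a
    stand-in for the kernel constant, and node_of() is made up):

    #include <stdio.h>

    #define NUMA_BUILD 0                    /* as on a !CONFIG_NUMA build */

    static int node_of(int cpu) { return cpu / 4; }     /* made-up topology */

    int main(void)
    {
        int remote = 0;

        /* the branch is still compiled and type-checked, but with
         * NUMA_BUILD being 0 the compiler eliminates it entirely,
         * which is what lets us drop the #ifdef */
        if (NUMA_BUILD && node_of(5) != node_of(0))
            remote++;

        printf("remote accesses counted: %d\n", remote);
        return 0;
    }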

    [akpm: not a thing we'd normally do, but CONFIG_NUMA is special: it is
    causing ifdef explosion in the core kernel, so let's see if this is a
    comfortable way in which to control that]

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • On larger systems, the amount of output dumped on the console when you do
    SysRq-M is beyond insane. This patch tries to reduce it somewhat; even
    with the smaller NUMA systems that have hit the desktop this seems to
    be a fair thing to do.

    The philosophy I have taken is as follows:
    1) If a zone is empty, don't report it; we don't need yet another line
    telling us so. The information is available anyway, since one can look
    up how many zones were initialized in the first place.
    2) Put as much information on a line as possible; if it can be done
    in one line rather than two, then do it in one. I tried to format
    the temperature stuff for easy reading.

    Change show_free_areas() to not print lines for empty zones. If no zone
    output is printed, the zone is empty. This reduces the number of lines
    dumped to the console in sysrq on a large system by several thousand lines.

    Change the zone temperature printouts to use one line per CPU instead of
    two lines (one hot, one cold). On a 1024 CPU, 1024 node system, this
    reduces the console output by over a million lines of output.

    While this is a bigger problem on large NUMA systems, it is also applicable
    to smaller desktop sized and mid range NUMA systems.

    Old format:

    Mem-info:
    Node 0 DMA per-cpu:
    cpu 0 hot: high 42, batch 7 used:24
    cpu 0 cold: high 14, batch 3 used:1
    cpu 1 hot: high 42, batch 7 used:34
    cpu 1 cold: high 14, batch 3 used:0
    cpu 2 hot: high 42, batch 7 used:0
    cpu 2 cold: high 14, batch 3 used:0
    cpu 3 hot: high 42, batch 7 used:0
    cpu 3 cold: high 14, batch 3 used:0
    cpu 4 hot: high 42, batch 7 used:0
    cpu 4 cold: high 14, batch 3 used:0
    cpu 5 hot: high 42, batch 7 used:0
    cpu 5 cold: high 14, batch 3 used:0
    cpu 6 hot: high 42, batch 7 used:0
    cpu 6 cold: high 14, batch 3 used:0
    cpu 7 hot: high 42, batch 7 used:0
    cpu 7 cold: high 14, batch 3 used:0
    Node 0 DMA32 per-cpu: empty
    Node 0 Normal per-cpu: empty
    Node 0 HighMem per-cpu: empty
    Node 1 DMA per-cpu:
    [snip]
    Free pages: 5410688kB (0kB HighMem)
    Active:9536 inactive:4261 dirty:6 writeback:0 unstable:0 free:338168 slab:1931 mapped:1900 pagetables:208
    Node 0 DMA free:1676304kB min:3264kB low:4080kB high:4896kB active:128048kB inactive:61568kB present:1970880kB pages_scanned:0 all_unreclaimable? no
    lowmem_reserve[]: 0 0 0 0
    Node 0 DMA32 free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
    lowmem_reserve[]: 0 0 0 0
    Node 0 Normal free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
    lowmem_reserve[]: 0 0 0 0
    Node 0 HighMem free:0kB min:512kB low:512kB high:512kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
    lowmem_reserve[]: 0 0 0 0
    Node 1 DMA free:1951728kB min:3280kB low:4096kB high:4912kB active:5632kB inactive:1504kB present:1982464kB pages_scanned:0 all_unreclaimable? no
    lowmem_reserve[]: 0 0 0 0
    ....

    New format:

    Mem-info:
    Node 0 DMA per-cpu:
    CPU 0: Hot: hi: 42, btch: 7 usd: 41 Cold: hi: 14, btch: 3 usd: 2
    CPU 1: Hot: hi: 42, btch: 7 usd: 40 Cold: hi: 14, btch: 3 usd: 1
    CPU 2: Hot: hi: 42, btch: 7 usd: 0 Cold: hi: 14, btch: 3 usd: 0
    CPU 3: Hot: hi: 42, btch: 7 usd: 0 Cold: hi: 14, btch: 3 usd: 0
    CPU 4: Hot: hi: 42, btch: 7 usd: 0 Cold: hi: 14, btch: 3 usd: 0
    CPU 5: Hot: hi: 42, btch: 7 usd: 0 Cold: hi: 14, btch: 3 usd: 0
    CPU 6: Hot: hi: 42, btch: 7 usd: 0 Cold: hi: 14, btch: 3 usd: 0
    CPU 7: Hot: hi: 42, btch: 7 usd: 0 Cold: hi: 14, btch: 3 usd: 0
    Node 1 DMA per-cpu:
    [snip]
    Free pages: 5411088kB (0kB HighMem)
    Active:9558 inactive:4233 dirty:6 writeback:0 unstable:0 free:338193 slab:1942 mapped:1918 pagetables:208
    Node 0 DMA free:1677648kB min:3264kB low:4080kB high:4896kB active:129296kB inactive:58864kB present:1970880kB pages_scanned:0 all_unreclaimable? no
    lowmem_reserve[]: 0 0 0 0
    Node 1 DMA free:1948448kB min:3280kB low:4096kB high:4912kB active:6864kB inactive:3536kB present:1982464kB pages_scanned:0 all_unreclaimable? no
    lowmem_reserve[]: 0 0 0 0

    Signed-off-by: Jes Sorensen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jes Sorensen
     
  • kmalloc_node() falls back to ___cache_alloc() under certain conditions and
    at that point memory policies may be applied redirecting the allocation
    away from the current node. Therefore kmalloc_node(...,numa_node_id()) or
    kmalloc_node(...,-1) may not return memory from the local node.

    Fix this by doing the policy check in __cache_alloc() instead of
    ____cache_alloc().

    This version here is a cleanup of Kiran's patch.

    - Tested on ia64.
    - Extra material removed.
    - Consolidate the exit path if alternate_node_alloc() returned an object.

    [akpm@osdl.org: warning fix]
    Signed-off-by: Alok N Kataria
    Signed-off-by: Ravikiran Thirumalai
    Signed-off-by: Shai Fultheim
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Clean up the invalidate code, and use a common function to safely remove
    the page from pagecache.

    Signed-off-by: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • The VM is supposed to minimise the number of pages which get written off the
    LRU (for IO scheduling efficiency, and for high reclaim-success rates). But
    we don't actually have a clear way of showing how true this is.

    So add `nr_vmscan_write' to /proc/vmstat and /proc/zoneinfo - the number of
    pages which have been written by the vm scanner in this zone and globally.

    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Arch-independent zone-sizing determines the size of a node
    (pgdat->node_spanned_pages) based on the physical memory that was
    registered by the architecture. However, when
    CONFIG_MEMORY_HOTPLUG_RESERVE is set, the architecture expects that the
    spanned_pages will be much larger and that mem_map will be allocated that
    is used later during memory hot-add.

    This patch allows an architecture that sets CONFIG_MEMORY_HOTPLUG_RESERVE
    to call push_node_boundaries() which will set the node beginning and end to
    at *least* the requested boundary.

    Cc: Dave Hansen
    Cc: Andy Whitcroft
    Cc: Andi Kleen
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: "Keith Mannthey"
    Cc: "Luck, Tony"
    Cc: KAMEZAWA Hiroyuki
    Cc: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • absent_pages_in_range() made the assumption that users of the API would not
    care about holes beyond the end of physical memory. This was not the
    case. This patch will account for ranges outside of physical memory as
    holes correctly.

    Cc: Dave Hansen
    Cc: Andy Whitcroft
    Cc: Andi Kleen
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: "Keith Mannthey"
    Cc: "Luck, Tony"
    Cc: KAMEZAWA Hiroyuki
    Cc: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The x86_64 code accounted for memmap and some portions of the DMA zone as
    holes. This was because those areas would never be reclaimed and accounting
    for them as memory affects min watermarks. This patch will account for the
    memmap as a memory hole. Architectures may optionally use set_dma_reserve()
    if they wish to account for a portion of memory in ZONE_DMA as a hole.

    Signed-off-by: Mel Gorman
    Cc: Dave Hansen
    Cc: Andy Whitcroft
    Cc: Andi Kleen
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: "Keith Mannthey"
    Cc: "Luck, Tony"
    Cc: KAMEZAWA Hiroyuki
    Cc: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • At a basic level, architectures define structures to record where active
    ranges of page frames are located. Once located, the code to calculate zone
    sizes and holes in each architecture is very similar. Some of this zone and
    hole sizing code is difficult to read for no good reason. This set of patches
    eliminates the similar-looking architecture-specific code.

    The patches introduce a mechanism where architectures register where the
    active ranges of page frames are with add_active_range(). When all areas have
    been discovered, free_area_init_nodes() is called to initialise the pgdat and
    zones. The zone sizes and holes are then calculated in an architecture
    independent manner.

    Patch 1 introduces the mechanism for registering and initialising PFN ranges
    Patch 2 changes ppc to use the mechanism - 139 arch-specific LOC removed
    Patch 3 changes x86 to use the mechanism - 136 arch-specific LOC removed
    Patch 4 changes x86_64 to use the mechanism - 74 arch-specific LOC removed
    Patch 5 changes ia64 to use the mechanism - 52 arch-specific LOC removed
    Patch 6 accounts for mem_map as a memory hole as the pages are not reclaimable.
    It adjusts the watermarks slightly.

    Tony Luck has successfully tested for ia64 on Itanium with tiger_defconfig,
    gensparse_defconfig and defconfig. Bob Picco has also tested and debugged on
    IA64. Jack Steiner successfully boot tested on a mammoth SGI IA64-based
    machine. These were on patches against 2.6.17-rc1 and release 3 of these
    patches but there have been no ia64-changes since release 3.

    There are differences in the zone sizes for x86_64 as the arch-specific code
    for x86_64 accounts the kernel image and the starting mem_maps as memory holes
    but the architecture-independent code accounts the memory as present.

    The big benefit of this set of patches is a sizable reduction of
    architecture-specific code, some of which is very hairy. There should be a
    greater reduction when other architectures use the same mechanisms for zone
    and hole sizing but I lack the hardware to test on.

    Additional credit:
    Dave Hansen for the initial suggestion and comments on early patches
    Andy Whitcroft for reviewing early versions and catching numerous
    errors
    Tony Luck for testing and debugging on IA64
    Bob Picco for fixing bugs related to pfn registration, reviewing a
    number of patch revisions, providing a number of suggestions
    on future direction and testing heavily
    Jack Steiner and Robin Holt for testing on IA64 and clarifying
    issues related to memory holes
    Yasunori for testing on IA64
    Andi Kleen for reviewing and feeding back about x86_64
    Christian Kujau for providing valuable information related to ACPI
    problems on x86_64 and testing potential fixes

    This patch:

    Define the structure to represent an active range of page frames within a node
    in an architecture independent manner. Architectures are expected to register
    active ranges of PFNs using add_active_range(nid, start_pfn, end_pfn) and call
    free_area_init_nodes() passing the PFNs of the end of each zone.
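
    A hedged sketch of how an architecture's setup code might use the new
    interface (everything prefixed "my_" and the zone boundaries shown are
    placeholders, not any real architecture's code):

    /* called from the arch's memory setup once RAM has been discovered */
    void __init my_arch_zone_init(void)
    {
        unsigned long max_zone_pfns[MAX_NR_ZONES] = { 0 };
        int nid;

        /* register where the real memory is, one call per active PFN range */
        for (nid = 0; nid < my_nr_nodes; nid++)
            add_active_range(nid, my_node_start_pfn[nid], my_node_end_pfn[nid]);

        /* then let the arch-independent code size the zones and the holes */
        max_zone_pfns[ZONE_DMA]    = MAX_DMA_PFN;       /* placeholder limit */
        max_zone_pfns[ZONE_NORMAL] = my_max_pfn;        /* placeholder end of RAM */
        free_area_init_nodes(max_zone_pfns);
    }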

    Signed-off-by: Mel Gorman
    Signed-off-by: Bob Picco
    Cc: Dave Hansen
    Cc: Andy Whitcroft
    Cc: Andi Kleen
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: "Keith Mannthey"
    Cc: "Luck, Tony"
    Cc: KAMEZAWA Hiroyuki
    Cc: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • un-, de-, -free, -destroy, -exit, etc. functions should in general return
    void.

    Also, there is very little that, say, filesystem driver code can do upon a
    failed kmem_cache_destroy(). If it is decided to BUG in this case, the BUG
    should be put in generic code instead.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • * Roughly half of the callers already do it by not checking the return value
    * Code in drivers/acpi/osl.c does the following to be sure:

    (void)kmem_cache_destroy(cache);

    * Those who do check it printk something; however, slab_error() has already
    printed the name of the failed cache.
    * XFS BUGs on a failed kmem_cache_destroy(), which is not a decision a
    low-level filesystem driver should make. Converted to ignore it.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

26 Sep, 2006

13 commits

  • Clean up mm/page_alloc.c#mark_free_pages() and make it avoid clearing
    PageNosaveFree for PageNosave pages. This allows us to get rid of an ugly
    hack in kernel/power/snapshot.c#copy_data_pages().

    Additionally, the page-copying loop in copy_data_pages() is moved to an
    inline function.

    Signed-off-by: Rafael J. Wysocki
    Cc: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • Implement async reads for swsusp resuming.

    Crufty old PIII testbox:
    15.7 MB/s -> 20.3 MB/s

    Sony Vaio:
    14.6 MB/s -> 33.3 MB/s

    I didn't implement the post-resume bio_set_pages_dirty(). I don't really
    understand why resume needs to run set_page_dirty() against these pages.

    It might be a worry that this code modifies PG_Uptodate, PG_Error and
    PG_Locked against the image pages. Can this possibly affect the resumed-into
    kernel? Hopefully not, if we're atomically restoring its mem_map?

    Cc: Pavel Machek
    Cc: "Rafael J. Wysocki"
    Cc: Jens Axboe
    Cc: Laurent Riffard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Switch the swsusp writeout code from 4k-at-a-time to 4MB-at-a-time.

    Crufty old PIII testbox:
    12.9 MB/s -> 20.9 MB/s

    Sony Vaio:
    14.7 MB/s -> 26.5 MB/s

    The implementation is crude. A better one would use larger BIOs, but wouldn't
    gain any performance.

    The memcpys will be mostly pipelined with the IO and basically come for free.

    The ENOMEM path has not been tested. It should be.

    Cc: Pavel Machek
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • There are many places where we need to determine the node of a zone.
    Currently we use a difficult-to-read sequence of pointer dereferences.
    Put that into an inline function and use it throughout the VM. Maybe we can
    find a way to optimize the lookup in the future.
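
    The helper amounts to something like the following (a sketch; the later
    "add the node" patch in this series caches the node id in struct zone so
    the pgdat dereference can be avoided):

    static inline int zone_to_nid(struct zone *zone)
    {
        return zone->zone_pgdat->node_id;
    }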

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • I found two locations in hugetlb.c where we chase pointers instead of using
    page_to_nid(). page_to_nid() is more efficient and can get the node directly
    from the page flags.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Update the comments for __oom_kill_task() to reflect the code changes.

    Signed-off-by: Ram Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ram Gupta
     
  • Minor performance fix.

    If we reclaimed enough slab pages from a zone then we can avoid going off
    node with the current allocation. Take care of updating nr_reclaimed when
    reclaiming from the slab.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Currently one can enable slab reclaim by setting an explicit option in
    /proc/sys/vm/zone_reclaim_mode. Slab reclaim is then used as a final
    option if the freeing of unmapped file backed pages is not enough to free
    enough pages to allow a local allocation.

    However, that means that the slab can grow excessively and that most memory
    of a node may be used by slabs. We have had a case where a machine with
    46GB of memory was using 40-42GB for slab. Zone reclaim was effective in
    dealing with pagecache pages. However, slab reclaim was only done during
    global reclaim (which is a bit rare on NUMA systems).

    This patch implements slab reclaim during zone reclaim. Zone reclaim
    occurs if there is a danger of an off node allocation. At that point we

    1. Shrink the per node page cache if the number of pagecache
    pages is more than min_unmapped_ratio percent of pages in a zone.

    2. Shrink the slab cache if the number of the node's reclaimable slab pages
    (the patch depends on an earlier one that implements that counter)
    is more than min_slab_ratio (a new /proc/sys/vm tunable).

    The shrinking of the slab cache is a bit problematic since it is not node
    specific. So we simply calculate what point in the slab we want to reach
    (current per-node slab use minus the number of pages that need to be
    allocated) and then repeatedly run the global reclaim until that is
    unsuccessful or we have reached the limit. I hope we will have zone-based
    slab reclaim at some point which will make that easier.

    The default for min_slab_ratio is 5%.

    Also remove the slab option from /proc/sys/vm/zone_reclaim_mode.

    [akpm@osdl.org: cleanups]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Remove the atomic counter for slab_reclaim_pages and replace the counter
    and NR_SLAB with two ZVC counters that account for unreclaimable and
    reclaimable slab pages: NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE.

    Change the check in vmscan.c to refer to NR_SLAB_RECLAIMABLE. The
    intent seems to be to check for slab pages that could be freed.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • *_pages is a better description of the role of the variable.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The allocpercpu functions __alloc_percpu() and __free_percpu() are heavily
    using the slab allocator. However, they are conceptually independent of the
    slab allocator. This also simplifies SLOB (at this point slob may be broken
    in mm; this should fix it).

    Signed-off-by: Christoph Lameter
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • If a zone is unpopulated then we do not need to check for pages that are to
    be drained, nor for vm counters that may need to be updated.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • free_one_page() currently adds the page to a fake list and calls
    free_pages_bulk(). free_pages_bulk() takes it off again and then calls
    __free_one_page().

    Make free_one_page() go directly to __free_one_page(). This saves putting
    the page on and off a list, and a temporary list in free_one_page() for
    higher-order pages.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter