21 Feb, 2007

3 commits

  • The alien cache is a per cpu, per node array allocated for every slab on the
    system. Currently we size this array for all nodes that the kernel can
    support. For IA64 this is 1024 nodes, so we allocate an array with 1024
    objects even if we only boot a system with 4 nodes.

    This patch uses "nr_node_ids" to determine the number of possible nodes
    supported by a hardware configuration and only allocates an alien cache
    sized for possible nodes.

    The initialization of nr_node_ids occurred too late relative to the bootstrap
    of the slab allocator, so I moved setup_nr_node_ids() into
    free_area_init_nodes().

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
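
    The following standalone C sketch illustrates the sizing change described
    above; MAX_NUMNODES, nr_node_ids and the slot structure are simplified
    stand-ins for the kernel's types, and all numbers are illustrative only.

    #include <stdio.h>
    #include <stdlib.h>

    #define MAX_NUMNODES 1024            /* compile-time maximum, e.g. IA64 */

    struct alien_slot { void *entries; unsigned int avail; };

    int main(void)
    {
            int nr_node_ids = 4;         /* nodes actually possible on this boot */

            /* Old scheme: one slot for every node the kernel could ever support. */
            size_t old_size = MAX_NUMNODES * sizeof(struct alien_slot);

            /* New scheme: size the per-slab array by the possible node ids only. */
            struct alien_slot *alien = calloc(nr_node_ids, sizeof(*alien));
            size_t new_size = nr_node_ids * sizeof(*alien);

            printf("alien array per slab: %zu bytes instead of %zu bytes\n",
                   new_size, old_size);
            free(alien);
            return 0;
    }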
     
  • highest_possible_node_id() is currently used to calculate the last possible
    node id so that the network subsystem can figure out how to size per node
    arrays.

    I think having the ability to determine the maximum number of nodes in a
    system at runtime is useful, but then we should name this entry
    correspondingly: it should return the number of node ids, and the value
    needs to be set up only once at bootup. The node_possible_map does not
    change after bootup.

    This patch introduces nr_node_ids and replaces the use of
    highest_possible_node_id(). nr_node_ids is calculated at bootup when the
    page allocator's pagesets are initialized.

    [deweerdt@free.fr: fix oops]
    Signed-off-by: Christoph Lameter
    Cc: Neil Brown
    Cc: Trond Myklebust
    Signed-off-by: Frederik Deweerdt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
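
    A toy, single-word version of the idea: compute the number of node ids once
    from a "possible nodes" bitmap (the kernel walks node_possible_map; the
    bitmap value and helper below are made up for illustration).

    #include <stdio.h>

    /* Toy stand-in for node_possible_map: bit i set => node i may exist. */
    static unsigned long node_possible_map = 0x0bUL;    /* nodes 0, 1 and 3 */

    /* nr_node_ids = highest possible node id + 1, computed once at bootup. */
    static int count_node_ids(unsigned long map)
    {
            int highest = 0;

            for (int node = 0; node < (int)(8 * sizeof(map)); node++)
                    if (map & (1UL << node))
                            highest = node;
            return highest + 1;
    }

    int main(void)
    {
            int nr_node_ids = count_node_ids(node_possible_map);

            printf("nr_node_ids = %d\n", nr_node_ids);   /* prints 4 */
            return 0;
    }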
     
  • bind_zonelist() can create a zero-length zonelist if there is a
    memory-less node. This patch checks the length of the zonelist; if the
    length is 0, it returns -EINVAL.

    Tested on ia64/NUMA with a memory-less node.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Andi Kleen
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

17 Feb, 2007

1 commit

  • When NFSD receives a write request, the data is typically in a number of
    1448 byte segments and writev is used to collect them together.

    Unfortunately, generic_file_buffered_write passes these to the filesystem
    one at a time, so a 32K over-write, for example, becomes a series of
    partial-page writes to each page, causing the filesystem to have to
    pre-read those pages - wasted effort.

    generic_file_buffered_write handles one segment of the vector at a time
    because it has to pre-fault in each segment to avoid deadlocks. When
    writing from kernel-space (as nfsd does) this is not an issue, so
    generic_file_buffered_write does not need to break an iovec from nfsd into
    little pieces.

    This patch avoids the splitting when get_fs() is KERNEL_DS, as it is when
    called from NFSd.

    This issue was introduced by commit 6527c2bdf1f833cc18e8f42bd97973d583e4aa83

    Acked-by: Nick Piggin
    Cc: Norman Weathers
    Cc: Vladimir V. Saveliev
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
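
    As a userspace analogy of whole-vector versus per-segment handling (this is
    not the kernel change itself, just an illustration of the difference):

    #include <stdio.h>
    #include <string.h>
    #include <sys/uio.h>
    #include <unistd.h>

    int main(void)
    {
            /* Three small segments, roughly like the 1448 byte chunks nfsd sees. */
            char a[] = "segment one ", b[] = "segment two ", c[] = "segment three\n";
            struct iovec iov[3] = {
                    { .iov_base = a, .iov_len = strlen(a) },
                    { .iov_base = b, .iov_len = strlen(b) },
                    { .iov_base = c, .iov_len = strlen(c) },
            };

            /* Whole-vector submission: one call covers all segments at once ... */
            if (writev(STDOUT_FILENO, iov, 3) < 0)
                    return 1;

            /* ... versus handling one segment at a time, which is what the
             * per-segment loop in generic_file_buffered_write amounted to. */
            for (int i = 0; i < 3; i++)
                    if (write(STDOUT_FILENO, iov[i].iov_base, iov[i].iov_len) < 0)
                            return 1;

            return 0;
    }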
     

13 Feb, 2007

4 commits

  • Many struct inode_operations in the kernel can be "const". Marking them
    const moves them to the .rodata section, which avoids false sharing with
    potentially dirty data. In addition, it catches accidental writes to these
    shared resources at compile time.

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arjan van de Ven
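
    A minimal illustration of the const-ops-table idea (the struct and
    functions below are made up, not actual kernel ones): declaring the table
    const places it in a read-only section and makes accidental stores a
    compile-time error.

    #include <stdio.h>

    /* A toy ops table in the style of struct inode_operations. */
    struct toy_inode_operations {
            int (*lookup)(const char *name);
            int (*create)(const char *name);
    };

    static int toy_lookup(const char *name) { printf("lookup %s\n", name); return 0; }
    static int toy_create(const char *name) { printf("create %s\n", name); return 0; }

    /* const: lands in .rodata; `toy_iops.lookup = NULL;` would not compile. */
    static const struct toy_inode_operations toy_iops = {
            .lookup = toy_lookup,
            .create = toy_create,
    };

    int main(void)
    {
            return toy_iops.lookup("foo") | toy_iops.create("bar");
    }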
     
  • Make mincore work for anon mappings, nonlinear mappings, and migration
    entries. Based on a patch from Linus Torvalds.

    Signed-off-by: Nick Piggin
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
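
    The syscall side of this can be exercised from userspace; a small probe of
    an anonymous mapping (exactly the case this patch covers) might look like
    the following sketch.

    #define _DEFAULT_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            long page = sysconf(_SC_PAGESIZE);
            size_t len = 4 * (size_t)page;
            unsigned char vec[4];

            /* Anonymous private mapping. */
            char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED)
                    return 1;

            p[0] = 1;                      /* fault in only the first page */

            if (mincore(p, len, vec))      /* one status byte per page */
                    return 1;

            for (int i = 0; i < 4; i++)
                    printf("page %d: %s\n", i, (vec[i] & 1) ? "resident" : "not resident");

            munmap(p, len);
            return 0;
    }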
     
  • Add a NOPFN_REFAULT return code for vm_ops->nopfn(), equivalent to
    NOPAGE_REFAULT for vm_ops->nopage(), indicating that the handler requests
    re-execution of the faulting instruction.

    Signed-off-by: Benjamin Herrenschmidt
    Cc: Arnd Bergmann
    Cc: Hugh Dickins
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
     
  • Add a vm_insert_pfn helper, so that ->fault handlers can have nopfn
    functionality by installing their own pte and returning NULL.

    Signed-off-by: Nick Piggin
    Signed-off-by: Benjamin Herrenschmidt
    Cc: Arnd Bergmann
    Cc: Hugh Dickins
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

12 Feb, 2007

24 commits

  • Change a hard-coded constant 0 to the symbolic equivalent NOTIFY_DONE in
    the ratelimit_handler() CPU notifier handler function.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul E. McKenney
     
  • A variety of (mostly) innocuous fixes to the embedded kernel-doc content in
    source files, including:

    * make multi-line initial descriptions single line
    * denote some function names, constants and structs as such
    * change erroneous opening '/*' to '/**' in a few places
    * reword some text for clarity

    Signed-off-by: Robert P. J. Day
    Cc: "Randy.Dunlap"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robert P. J. Day
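
    For reference, a well-formed kernel-doc block looks like the following (the
    function itself is made up): the opening marker uses two asterisks, the
    first line is a single-line description, and each parameter gets an
    @name line.

    #include <stdio.h>

    /**
     * clamp_to_byte - clamp an integer into the 0..255 range
     * @value: the raw value to clamp
     *
     * Values below zero become 0 and values above 255 become 255.
     *
     * Returns the clamped value.
     */
    static int clamp_to_byte(int value)
    {
            if (value < 0)
                    return 0;
            if (value > 255)
                    return 255;
            return value;
    }

    int main(void)
    {
            printf("%d\n", clamp_to_byte(300));    /* prints 255 */
            return 0;
    }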
     
  • Convert all calls to invalidate_inode_pages() into open-coded calls to
    invalidate_mapping_pages().

    Leave the invalidate_inode_pages() wrapper in place for now, marked as
    deprecated.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • It makes no sense to me to export invalidate_inode_pages() and not
    invalidate_mapping_pages(), and I actually need invalidate_mapping_pages()
    because of its range-specification ability...

    akpm: also remove the export of invalidate_inode_pages() by making it an
    inlined wrapper.

    Signed-off-by: Anton Altaparmakov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anton Altaparmakov
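
    The inlined wrapper described in the akpm note presumably reduces to
    something like this kernel-style sketch (not standalone code; the types
    come from the kernel headers):

    /* Compatibility only: whole-file invalidation is just the ranged call
     * over every possible page index. */
    static inline unsigned long invalidate_inode_pages(struct address_space *mapping)
    {
            return invalidate_mapping_pages(mapping, 0, ~0UL);
    }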
     
  • kmem_cache_free() was missing the check for freeing held locks.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • When the kernel unmaps an address range, it needs to transfer PTE state
    into the page struct. Currently, the kernel transfers the access bit via
    mark_page_accessed(). The call to mark_page_accessed() in the unmap path
    doesn't look logically correct.

    At unmap time, calling mark_page_accessed() will cause the page's LRU
    state to be bumped up one step closer to the more recently used state. It
    causes quite a bit of headache in a scenario where a process creates a
    shmem segment, touches a whole bunch of pages, then unmaps it. The
    unmapping takes a long time because mark_page_accessed() will start moving
    pages from the inactive to the active list.

    I'm not too concerned with moving the page from one LRU list to another.
    Sooner or later it might be moved because of multiple mappings from
    various processes. But it just doesn't look logical: when the user asks
    for a range to be unmapped, the intention is that the process is no longer
    interested in these pages. Moving those pages to the active list (or
    bumping their state towards more active) seems to be an over-reaction. It
    also prolongs unmapping latency, which is the core issue I'm trying to
    solve.

    As suggested by Peter, we should still preserve the information from
    pte-young pages, but nothing more.

    Signed-off-by: Peter Zijlstra
    Acked-by: Ken Chen
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ken Chen
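
    A kernel-style sketch of what "preserve the pte-young info, but nothing
    more" could look like in the unmap path (illustrative only; pte_young()
    and SetPageReferenced() are the usual kernel helpers, and the surrounding
    code is omitted):

    /* For each present pte being torn down in the unmap path: */
    if (pte_young(ptent))
            SetPageReferenced(page);    /* keep the referenced information ... */
    /* ... instead of mark_page_accessed(page), which would also push the
     * page one step up the inactive -> active LRU promotion ladder. */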
     
  • A shmem-backed file does not have page writeback, nor does it participate
    in the backing device's dirty or writeback accounting. So using the
    generic __set_page_dirty_nobuffers() for its .set_page_dirty aops method
    is a bit of overkill. It unnecessarily prolongs shm unmap latency.

    For example, on a densely populated large shm segment (several GBs), the
    unmapping operation becomes painfully long, because at unmap time the
    kernel transfers the dirty bit in the PTE into the page struct and to the
    radix tree tag. Tagging the radix tree is particularly expensive because
    it has to traverse the tree from the root to the leaf node for every dirty
    page. What's bothersome is that the radix tree tag is used for page
    writeback; however, shmem is memory backed and there is no page writeback
    for such a filesystem. In the end, we spend all that time tagging the
    radix tree and none of that fancy tagging is ever used. So let's simplify
    this by introducing a new aops method, __set_page_dirty_no_writeback; this
    will speed up shm unmap.

    Signed-off-by: Ken Chen
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ken Chen
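
    The new aops method boils down to setting the page's dirty flag while
    skipping the radix tree tagging and dirty accounting; roughly (kernel-style
    sketch, not standalone code):

    /* For address_spaces that never do writeback (such as shmem): just set
     * the per-page dirty flag; no radix tree tag, no dirty accounting. */
    static int __set_page_dirty_no_writeback(struct page *page)
    {
            if (!PageDirty(page))
                    SetPageDirty(page);
            return 0;
    }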
     
  • As Andi pointed out: CONFIG_GENERIC_ISA_DMA only disables the ISA DMA
    channel management. Other functionality may still expect GFP_DMA to
    provide memory below 16M. So we need to make sure that CONFIG_ZONE_DMA is
    set independently of CONFIG_GENERIC_ISA_DMA. Undo the modifications to
    mm/Kconfig where we made ZONE_DMA dependent on GENERIC_ISA_DMA and set
    these explicitly in each arch's Kconfig.

    Reviews must occur for each arch in order to determine if ZONE_DMA can be
    switched off. It can only be switched off if we know that all devices
    supported by a platform are capable of performing DMA transfers to all of
    memory (some arches already support this: uml, avr32, sh, sh64, parisc and
    IA64/Altix).

    In order to switch ZONE_DMA off conditionally, one would have to establish
    a scheme by which one can assure that no drivers are enabled that are only
    capable of doing I/O to a part of memory, or one needs to provide an
    alternate means of performing an allocation from a specific range of
    memory (like that provided by alloc_pages_range()) and ensure that all
    drivers use that call. In that case the arch's alloc_dma_coherent() may
    need to be modified to call alloc_pages_range() instead of relying on
    GFP_DMA.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Make ZONE_DMA optional in core code.

    - ifdef all code for ZONE_DMA and related definitions following the example
    for ZONE_DMA32 and ZONE_HIGHMEM.

    - Without ZONE_DMA, ZONE_HIGHMEM and ZONE_DMA32 we get a ZONES_SHIFT of
    0.

    - Modify the VM statistics to work correctly without a DMA zone.

    - Modify slab to not create DMA slabs if there is no ZONE_DMA.

    [akpm@osdl.org: cleanup]
    [jdike@addtoit.com: build fix]
    [apw@shadowen.org: Simplify calculation of the number of bits we need for ZONES_SHIFT]
    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Cc: "Luck, Tony"
    Cc: Kyle McMartin
    Cc: Matthew Wilcox
    Cc: James Bottomley
    Cc: Paul Mundt
    Signed-off-by: Andy Whitcroft
    Signed-off-by: Jeff Dike
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
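
    A toy model of the conditional zone list (compile with -DCONFIG_ZONE_DMA to
    include the DMA zone; the enum and the shift calculation are simplified
    stand-ins for the kernel's definitions):

    #include <stdio.h>

    enum zone_type {
    #ifdef CONFIG_ZONE_DMA
            ZONE_DMA,
    #endif
            ZONE_NORMAL,
    #ifdef CONFIG_HIGHMEM
            ZONE_HIGHMEM,
    #endif
            MAX_NR_ZONES
    };

    /* Number of bits needed to encode the zone number in page->flags:
     * with a single zone this is 0, matching the ZONES_SHIFT of 0 above. */
    static int zones_shift(int nr_zones)
    {
            int shift = 0;

            while ((1 << shift) < nr_zones)
                    shift++;
            return shift;
    }

    int main(void)
    {
            printf("zones configured: %d, ZONES_SHIFT: %d\n",
                   (int)MAX_NR_ZONES, zones_shift(MAX_NR_ZONES));
            return 0;
    }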
     
  • This patch simply defines CONFIG_ZONE_DMA for all arches. We later do special
    things with CONFIG_ZONE_DMA after the VM and an arch are prepared to work
    without ZONE_DMA.

    CONFIG_ZONE_DMA can be defined in two ways depending on how an architecture
    handles ISA DMA.

    First if CONFIG_GENERIC_ISA_DMA is set by the arch then we know that the arch
    needs ZONE_DMA because ISA DMA devices are supported. We can catch this in
    mm/Kconfig and do not need to modify arch code.

    Second, arches may use ZONE_DMA in an unknown way. We set CONFIG_ZONE_DMA
    for all arches that do not set CONFIG_GENERIC_ISA_DMA in order to ensure
    backwards compatibility. The arches may later undefine ZONE_DMA if their
    arch code has been verified not to depend on ZONE_DMA.

    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Cc: "Luck, Tony"
    Cc: Kyle McMartin
    Cc: Matthew Wilcox
    Cc: James Bottomley
    Cc: Paul Mundt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This patchset follows up on the earlier work in Andrew's tree to reduce
    the number of zones. The patches allow going to a minimum of 2 zones.
    This one also makes ZONE_DMA optional, and therefore the number of zones
    can be reduced to one.

    ZONE_DMA is usually used for ISA DMA devices. There are a number of reasons
    why we would not want to have ZONE_DMA

    1. Some arches do not need ZONE_DMA at all.

    2. With the advent of IOMMUs, DMA zones are no longer needed.
    The necessity of DMA zones may be drastically reduced
    in the future. This patchset allows a compilation of
    a kernel without that overhead.

    3. Devices that require ISA DMA are rare these days. None of
    my systems has any need for ISA DMA.

    4. The presence of an additional zone unnecessarily complicates
    VM operations because it must be scanned and balancing
    logic must operate on it.

    5. With only ZONE_NORMAL one can reach the situation where
    we have only one zone. This will allow the unrolling of many
    loops in the VM and allows the optimization of various
    code paths in the VM.

    6. Having only a single zone in a NUMA system results in a
    1-1 correspondence between nodes and zones. Various additional
    optimizations to critical VM paths become possible.

    Many systems today can operate just fine with a single zone. If you look at
    what is in ZONE_DMA then one usually sees that nothing uses it. The DMA slabs
    are empty (Some arches use ZONE_DMA instead of ZONE_NORMAL, then ZONE_NORMAL
    will be empty instead).

    On all of my systems (i386, x86_64, ia64) ZONE_DMA is completely empty.
    Why constantly look at an empty zone in /proc/zoneinfo and an empty slab
    in /proc/slabinfo? Non-i386 systems also frequently have no need for
    ZONE_DMA, and the zones stay empty.

    The patchset was tested on i386 (UP / SMP), x86_64 (UP, NUMA) and ia64 (NUMA).

    The RFC posted earlier (see
    http://marc.theaimsgroup.com/?l=linux-kernel&m=115231723513008&w=2) had
    lots of #ifdefs in it. An effort has been made to minimize the number of
    #ifdefs and make this as compact as possible. The job was made much easier
    by the ongoing efforts of others to extract common arch-specific
    functionality.

    I have been running this for a while now on my desktop, and finally Linux
    is using all my available RAM instead of leaving the 16MB in ZONE_DMA
    untouched:

    christoph@pentium940:~$ cat /proc/zoneinfo
    Node 0, zone Normal
    pages free 4435
    min 1448
    low 1810
    high 2172
    active 241786
    inactive 210170
    scanned 0 (a: 0 i: 0)
    spanned 524224
    present 524224
    nr_anon_pages 61680
    nr_mapped 14271
    nr_file_pages 390264
    nr_slab_reclaimable 27564
    nr_slab_unreclaimable 1793
    nr_page_table_pages 449
    nr_dirty 39
    nr_writeback 0
    nr_unstable 0
    nr_bounce 0
    cpu: 0 pcp: 0
    count: 156
    high: 186
    batch: 31
    cpu: 0 pcp: 1
    count: 9
    high: 62
    batch: 15
    vm stats threshold: 20
    cpu: 1 pcp: 0
    count: 177
    high: 186
    batch: 31
    cpu: 1 pcp: 1
    count: 12
    high: 62
    batch: 15
    vm stats threshold: 20
    all_unreclaimable: 0
    prev_priority: 12
    temp_priority: 12
    start_pfn: 0

    This patch:

    In two places in the VM we use ZONE_DMA to refer to the first zone. If
    ZONE_DMA is optional then other zones may be first. So simply replace
    ZONE_DMA with zone 0.

    This also fixes ZONETABLE_PGSHIFT. If we have only a single zone then
    ZONES_PGSHIFT may become 0 because there is no need anymore to encode the zone
    number related to a pgdat. However, we still need a zonetable to index all
    the zones for each node if this is a NUMA system. Therefore define
    ZONETABLE_SHIFT unconditionally as the offset of the ZONE field in page flags.

    [apw@shadowen.org: fix mismerge]
    Acked-by: Christoph Hellwig
    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Cc: "Luck, Tony"
    Cc: Kyle McMartin
    Cc: Matthew Wilcox
    Cc: James Bottomley
    Cc: Paul Mundt
    Signed-off-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Values are available via ZVC sums.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Values are readily available via ZVC per node and global sums.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Function is unnecessary now. We can use the summing features of the ZVCs to
    get the values we need.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • nr_free_pages is now a simple access to a global variable. Make it a
    macro instead of a function.

    nr_free_pages() now requires vmstat.h to be included. There is one
    occurrence in power management where we need to add the include. Directly
    refer to global_page_state() there to clarify why the #include was added.

    [akpm@osdl.org: arm build fix]
    [akpm@osdl.org: sparc64 build fix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
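
    The change presumably amounts to something like the following header sketch
    (kernel-style, not standalone; NR_FREE_PAGES is the ZVC item behind the
    count):

    /* nr_free_pages() is now just a read of the consolidated ZVC counter,
     * so callers must pull in the vmstat declarations. */
    #include <linux/vmstat.h>

    #define nr_free_pages() global_page_state(NR_FREE_PAGES)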
     
  • The global and per zone counter sums are in arrays of longs. Reorder the
    ZVCs so that the most frequently used ZVCs are put into the same
    cacheline. That way calculations of the global, node and per zone vm
    state touch only a single cacheline. This is mostly important for 64 bit
    systems, where one 128 byte cacheline holds only 16 longs.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This again simplifies some of the VM counter calculations through the use
    of the ZVC consolidated counters.

    [michal.k.k.piotrowski@gmail.com: build fix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Michal Piotrowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The dirty ratio used to determine writeback behavior is currently based on
    the total number of pages in the system.

    However, not all pages in the system may be dirtied. Thus the ratio is
    always too low and can never reach 100%. The ratio may be particularly
    skewed if large hugepage allocations, slab allocations or device driver
    buffers make large sections of memory unavailable. In that case we may get
    into a situation in which, for example, the background writeback ratio of
    40% can no longer be reached, which leads to undesired writeback behavior.

    This patchset fixes that issue by determining the ratio based on the actual
    pages that may potentially be dirty. These are the pages on the active and
    the inactive list plus free pages.

    The problem with those counts has so far been that they are expensive to
    calculate, because counts from multiple nodes and multiple zones have to
    be summed up. This patchset makes these counters ZVC counters. This means
    that a current sum per zone, per node and for the whole system is always
    available via global variables and is no longer expensive to calculate.

    The patchset results in some other good side effects:

    - Removal of the various functions that sum up free, active and inactive
    page counts

    - Cleanup of the functions that display information via the proc filesystem.

    This patch:

    The use of a ZVC for nr_inactive and nr_active allows a simplification of some
    counter operations. More ZVC functionality is used for sums etc in the
    following patches.

    [akpm@osdl.org: UP build fix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
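
    A standalone arithmetic sketch of the ratio change (all numbers are made
    up; in the kernel the counts would come from the ZVC sums for free, active
    and inactive pages):

    #include <stdio.h>

    int main(void)
    {
            unsigned long total_pages    = 1000000;  /* all of RAM, in pages */
            unsigned long free_pages     = 100000;
            unsigned long active_pages   = 300000;
            unsigned long inactive_pages = 200000;
            unsigned int dirty_background_ratio = 40;      /* percent */

            /* Old basis: every page in the system, including pages that can
             * never be dirtied (huge pages, slab, driver buffers, ...). */
            unsigned long old_thresh = total_pages * dirty_background_ratio / 100;

            /* New basis: only pages that may actually become dirty. */
            unsigned long dirtyable = free_pages + active_pages + inactive_pages;
            unsigned long new_thresh = dirtyable * dirty_background_ratio / 100;

            printf("background threshold: %lu pages (old) vs %lu pages (new)\n",
                   old_thresh, new_thresh);
            return 0;
    }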
     
  • After do_wp_page has tested page_mkwrite, it must release old_page after
    acquiring page table lock, not before: at some stage that ordering got
    reversed, leaving a (very unlikely) window in which old_page might be
    truncated, freed, and reused in the same position.

    Signed-off-by: Hugh Dickins
    Acked-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • This early break prevents us from displaying info for the vm stats thresholds
    if the zone doesn't have any pages in its per-cpu pagesets.

    So my 800MB i386 box says:

    Node 0, zone DMA
    pages free 2365
    min 16
    low 20
    high 24
    active 0
    inactive 0
    scanned 0 (a: 0 i: 0)
    spanned 4096
    present 4044
    nr_anon_pages 0
    nr_mapped 1
    nr_file_pages 0
    nr_slab_reclaimable 0
    nr_slab_unreclaimable 0
    nr_page_table_pages 0
    nr_dirty 0
    nr_writeback 0
    nr_unstable 0
    nr_bounce 0
    nr_vmscan_write 0
    protection: (0, 868, 868)
    pagesets
    all_unreclaimable: 0
    prev_priority: 12
    start_pfn: 0
    Node 0, zone Normal
    pages free 199713
    min 934
    low 1167
    high 1401
    active 10215
    inactive 4507
    scanned 0 (a: 0 i: 0)
    spanned 225280
    present 222420
    nr_anon_pages 2685
    nr_mapped 1110
    nr_file_pages 12055
    nr_slab_reclaimable 2216
    nr_slab_unreclaimable 1527
    nr_page_table_pages 213
    nr_dirty 0
    nr_writeback 0
    nr_unstable 0
    nr_bounce 0
    nr_vmscan_write 0
    protection: (0, 0, 0)
    pagesets
    cpu: 0 pcp: 0
    count: 152
    high: 186
    batch: 31
    cpu: 0 pcp: 1
    count: 13
    high: 62
    batch: 15
    vm stats threshold: 16
    cpu: 1 pcp: 0
    count: 34
    high: 186
    batch: 31
    cpu: 1 pcp: 1
    count: 10
    high: 62
    batch: 15
    vm stats threshold: 16
    all_unreclaimable: 0
    prev_priority: 12
    start_pfn: 4096

    Just nuke all that search-for-the-first-non-empty-pageset code. Dunno why it
    was there in the first place..

    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • find_min_pfn_for_node() and find_min_pfn_with_active_regions() sort
    early_node_map[] on every call. This is an excessive amount of sorting
    that can be avoided. This patch always searches the whole
    early_node_map[] in find_min_pfn_for_node() instead of returning the
    first value found. The map is then only sorted once, when required.
    Successfully boot tested on a number of machines.

    [akpm@osdl.org: cleanup]
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Use the pointer passed to cache_reap to determine the work pointer and
    consolidate exit paths.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Clean up the __cache_alloc and __cache_alloc_node functions a bit. We no
    longer need to do NUMA_BUILD tricks and the UMA allocation path is much
    simpler. No functional changes in this patch.

    Note: this saves a few kernel text bytes on x86 NUMA builds due to using
    gotos in __cache_alloc_node() and moving the __GFP_THISNODE check into
    fallback_alloc().

    Cc: Andy Whitcroft
    Cc: Christoph Hellwig
    Cc: Manfred Spraul
    Acked-by: Christoph Lameter
    Cc: Paul Jackson
    Signed-off-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     
  • The PageSlab debug check in kfree_debugcheck() is broken for compound
    pages. It is also redundant as we already do BUG_ON for non-slab pages in
    page_get_cache() and page_get_slab() which are always called before we free
    any actual objects.

    Signed-off-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg