26 Sep, 2006

40 commits

  • Currently we can silently drop data if the write to swap fails. It
    usually doesn't result in data corruption, because on page-in the process
    will receive SIGBUS (assuming write failure implies read failure).

    This assumption might or might not be valid.

    This patch avoids discarding the page after a failed write, and prints a
    warning the sysadmin _should_ take to heart: if a lot of swap space
    becomes un-writeable, OOM is not far off.

    Tested by making the write fail 'randomly' once every 50 writes or so.
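    The retained-on-error behaviour can be sketched in a few lines of
    userspace C (all names here are illustrative, not the kernel's; the real
    change lives in the swap writeout completion path):

```c
#include <assert.h>
#include <stdio.h>

/* Toy model of the fix: a page whose swap write fails is re-marked
 * dirty instead of being discarded, so the data stays in memory for
 * a later retry instead of being silently lost. */
struct page { int dirty; };

static void end_swap_write(struct page *page, int write_failed)
{
    if (write_failed) {
        page->dirty = 1;   /* keep the data; don't reclaim this page */
        fprintf(stderr, "Write-error on swap-device: retaining page\n");
    } else {
        page->dirty = 0;   /* a clean copy now lives on swap */
    }
}
```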

    [akpm@osdl.org: printk warning fix]
    Signed-off-by: Peter Zijlstra
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • As explained by Heiko, on s390 (32-bit) ARCH_KMALLOC_MINALIGN is set to
    eight because their common I/O layer allocates data structures that need
    eight-byte alignment. This does not work when CONFIG_SLAB_DEBUG is
    enabled, because kmem_cache_create will override the alignment to
    BYTES_PER_WORD, which is four.

    So change kmem_cache_create to ensure cache alignment is always at minimum
    what the architecture or caller mandates even if slab debugging is enabled.

    Cc: Heiko Carstens
    Cc: Christoph Lameter
    Signed-off-by: Manfred Spraul
    Signed-off-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     
  • lock_page needs the caller to have a reference on the page->mapping inode
    due to sync_page; ergo set_page_dirty_lock is obviously buggy according to
    its comments.

    Solve it by introducing a new lock_page_nosync which does not do a sync_page.

    akpm: unpleasant solution to an unpleasant problem. If it goes wrong it could
    cause great slowdowns while the lock_page() caller waits for kblockd to
    perform the unplug. And if a filesystem has special sync_page() requirements
    (none presently do), permanent hangs are possible.

    otoh, set_page_dirty_lock() is usually (always?) called against userspace
    pages. They are always up-to-date, so there shouldn't be any pending read I/O
    against these pages.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Some users of remove_mapping had been unsafe.

    Modify the remove_mapping precondition to ensure the caller has locked the
    page and obtained the correct mapping. Modify callers to ensure the
    mapping is the correct one.

    [hugh@veritas.com: swapper_space fix]
    Signed-off-by: Nick Piggin
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • These functions are already documented quite well with long comments. Now
    add kerneldoc-style headers to make them turn up in everyone's favorite
    doc format.

    Signed-off-by: Rolf Eike Beer
    Cc: "Randy.Dunlap"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rolf Eike Beer
     
  • This patch splits alloc_percpu() up into two phases, and likewise for
    free_percpu(). This allows clients to limit initial allocations to online
    CPUs, and to populate or depopulate per-cpu data at run time as needed:

    struct my_struct *obj;

    /* initial allocation for online cpu's */
    obj = percpu_alloc(sizeof(struct my_struct), GFP_KERNEL);

    ...

    /* populate per-cpu data for cpu coming online */
    ptr = percpu_populate(obj, sizeof(struct my_struct), GFP_KERNEL, cpu);

    ...

    /* access per-cpu object */
    ptr = percpu_ptr(obj, smp_processor_id());

    ...

    /* depopulate per-cpu data for cpu going offline */
    percpu_depopulate(obj, cpu);

    ...

    /* final removal */
    percpu_free(obj);

    Signed-off-by: Martin Peschke
    Cc: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Peschke
     
  • Add a notifer chain to the out of memory killer. If one of the registered
    callbacks could release some memory, do not kill the process but return and
    retry the allocation that forced the oom killer to run.

    The purpose of the notifier is to add a safety net in the presence of
    memory ballooners. If the resource manager inflated the balloon to a size
    where memory allocations can not be satisfied anymore, it is better to
    deflate the balloon a bit instead of killing processes.

    The implementation for the s390 ballooner is included.

    [akpm@osdl.org: cleanups]
    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Schwidefsky
     
  • I wonder why we need this bitmask indexing into zone->node_zonelists[]?

    We always start with the highest zone and then include all lower zones
    if we build zonelists.

    Are there really cases where we need allocation from ZONE_DMA or
    ZONE_HIGHMEM but not ZONE_NORMAL? It seems that the current implementation
    of highest_zone() makes that already impossible.

    If we go linear on the index then gfp_zone() == highest_zone() and a lot
    of definitions fall by the wayside.

    We can now revert back to the use of gfp_zone() in mempolicy.c ;-)

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • After we have done this we can now do some typing cleanup.

    The memory policy layer keeps a policy_zone that specifies
    the zone that gets memory policies applied. This variable
    can now be of type enum zone_type.

    The check_highest_zone function and the build_zonelists function must
    then also take an enum zone_type parameter.

    Plus there are a number of loops over zones that also should use
    zone_type.

    We run into some trouble at a few points with functions that need a
    zone_type variable to hold -1. Fix that up.

    [pj@sgi.com: fix set_mempolicy() crash]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • There is a check in zonelist_policy() that compares pieces of the bitmap
    obtained from a gfp mask via GFP_ZONETYPES with a zone number.

    The bitmap is an ORed mask of __GFP_DMA, __GFP_DMA32 and __GFP_HIGHMEM.
    The policy_zone is a zone number with the possible values of ZONE_DMA,
    ZONE_DMA32, ZONE_HIGHMEM and ZONE_NORMAL. These are two different domains
    of values.

    For some reason this seemed to work before the zone reduction patchset
    (it definitely works on SGI boxes, since we just have one zone and the
    check cannot fail).

    With the zone reduction patchset this check definitely fails on systems
    with two zones if the system actually has memory in both zones.

    This is because ZONE_NORMAL is selected using no __GFP flag at
    all and thus gfp_zone(gfpmask) == 0. ZONE_DMA is selected when __GFP_DMA
    is set. __GFP_DMA is 0x01. So gfp_zone(gfpmask) == 1.

    policy_zone is set to ZONE_NORMAL (==1) if ZONE_NORMAL and ZONE_DMA are
    populated.

    For ZONE_NORMAL, gfp_zone() yields 0, which is less than
    policy_zone (ZONE_NORMAL), and so policy is not applied to regular memory
    allocations!

    Instead gfp_zone(__GFP_DMA) == 1 which results in policy being applied
    to DMA allocations!

    What we really want in that place is to establish the highest allowable
    zone for a given gfp_mask. If the highest zone is higher than or equal to
    the policy_zone, then memory policies need to be applied. We have such
    a highest_zone() function in page_alloc.c.

    So move the highest_zone() function from mm/page_alloc.c into
    include/linux/gfp.h. On the way we simplify the function and use the new
    zone_type that was also introduced with the zone reduction patchset plus we
    also specify the right type for the gfp flags parameter.
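    The mismatch between the two value domains can be reproduced in a small
    userspace model (a simplified two-zone system; constants and the old/new
    function pair are illustrative):

```c
#include <assert.h>

/* A two-zone machine: ZONE_DMA and ZONE_NORMAL. */
enum zone_type { ZONE_DMA = 0, ZONE_NORMAL = 1 };

#define __GFP_DMA 0x01u

/* Old behaviour: gfp_zone() just masked out the gfp zone bits, so a
 * plain allocation (no flag) gave 0 and __GFP_DMA gave 1 -- the
 * reverse of the zone-number ordering that policy_zone uses. */
static unsigned old_gfp_zone(unsigned gfp_mask)
{
    return gfp_mask & 0x03u;
}

/* Fixed behaviour: return the highest allowable zone number, which
 * lives in the same domain as policy_zone. */
static enum zone_type highest_zone(unsigned gfp_mask)
{
    return (gfp_mask & __GFP_DMA) ? ZONE_DMA : ZONE_NORMAL;
}
```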

    Signed-off-by: Christoph Lameter
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • We cannot check MAX_NR_ZONES since it is not defined in the preprocessor
    anymore.

    So remove the check.

    The maximum number of zones per node for i386 is 3 since i386 does not
    support ZONE_DMA32.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • eventcounters: Do not display counters for zones that are not available on an
    arch

    Do not define or display counters for the DMA32 and the HIGHMEM zone if such
    zones were not configured.

    [akpm@osdl.org: s390 fix]
    [heiko.carstens@de.ibm.com: s390 fix]
    Signed-off-by: Christoph Lameter
    Cc: Martin Schwidefsky
    Signed-off-by: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Make ZONE_HIGHMEM optional

    - ifdef out code and definitions related to CONFIG_HIGHMEM

    - __GFP_HIGHMEM falls back to normal allocations if there is no
    ZONE_HIGHMEM

    - GFP_ZONEMASK becomes 0x01 if there is no DMA32 and no HIGHMEM
    zone.

    [jdike@addtoit.com: build fix]
    Signed-off-by: Jeff Dike
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Make ZONE_DMA32 optional

    - Add #ifdefs around ZONE_DMA32 specific code and definitions.

    - Add a CONFIG_ZONE_DMA32 config option and use it for x86_64,
    which alone needs this zone.

    - Remove the use of CONFIG_DMA_IS_DMA32 and CONFIG_DMA_IS_NORMAL
    for ia64 and fix up the way per node ZVCs are calculated.

    - Fall back to prior GFP_ZONEMASK of 0x03 if there is no
    DMA32 zone.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Use enum for zones and reformat zones dependent information

    Add comments explaining the use of zones and add a zones_t type for zone
    numbers.

    Line up information that will be #ifdefd by the following patches.

    [akpm@osdl.org: comment cleanups]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • page allocator ZONE_HIGHMEM fixups

    1. We do not need to do an #ifdef in si_meminfo since both counters
    in use are zero if !CONFIG_HIGHMEM.

    2. Add #ifdef in si_meminfo_node instead to avoid referencing zone
    information for ZONE_HIGHMEM if we do not have HIGHMEM
    (may not be there after the following patches).

    3. Replace the use of ZONE_HIGHMEM with MAX_NR_ZONES in build_zonelists_node

    4. build_zonelists_node: Remove BUG_ON for ZONE_HIGHMEM. Zone will
    be optional soon and thus BUG_ON cannot be triggered anymore.

    5. init_free_area_core: Replace a use of ZONE_HIGHMEM with MAX_NR_ZONES.

    [akpm@osdl.org: cleanups]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Move totalhigh_pages and nr_free_highpages() into highmem.c/.h

    Move the totalhigh_pages definition into highmem.c/.h. Move the
    nr_free_highpages function into highmem.c

    [yoichi_yuasa@tripeaks.co.jp: build fix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Yoichi Yuasa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Do not display HIGHMEM memory sizes if CONFIG_HIGHMEM is not set.

    Make HIGHMEM-dependent texts and the display of highmem counters
    optional.

    Some texts depend on CONFIG_HIGHMEM.

    Remove those strings, and the display of highmem counter values, if
    CONFIG_HIGHMEM is not set.

    [akpm@osdl.org: remove some ifdefs]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Fix array initialization in lots of arches

    The number of zones may now be reduced from 4 to 2 for many arches. Fix
    the initialization of the zones array for all architectures so that it
    does not initialize a fixed number of elements.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • I keep seeing zones on various platforms that are never used, and wonder
    why we compile support for them into the kernel. Counters show up for
    HIGHMEM and DMA32 that are always zero.

    This patch allows the removal of ZONE_DMA32 for non x86_64 architectures and
    it will get rid of ZONE_HIGHMEM for arches not using highmem (like 64 bit
    architectures). If an arch does not define CONFIG_HIGHMEM then ZONE_HIGHMEM
    will not be defined. Similarly if an arch does not define CONFIG_ZONE_DMA32
    then ZONE_DMA32 will not be defined.

    No current architecture uses all four zones (DMA, DMA32, NORMAL, HIGH)
    that we have now. The patchset will reduce the number of zones for all
    platforms.

    On many platforms that do not have DMA32 or HIGHMEM this will reduce the
    number of zones by 50%. F.e. ia64 only uses DMA and NORMAL.

    Large amounts of memory can be saved for larger systems that may have a
    few hundred NUMA nodes.

    With ZONE_DMA32 and ZONE_HIGHMEM support optional, MAX_NR_ZONES will be 2
    for many non-i386 platforms, and even for i386 without CONFIG_HIGHMEM
    set.

    Tested on ia64, x86_64 and on i386 with and without highmem.

    The patchset consists of 11 patches that follow this message.

    One could go even further than this patchset and also make ZONE_DMA optional
    because some platforms do not need a separate DMA zone and can do DMA to all
    of memory. This could reduce MAX_NR_ZONES to 1. Such a patchset will
    hopefully follow soon.

    This patch:

    Fix strange uses of MAX_NR_ZONES

    Sometimes we use MAX_NR_ZONES - x to refer to a zone. Make that explicit.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • It fixes various coding style issues, especially where spaces are
    superfluous. For example, '*' goes next to the function name.

    Signed-off-by: Franck Bui-Huu
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Franck Bui-Huu
     
  • It also creates get_mapsize() helper in order to make the code more readable
    when it calculates the boot bitmap size.

    Signed-off-by: Franck Bui-Huu
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Franck Bui-Huu
     
  • Signed-off-by: Franck Bui-Huu
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Franck Bui-Huu
     
  • Signed-off-by: Franck Bui-Huu
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Franck Bui-Huu
     
  • Signed-off-by: Franck Bui-Huu
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Franck Bui-Huu
     
  • Signed-off-by: Franck Bui-Huu
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Franck Bui-Huu
     
  • __init in headers is pretty useless because the compiler doesn't check it, and
    they get out of sync relatively frequently. So if you see an __init in a
    header file, it's quite unreliable and you need to check the definition
    anyway.

    Signed-off-by: Franck Bui-Huu
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Franck Bui-Huu
     
  • Address a long standing issue of booting with an initrd on an i386 numa
    system. Currently (and always) the numa kva area is mapped into low memory
    by finding the end of low memory and moving that mark down (thus creating
    space for the kva). The issue with this is that Grub loads initrds into
    this same space, so when the kernel checks the initrd it finds it outside
    max_low_pfn and disables it (it thinks the initrd is not mapped into
    usable memory); thus initrd-enabled kernels can't boot i386 numa :(

    My solution to the problem simply converts the numa kva area to use the
    bootmem allocator to reserve its area (instead of moving the end of low
    memory). Using bootmem allows the kva area to be mapped into more diverse
    addresses (not just the end of low memory) and enables the kva area to be
    mapped below the initrd if present.

    I have tested this patch on numaq(no initrd) and summit(initrd) i386 numa
    based systems.

    [akpm@osdl.org: cleanups]
    Signed-off-by: Keith Mannthey
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    keith mannthey
     
  • This patch makes the following needlessly global functions static:
    - slab.c: kmem_find_general_cachep()
    - swap.c: __page_cache_release()
    - vmalloc.c: __vmalloc_node()

    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • With the tracking of dirty pages properly done now, msync doesn't need to scan
    the PTEs anymore to determine the dirty status.

    From: Hugh Dickins

    In looking to do that, I made some other tidyups: we can remove several
    #includes, and the sys_msync loop termination was not quite right.

    Most of those points are criticisms of the existing sys_msync, not of your
    patch. In particular, the loop termination errors were introduced in 2.6.17:
    I did notice this shortly before it came out, but decided I was more likely to
    get it wrong myself, and make matters worse if I tried to rush a last-minute
    fix in. And it's not terribly likely to go wrong, nor disastrous if it does
    go wrong (may miss reporting an unmapped area; may also fsync file of a
    following vma).

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Wrt. the recent modifications in do_wp_page() Hugh Dickins pointed out:

    "I now realize it's right to the first order (normal case) and to the
    second order (ptrace poke), but not to the third order (ptrace poke
    anon page here to be COWed - perhaps can't occur without intervening
    mprotects)."

    This patch restores the old COW behaviour for anonymous pages.

    Signed-off-by: Peter Zijlstra
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Smallish cleanup to install_page(), could save a memory read (haven't checked
    the asm output) and sure looks nicer.

    Signed-off-by: Peter Zijlstra
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • mprotect() resets the page protections, which could result in extra write
    faults for those pages whose dirty state we track using write faults and are
    dirty already.

    Signed-off-by: Peter Zijlstra
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Now that we can detect writers of shared mappings, throttle them. Avoids OOM
    by surprise.

    Signed-off-by: Peter Zijlstra
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Tracking of dirty pages in shared writeable mmap()s.

    The idea is simple: write protect clean shared writeable pages, catch the
    write-fault, make writeable and set dirty. On page write-back clean all the
    PTE dirty bits and write protect them once again.

    The implementation is a tad harder, mainly because the default
    backing_dev_info capabilities were too loosely maintained. Hence it is not
    enough to test the backing_dev_info for cap_account_dirty.

    The current heuristic is as follows, a VMA is eligible when:
    - it is shared writeable
    (vm_flags & (VM_WRITE|VM_SHARED)) == (VM_WRITE|VM_SHARED)
    - it is not a 'special' mapping
    (vm_flags & (VM_PFNMAP|VM_INSERTPAGE)) == 0
    - the backing_dev_info is cap_account_dirty
    mapping_cap_account_dirty(vma->vm_file->f_mapping)
    - f_op->mmap() didn't change the default page protection
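    The heuristic above can be condensed into a single predicate. The sketch
    below models the backing_dev_info and page-protection conditions as one
    flag, and both the helper name and the flag values are illustrative only:

```c
#include <assert.h>

/* Illustrative flag values; the real ones live in include/linux/mm.h. */
#define VM_WRITE      0x02u
#define VM_SHARED     0x08u
#define VM_PFNMAP     0x400u
#define VM_INSERTPAGE 0x2000u

/* Hypothetical helper combining the eligibility conditions. */
static int vma_wants_dirty_tracking(unsigned vm_flags, int accounts_dirty)
{
    /* must be shared writeable */
    if ((vm_flags & (VM_WRITE | VM_SHARED)) != (VM_WRITE | VM_SHARED))
        return 0;
    /* must not be a 'special' mapping */
    if (vm_flags & (VM_PFNMAP | VM_INSERTPAGE))
        return 0;
    /* backing_dev_info must be cap_account_dirty (modelled as a flag) */
    return accounts_dirty;
}
```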

    Pages from remap_pfn_range() are explicitly excluded because their COW
    semantics are already horrid enough (see vm_normal_page() in do_wp_page())
    and because they don't have a backing store anyway.

    mprotect() is taught about the new behaviour as well. However it overrides
    the last condition.

    Cleaning the pages on write-back is done with page_mkclean(), a new rmap
    call. It can be called on any page, but is currently only implemented for
    mapped pages; if the page is found to be of a VMA that accounts dirty
    pages, it will also wrprotect the PTE.

    Finally, in fs/buffer.c:try_to_free_buffers(), remove clear_page_dirty() from
    under ->private_lock. This seems to be safe, since ->private_lock is used to
    serialize access to the buffers, not the page itself. This is needed because
    clear_page_dirty() will call into page_mkclean() and would thereby violate
    locking order.

    [dhowells@redhat.com: Provide a page_mkclean() implementation for NOMMU]
    Signed-off-by: Peter Zijlstra
    Cc: Hugh Dickins
    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Introduce a VM_BUG_ON, which is turned on with CONFIG_DEBUG_VM. Use it
    in the lightweight, inline refcounting functions; in the PageLRU and
    PageActive checks in vmscan, because they're pretty well confined to
    vmscan; and in the page allocate/free fastpaths, which can be the hottest
    parts of the kernel for kbuilds.

    Unlike BUG_ON, VM_BUG_ON must not be used to execute statements with
    side-effects, and should not be used outside core mm code.
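    A sketch of such a macro, modelled in userspace (the kernel version
    expands to the real BUG_ON rather than abort()):

```c
#include <assert.h>
#include <stdlib.h>

/* With CONFIG_DEBUG_VM the check fires like BUG_ON; without it the
 * macro compiles to nothing, so the condition is never evaluated --
 * which is exactly why it must not carry side-effects. */
#ifdef CONFIG_DEBUG_VM
#define VM_BUG_ON(cond) do { if (cond) abort(); } while (0)
#else
#define VM_BUG_ON(cond) do { } while (0)
#endif
```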

    Signed-off-by: Nick Piggin
    Cc: Hugh Dickins
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Give non-highmem architectures access to the kmap API for the purposes of
    overriding (this is what the attached patch does).

    The proposal is that we should now require all architectures with coherence
    issues to manage data coherence via the kmap/kunmap API. Thus driver
    writers never have to write code like

    kmap(page)
    modify data in page
    flush_kernel_dcache_page(page)
    kunmap(page)

    instead, kmap/kunmap will manage the coherence and driver (and filesystem)
    writers don't need to worry about how to flush between kmap and kunmap.

    For most architectures, the page only needs to be flushed if it was
    actually written to *and* there are user mappings of it, so the best
    implementation looks to be: clear the page's dirty PTE bit in the kernel
    page tables on kmap; on kunmap, check page->mapping for user maps and then
    the dirty bit, and only flush if the page both has user mappings and is
    dirty.
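    That flush decision reduces to a two-flag test, sketched here with
    hypothetical structure and function names (not the real kmap code):

```c
#include <assert.h>

/* Minimal model of the proposed kunmap()-time check: flush only when
 * the page was dirtied through the kernel mapping *and* user mappings
 * of it exist. */
struct kpage {
    int kernel_pte_dirty;    /* dirty bit from the kernel page tables */
    int has_user_mappings;   /* page->mapping shows user maps */
};

static int kunmap_needs_flush(const struct kpage *p)
{
    return p->has_user_mappings && p->kernel_pte_dirty;
}
```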

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    James Bottomley
     
  • The original commit code assumes that when a buffer on the BJ_SyncData
    list is locked, it is being written to disk. But this is not true, and
    hence it can lead to potential data loss on a crash. The code also didn't
    account for the fact that journal_dirty_data() can steal buffers from the
    committing transaction, and hence could write buffers that no longer
    belong to the committing transaction. Finally, it was possible that we
    tried writing out one buffer several times.

    The patch below tries to solve these problems by a complete rewrite of the
    data commit code. We go through buffers on t_sync_datalist, lock buffers
    needing write out and store them in an array. Buffers are also immediately
    refiled to BJ_Locked list or unfiled (if the write out is completed). When
    the array is full or we have to block on buffer lock, we submit all
    accumulated buffers for IO.
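    The batching scheme can be modelled with a small loop; everything below
    is illustrative (the real code also locks buffers, refiles them to
    BJ_Locked, and submits actual I/O):

```c
#include <assert.h>

#define BATCH 8   /* illustrative array size */

static int submitted;   /* counts buffers handed to "I/O" */

static void submit_batch(const int *batch, int *count)
{
    (void)batch;         /* the queued indices would be submitted here */
    submitted += *count;
    *count = 0;
}

/* Walk the sync-data list, queueing buffers that need writeout and
 * flushing the accumulated batch whenever the array fills up. */
static int commit_data(const int needs_write[], int n)
{
    int batch[BATCH], count = 0, i;

    for (i = 0; i < n; i++) {
        if (!needs_write[i])
            continue;               /* already written: just unfile */
        batch[count++] = i;
        if (count == BATCH)
            submit_batch(batch, &count);
    }
    submit_batch(batch, &count);    /* flush the remainder */
    return submitted;
}
```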

    [suitable for 2.6.18.x around the 2.6.19-rc2 timeframe]

    Signed-off-by: Jan Kara
    Cc: Badari Pulavarty
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • get_cpu_var()/per_cpu()/__get_cpu_var() arguments must be simple
    identifiers. Otherwise the arch-dependent implementations might break.

    This patch enforces the correct usage of the macros by producing a syntax
    error if the variable is not a simple identifier.
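    The enforcement trick is token pasting: the macro glues its argument into
    a generated symbol name, so anything but a plain identifier fails to
    compile. A simplified userspace sketch, with per-"cpu" storage modelled
    as a two-element array:

```c
#include <assert.h>

/* Pasting `name` into per_cpu__##name means per_cpu(counter, 0)
 * references per_cpu__counter, while per_cpu(*ptr, 0) would try to
 * form the token per_cpu__*ptr -- a compile-time syntax error. */
#define DEFINE_PER_CPU(type, name) type per_cpu__##name[2]
#define per_cpu(name, cpu) (per_cpu__##name[(cpu)])

DEFINE_PER_CPU(int, counter);
```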

    Signed-off-by: Jan Blunck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Blunck
     
  • The scheduler will stop load balancing if the busiest processor contains
    processes pinned via processor affinity.

    The scheduler currently only does one search for busiest cpu. If it cannot
    pull any tasks away from the busiest cpu because they were pinned then the
    scheduler goes into a corner and sulks leaving the idle processors idle.

    F.e. if processor 0 is busy running four tasks pinned via taskset, there
    are none on processor 1, and someone has just started two processes on
    processor 2, then the scheduler will not move one of the two processes
    away from processor 2.

    This patch fixes that issue by forcing the scheduler to come out of its
    corner and retrying the load balancing by considering other processors for
    load balancing.

    This patch was originally developed by John Hawkes and discussed at

    http://marc.theaimsgroup.com/?l=linux-kernel&m=113901368523205&w=2.

    I have removed extraneous material and gone back to equipping struct rq
    with the cpu the queue is associated with since this makes the patch much
    easier and it is likely that others in the future will have the same
    difficulty of figuring out which processor owns which runqueue.

    The overhead added through these patches is a single word on the stack if
    the kernel is configured to support 32 cpus or less (32 bit). For 32-bit
    environments the maximum number of cpus that can be configured is 255,
    which would result in the use of 32 additional bytes on the stack. On
    IA64 up to 1k cpus can be configured, which will result in the use of 128
    additional bytes on the stack. The maximum additional cache footprint is
    one cacheline. Typically memory use will be much less than a cacheline,
    and the additional cpumask will be placed on the stack in a cacheline that
    already contains other local variables.
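    The retry loop can be sketched with a toy model, where a bitmask of
    candidate CPUs stands in for the on-stack cpumask (all names and sizes
    are illustrative):

```c
#include <assert.h>

#define NR_CPUS 4   /* illustrative */

/* Pick the busiest CPU; if every task there is pinned, drop it from
 * the candidate mask and search again instead of giving up. */
static int find_busiest_unpinned(const int load[NR_CPUS],
                                 const int all_pinned[NR_CPUS])
{
    unsigned candidates = (1u << NR_CPUS) - 1;

    while (candidates) {
        int cpu, busiest = -1;

        for (cpu = 0; cpu < NR_CPUS; cpu++)
            if ((candidates & (1u << cpu)) &&
                (busiest < 0 || load[cpu] > load[busiest]))
                busiest = cpu;

        if (!all_pinned[busiest])
            return busiest;              /* tasks can be pulled from here */
        candidates &= ~(1u << busiest);  /* exclude and retry */
    }
    return -1;                           /* nothing movable anywhere */
}
```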

    Signed-off-by: Christoph Lameter
    Cc: John Hawkes
    Cc: "Siddha, Suresh B"
    Cc: Ingo Molnar
    Cc: Nick Piggin
    Cc: Peter Williams
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter