23 Sep, 2006

1 commit

  • * master.kernel.org:/pub/scm/linux/kernel/git/davej/agpgart:
    [AGPGART] Rework AGPv3 modesetting fallback.
    [AGPGART] Add suspend callback for i965
    [AGPGART] Fix number of aperture sizes in 830 gart structs.
    [AGPGART] Intel 965 Express support.
    [AGPGART] agp.h: constify struct agp_bridge_data::version
    [AGPGART] const'ify VIA AGP PCI table.
    [AGPGART] CONFIG_PM=n slim: drivers/char/agp/intel-agp.c
    [AGPGART] CONFIG_PM=n slim: drivers/char/agp/efficeon-agp.c
    [AGPGART] Const'ify the agpgart driver version.
    [AGPGART] remove private page protection map

    Linus Torvalds
     

09 Sep, 2006

1 commit

  • If a CPU faults this page into pagetables after invalidate_mapping_pages()
    checked page_mapped(), invalidate_complete_page() will still proceed to remove
    the page from pagecache. This leaves the page-faulting process with a
    detached page. If it was MAP_SHARED then file data loss will ensue.

    Fix that up by checking the page's refcount after taking tree_lock (a
    sketch follows this entry).

    Cc: Nick Piggin
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
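
    A minimal sketch of the check described above, against the 2.6.18-era
    mm/truncate.c layout (details and surrounding code are assumed, not
    quoted from the patch):

        /* invalidate_complete_page() (sketch): only drop the page from the
         * pagecache if nobody grabbed an extra reference after the earlier
         * page_mapped() check.  Exactly two references are expected here:
         * the caller's and the pagecache's. */
        write_lock_irq(&mapping->tree_lock);
        if (page_count(page) != 2) {
                /* Raced with a fault; leave the page alone. */
                write_unlock_irq(&mapping->tree_lock);
                return 0;
        }
        __remove_from_page_cache(page);
        write_unlock_irq(&mapping->tree_lock);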
     

08 Sep, 2006

1 commit

  • This prevents cross-region mappings on IA64 and SPARC, which could lead
    to a system crash. They were correctly trapped for normal mmap() calls,
    but not for the kernel-internal calls generated by executable loading.

    This code just moves the architecture-specific cross-region checks into
    an arch-specific "arch_mmap_check()" macro, and defines that for the
    architectures that needed it (ia64, sparc and sparc64).

    Architectures that don't have any special requirements can just ignore
    the new cross-region check, since the mmap() code will just notice on
    its own when the macro isn't defined.

    Signed-off-by: Pavel Emelianov
    Signed-off-by: Kirill Korotaev
    Acked-by: David Miller
    Signed-off-by: Greg Kroah-Hartman
    [ Cleaned up to not affect architectures that don't need it ]
    Signed-off-by: Linus Torvalds

    Kirill Korotaev
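
    A minimal sketch of the generic side of the hook described above (the
    exact placement in mm/mmap.c and the location of the fallback are
    assumed):

        /* Architectures with cross-region restrictions (ia64, sparc,
         * sparc64) define arch_mmap_check() in their headers; everyone
         * else falls back to this no-op. */
        #ifndef arch_mmap_check
        #define arch_mmap_check(addr, len, flags)       (0)
        #endif

        error = arch_mmap_check(addr, len, flags);
        if (error)
                return error;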
     

06 Sep, 2006

1 commit


02 Sep, 2006

4 commits

  • Since vma->vm_pgoff is in units of small pages, VMAs for huge pages have
    the lower HPAGE_SHIFT - PAGE_SHIFT bits always cleared, which results in
    bad offsets to the interleave functions. Take this difference from small
    pages into account when calculating the offset (a sketch follows this
    entry). This does add a 0-bit shift into the small-page path (via
    alloc_page_vma()), but I think that is negligible. Also add a BUG_ON to
    prevent the offset from growing due to a negative right-shift, which
    probably shouldn't be allowed anyway.

    Tested on an 8-memory node ppc64 NUMA box and got the interleaving I
    expected.

    Signed-off-by: Nishanth Aravamudan
    Signed-off-by: Adam Litke
    Cc: Andi Kleen
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
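
    A minimal sketch of the offset calculation described above, as it might
    look in mm/mempolicy.c's interleave helper (shift is the mapping's page
    shift; variable names are assumed):

        /* vm_pgoff is in small-page units; convert it to the mapping's
         * page size before adding the offset within the VMA. */
        BUG_ON(shift < PAGE_SHIFT);
        off = vma->vm_pgoff >> (shift - PAGE_SHIFT);
        off += (addr - vma->vm_start) >> shift;
        return offset_il_node(pol, vma, off);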
     
  • This patch works around a complex dm-related deadlock/livelock down in the
    mempool allocator.

    Alasdair said:

    Several dm targets suffer from this.

    Mempools are not yet used correctly everywhere in device-mapper: they can
    get shared when devices are stacked, and some targets share them across
    multiple instances. I made fixing this one of the prerequisites for this
    patch:

    md-dm-reduce-stack-usage-with-stacked-block-devices.patch

    which in some cases makes people more likely to hit the problem.

    There's been some progress on this recently with (unfinished) dm-crypt
    patches at:

    http://www.kernel.org/pub/linux/kernel/people/agk/patches/2.6/editing/
    (dm-crypt-move-io-to-workqueue.patch plus dependencies)

    and:

    I've no problems with a temporary workaround like that, but Milan Broz (a
    new Red Hat developer in the Czech Republic) has started reviewing all the
    mempool usage in device-mapper, so I'm expecting we'll soon have a proper
    fix for these associated problems. [He's back from holiday at the start of
    next week.]

    For now, this sad-but-safe little patch will allow the machine to recover.

    [akpm@osdl.org: rewrote changelog]
    Cc: Alasdair G Kergon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Mironchik
     
  • The ZVC counter update threshold is currently set to a fixed value of 32.
    This patch sets up the threshold depending on the number of processors and
    the sizes of the zones in the system.

    With the current threshold of 32, I was able to observe slight contention
    when more than 130-140 processors concurrently updated the counters. The
    contention vanished when I either increased the threshold to 64 or used
    Andrew's idea of overstepping the interval (see ZVC overstep patch).

    However, we saw contention again at 220-230 processors. So we need higher
    values for larger systems.

    But the current default is already a bit of an overkill for smaller
    systems. Some systems have tiny zones where precision matters. For
    example i386 and x86_64 have 16M DMA zones and either 900M ZONE_NORMAL or
    ZONE_DMA32. These are even present on SMP and NUMA systems.

    The patch here sets up a threshold based on the number of processors in the
    system and the size of the zone that these counters are used for. The
    threshold should grow logarithmically, so we use fls() as an easy
    approximation.

    Results of tests on a system with 1024 processors (4TB RAM)

    The following output is from a test allocating 1GB of memory concurrently
    on each processor (Forking the process. So contention on mmap_sem and the
    pte locks is not a factor):

    TYPE    CPUS    WALL(X)   WALL(MIN)         SYS     USER     TOTCPU
    fork       1      0.552       0.552       0.540    0.012      0.552
    fork       4      0.552       0.548       2.164    0.036      2.200
    fork      16      0.564       0.548       8.812    0.164      8.976
    fork     128      0.580       0.572      72.204    1.208     73.412
    fork     256      1.300       0.660     310.400    2.160    312.560
    fork     512      3.512       0.696    1526.836    4.816   1531.652
    fork    1020     20.024       0.700   17243.176    6.688  17249.863

    So a threshold of 32 is fine up to 128 processors. At 256 processors contention
    becomes a factor.

    Overstepping the counter (earlier patch) improves the numbers a bit:

    fork       4      0.552       0.548       2.164    0.040      2.204
    fork      16      0.552       0.548       8.640    0.148      8.788
    fork     128      0.556       0.548      69.676    0.956     70.632
    fork     256      0.876       0.636     212.468    2.108    214.576
    fork     512      2.276       0.672     997.324    4.260   1001.584
    fork    1020     13.564       0.680   11586.436    6.088  11592.523

    Still contention at 512 and 1020. Contention at 1020 is down by a third.
    256 still has a slight bit of contention.

    After this patch the counter threshold will be set to 125 which reduces
    contention significantly:

    fork     128      0.560       0.548      69.776    0.932     70.708
    fork     256      0.636       0.556     143.460    2.036    145.496
    fork     512      0.640       0.548     284.244    4.236    288.480
    fork    1020      1.500       0.588    1326.152    8.892   1335.044

    [akpm@osdl.org: !SMP build fix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
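
    A minimal sketch of the sizing heuristic described above (function shape
    and constant choices as assumed here):

        /* Scale the per-cpu delta threshold with the number of online CPUs
         * and the zone size; fls() gives cheap logarithmic scaling.  Cap
         * the result so per-cpu deltas still fit a signed byte. */
        static int calculate_threshold(struct zone *zone)
        {
                int mem;        /* zone memory in units of 128MB */

                mem = zone->present_pages >> (27 - PAGE_SHIFT);
                return min(125, 2 * fls(num_online_cpus()) * (1 + fls(mem)));
        }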
     
  • Increments and decrements are usually grouped rather than mixed. We can
    optimize the inc and dec functions for that case.

    Increment and decrement the counters by 50% more than the threshold in
    those cases and set the differential accordingly. This decreases the need
    to update the atomic counters.

    The idea came originally from Andrew Morton. The overstepping alone was
    sufficient to address the contention issue found when updating the global
    and the per zone counters from 160 processors.

    Also remove some code in dec_zone_page_state.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
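
    A minimal sketch of the overstep on the increment side (per-cpu
    bookkeeping names are assumed; p points at the per-cpu differential):

        (*p)++;
        if (unlikely(*p > threshold)) {
                int overstep = threshold / 2;

                /* Push threshold/2 extra into the global counter and start
                 * the next run at -threshold/2, so a grouped sequence of
                 * increments goes ~50% longer before the next atomic update. */
                zone_page_state_add(*p + overstep, zone, item);
                *p = -overstep;
        }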
     

28 Aug, 2006

1 commit

  • There is a bug in mm/swapfile.c#swap_type_of() that makes swsusp only be
    able to use the first active swap partition as the resume device. Fix it.

    Signed-off-by: Rafael J. Wysocki
    Cc: Hugh Dickins
    Acked-by: Pavel Machek
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     

15 Aug, 2006

1 commit


06 Aug, 2006

4 commits

  • This patch is a collision-check enhancement for memory hot add.

    It's better to do the resource collision check before doing memory hot add,
    which will touch memory management structures.

    add_section() should also check whether the section already exists before
    calling sparse_add_one_section(). (sparse_add_one_section() will do another
    check anyway, but checking in memory_hotplug.c is easier to understand.)

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: keith mannthey
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • find_next_system_ram() is used to find an available memory resource when
    onlining newly added memory. This patch fixes the following problem.

    find_next_system_ram() cannot catch this case:

    Resource: (start)-------------(end)
    Section :           (start)-------------(end)

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Keith Mannthey
    Cc: Yasunori Goto
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • The ioresource handling code in memory hotplug allows non-aligned memory
    hot add. But when the memmap and other memory structures are initialized,
    the parameters should be aligned (if they are not, initialization of the
    mem_map will go wrong, since it assumes aligned parameters). This patch
    fixes that.

    It also makes the ioresource collision check handle -EEXIST.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Keith Mannthey
    Cc: Yasunori Goto
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • The POSIX_FADV_NOREUSE hint means "the application will use this range of the
    file a single time". It seems to be intended that the implementation will use
    this hint to perform drop-behind of that part of the file when the application
    gets around to reading or writing it.

    However for reasons which aren't obvious (or sane?) I mapped
    POSIX_FADV_NOREUSE onto POSIX_FADV_WILLNEED. ie: it does readahead.

    That's daft. So for now, make POSIX_FADV_NOREUSE a no-op.

    This is a non-back-compatible change. If someone was using POSIX_FADV_NOREUSE
    to perform readahead, they lose. The likelihood is low.

    If/when we later implement POSIX_FADV_NOREUSE things will get interesting -
    to do it fully we'll need to maintain file offset/length ranges and perform
    all sorts of complex tricks, and managing the lifetime of those ranges'
    data structures will be interesting.

    A sensible implementation would probably ignore the file range and would
    simply mark the entire file as needing some form of drop-behind treatment.

    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
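
    The visible effect is just an empty case in sys_fadvise64_64()'s switch,
    roughly:

        case POSIX_FADV_NOREUSE:
                /* Accepted for compatibility but intentionally a no-op now;
                 * it used to fall into the WILLNEED/readahead path. */
                break;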
     

01 Aug, 2006

2 commits


30 Jul, 2006

1 commit


27 Jul, 2006

1 commit


15 Jul, 2006

4 commits

  • Unlike earlier iterations of the delay accounting patches, delays are now
    collected only for the actual I/O waits rather than trying to cover the
    delays seen in I/O submission paths.

    Account separately for block I/O delays incurred as a result of swapin page
    faults whose frequency can be affected by the task/process' rss limit. Hence
    swapin delays can act as feedback for rss limit changes independent of I/O
    priority changes.

    Signed-off-by: Shailabh Nagar
    Signed-off-by: Balbir Singh
    Cc: Jes Sorensen
    Cc: Peter Chubb
    Cc: Erich Focht
    Cc: Levent Serinol
    Cc: Jay Lan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shailabh Nagar
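
    A minimal sketch of how the swapin wait is bracketed in do_swap_page()
    (placement and surrounding code are assumed):

        /* Flag the delay we are about to incur as swapin-related so it is
         * accounted separately from ordinary block I/O delay. */
        delayacct_set_flag(DELAYACCT_PF_SWAPIN);
        page = lookup_swap_cache(entry);
        if (!page)
                page = read_swap_cache_async(entry, vma, address);
        /* ... locking and the rest of the fault handling ... */
        delayacct_clear_flag(DELAYACCT_PF_SWAPIN);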
     
  • nommu.c needs to export two more symbols for drivers to use:
    remap_pfn_range and unmap_mapping_range.

    Signed-off-by: Luke Yang
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luke Yang
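
    The exports themselves are one-liners in mm/nommu.c:

        EXPORT_SYMBOL(remap_pfn_range);
        EXPORT_SYMBOL(unmap_mapping_range);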
     
  • There is a race condition that showed up in a threaded JIT environment.
    The situation is that a process with a JIT code page forks, so the page is
    marked read-only, then some threads are created in the child. One of the
    threads attempts to add a new code block to the JIT page, so a
    copy-on-write fault is taken, and the kernel allocates a new page, copies
    the data, installs the new pte, and then calls lazy_mmu_prot_update() to
    flush caches to make sure that the icache and dcache are in sync.
    Unfortunately, the other thread runs right after the new pte is installed,
    but before the caches have been flushed. It tries to execute some old JIT
    code that was already in this page, but it sees some garbage in the i-cache
    from the previous users of the new physical page.

    Fix: we must make the caches consistent before installing the pte. This is
    an ia64 only fix because lazy_mmu_prot_update() is a no-op on all other
    architectures.

    Signed-off-by: Anil Keshavamurthy
    Signed-off-by: Tony Luck
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anil Keshavamurthy
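
    A minimal sketch of the resulting ordering in do_wp_page()'s COW path
    (surrounding code is assumed):

        entry = mk_pte(new_page, vma->vm_page_prot);
        entry = maybe_mkwrite(pte_mkdirty(entry), vma);
        /* Make the i-cache/d-cache consistent for the new page first... */
        lazy_mmu_prot_update(entry);
        /* ...and only then make the mapping visible to other threads. */
        ptep_establish(vma, address, page_table, entry);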
     
  • __vunmap must not rely on area->nr_pages when picking the release method
    for area->pages; it may be too small when __vmalloc_area_node failed early
    due to lack of memory. Instead, use a flag in vm_struct to differentiate
    (a sketch follows this entry).

    Signed-off-by: Jan Kiszka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kiszka
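
    A minimal sketch of the flag test described above (the flag name used
    here, VM_VPAGES, is an assumption):

        /* area->pages itself is vmalloc'ed for very large areas; remember
         * that in area->flags rather than inferring it from nr_pages,
         * which may be too small after an early allocation failure. */
        if (area->flags & VM_VPAGES)
                vfree(area->pages);
        else
                kfree(area->pages);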
     

14 Jul, 2006

3 commits

  • Chandra Seetharaman reported SLAB crashes caused by the slab.c lock
    annotation patch. There is only one chunk of that patch that has a
    material effect on the slab logic - this patch undoes that chunk.

    This was confirmed to fix the slab problem by Chandra.

    Signed-off-by: Ingo Molnar
    Tested-by: Chandra Seetharaman
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • mm/slab.c uses nested locking when dealing with 'off-slab'
    caches: in that case it allocates the slab header from the
    (on-slab) kmalloc caches. Teach the lock validator about
    this by putting all on-slab caches into a separate class.

    This patch has no effect on non-lockdep kernels.

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Ingo Molnar
    Signed-off-by: Linus Torvalds

    Arjan van de Ven
     
  • Undo the existing mm/slab.c lock-validator annotations, in preparation
    for a new, less intrusive annotation patch.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     

11 Jul, 2006

5 commits


04 Jul, 2006

6 commits

  • cleanup: remove task_t and convert all its uses to struct task_struct. I
    introduced it for the scheduler back then, and it was a mistake.

    Conversion was mostly scripted, the result was reviewed and all
    secondary whitespace and style impact (if any) was fixed up by hand.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • Teach the lock validator about the special (recursive) locking here (a
    small fragment follows this entry). This has no effect on non-lockdep
    kernels.

    Fix initialize-locks-via-memcpy assumptions.

    Effects on non-lockdep kernels: the subclass nesting parameter is passed into
    cache_free_alien() and __cache_free(), and turns one internal
    kmem_cache_free() call into an open-coded __cache_free() call.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Arjan van de Ven
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
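
    A minimal, illustrative fragment of what the subclass plumbing enables
    (the local/remote names here are generic placeholders, not lifted from
    the patch):

        /* Lock our own node's list first... */
        spin_lock(&local->lock);
        /* ...then take a second lock of the same class for the remote node;
         * the explicit subclass tells lockdep this nesting is intentional
         * rather than a self-deadlock. */
        spin_lock_nested(&remote->lock, SINGLE_DEPTH_NESTING);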
     
  • Teach the lock validator about the special (recursive) locking here. This
    has no effect on non-lockdep kernels.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • Locking init improvement:

    - introduce and use __SPIN_LOCK_UNLOCKED for array initializations,
    to pass in the name string of locks, used by debugging

    Signed-off-by: Ingo Molnar
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
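
    A small, purely illustrative example of the initializer style this
    enables (the array and its size are hypothetical):

        /* Each element gets a named initializer that the lock debugging
         * code can report, instead of the anonymous SPIN_LOCK_UNLOCKED. */
        static spinlock_t bucket_lock[NR_BUCKETS] = {
                [0 ... NR_BUCKETS - 1] = __SPIN_LOCK_UNLOCKED(bucket_lock)
        };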
     
  • Generic lock debugging:

    - generalized lock debugging framework. For example, a bug in one lock
    subsystem turns off debugging in all lock subsystems.

    - got rid of the caller address passing (__IP__/__IP_DECL__/etc.) from
    the mutex/rtmutex debugging code: it caused way too much prototype
    hackery, and lockdep will give the same information anyway.

    - ability to do silent tests

    - check lock freeing in vfree too.

    - more finegrained debugging options, to allow distributions to
    turn off more expensive debugging features.

    There's no separate 'held mutexes' list anymore - but there's a 'held locks'
    stack within lockdep, which unifies deadlock detection across all lock
    classes. (this is independent of the lockdep validation stuff - lockdep first
    checks whether we are holding a lock already)

    Here are the current debugging options:

    CONFIG_DEBUG_MUTEXES=y
    CONFIG_DEBUG_LOCK_ALLOC=y

    which do:

    config DEBUG_MUTEXES
    bool "Mutex debugging, basic checks"

    config DEBUG_LOCK_ALLOC
    bool "Detect incorrect freeing of live mutexes"

    Signed-off-by: Ingo Molnar
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • It turns out that it is advantageous to leave a small portion of unmapped file
    backed pages if all of a zone's pages (or almost all pages) are allocated and
    so the page allocator has to go off-node.

    This allows recently used file I/O buffers to stay on the node and
    reduces the times that zone reclaim is invoked if file I/O occurs
    when we run out of memory in a zone.

    The problem is that zone reclaim runs too frequently when the page cache is
    used for file I/O (read/write and therefore unmapped pages!) alone and we
    have almost all pages of the zone allocated. Zone reclaim may remove 32
    unmapped pages. File I/O will use these pages for the next read/write
    requests and the unmapped pages increase. After the zone has filled up
    again, zone reclaim will remove them again after only 32 pages. This cycle
    is too inefficient and there are potentially too many zone reclaim cycles.

    With the 1% boundary we may still remove all unmapped pages for file I/O in
    a zone reclaim pass (the resulting check is sketched after this entry).
    However, it will take a large number of reads and writes to get back to 1%
    again, where we trigger zone reclaim again.

    Zone reclaim in 2.6.16/17 does not show this behavior because we have a
    30 second timeout.

    [akpm@osdl.org: rename the /proc file and the variable]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
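
    A minimal sketch of the resulting check at the top of zone_reclaim()
    (the 1% figure is the default; the real patch reads the limit from the
    new /proc tunable rather than computing it inline):

        unsigned long min_unmapped = zone->present_pages / 100;

        /* Leave the zone alone while it still holds enough unmapped
         * pagecache for file I/O to reuse. */
        if (zone_page_state(zone, NR_FILE_PAGES) -
            zone_page_state(zone, NR_FILE_MAPPED) <= min_unmapped)
                return 0;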
     

01 Jul, 2006

4 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial:
    Remove obsolete #include
    remove obsolete swsusp_encrypt
    arch/arm26/Kconfig typos
    Documentation/IPMI typos
    Kconfig: Typos in net/sched/Kconfig
    v9fs: do not include linux/version.h
    Documentation/DocBook/mtdnand.tmpl: typo fixes
    typo fixes: specfic -> specific
    typo fixes in Documentation/networking/pktgen.txt
    typo fixes: occuring -> occurring
    typo fixes: infomation -> information
    typo fixes: disadvantadge -> disadvantage
    typo fixes: aquire -> acquire
    typo fixes: mecanism -> mechanism
    typo fixes: bandwith -> bandwidth
    fix a typo in the RTC_CLASS help text
    smb is no longer maintained

    Manually merged trivial conflict in arch/um/kernel/vmlinux.lds.S

    Linus Torvalds
     
  • Post and discussion:
    http://marc.theaimsgroup.com/?t=115074342800003&r=1&w=2

    Code in __shrink_node() duplicates code in cache_reap()

    Add a new function drain_freelist that removes slabs with objects that are
    already free and use that in various places.

    This eliminates the __node_shrink() function and provides the interrupt
    holdoff reduction from slab_free to code that used to call __node_shrink.

    [akpm@osdl.org: build fixes]
    Signed-off-by: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The remaining counters in page_state after the zoned VM counter patches
    have been applied are all just for show in /proc/vmstat. They have no
    essential function for the VM.

    We use a simple increment of per-cpu variables. In order to avoid the most
    severe races we disable preemption. Preemption does not prevent the race
    between an increment and an interrupt handler incrementing the same
    statistics counter. However, that race is exceedingly rare; we may only
    lose an increment or so, and there is no requirement (at least not in the
    kernel) that the vm event counters have to be accurate.

    In the non preempt case this results in a simple increment for each
    counter. For many architectures this will be reduced by the compiler to a
    single instruction. This single instruction is atomic for i386 and x86_64.
    And therefore even the rare race condition in an interrupt is avoided for
    both architectures in most cases.

    The patchset also adds an off switch for embedded systems that allows
    building Linux kernels without these counters.

    The implementation of these counters is through inline code that hopefully
    results in only a single increment instruction being emitted (i386, x86_64),
    or in the increment being hidden through instruction concurrency (EPIC
    architectures such as ia64 can get that done). A sketch of the increment
    path follows this entry.

    Benefits:
    - VM event counter operations usually reduce to a single inline instruction
    on i386 and x86_64.
    - No interrupt disable, only preempt disable for the preempt case.
    Preempt disable can also be avoided by moving the counter into a spinlock.
    - Handling is similar to zoned VM counters.
    - Simple and easily extendable.
    - Can be omitted to reduce memory use on embedded systems.

    References:

    RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=113512330605497&w=2
    RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=114988082814934&w=2
    local_t http://marc.theaimsgroup.com/?l=linux-kernel&m=114991748606690&w=2
    V2 http://marc.theaimsgroup.com/?t=115014808400007&r=1&w=2
    V3 http://marc.theaimsgroup.com/?l=linux-kernel&m=115024767022346&w=2
    V4 http://marc.theaimsgroup.com/?l=linux-kernel&m=115047968808926&w=2

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
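
    A minimal sketch of the increment path described above (structure and
    enum names as assumed here):

        struct vm_event_state {
                unsigned long event[NR_VM_EVENT_ITEMS];
        };

        DECLARE_PER_CPU(struct vm_event_state, vm_event_states);

        static inline void count_vm_event(enum vm_event_item item)
        {
                /* get_cpu_var() disables preemption around the increment;
                 * interrupts are deliberately left enabled (see above). */
                get_cpu_var(vm_event_states).event[item]++;
                put_cpu_var(vm_event_states);
        }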
     
  • The numa statistics are really event counters. But they are per node and
    so we have had special treatment for these counters through additional
    fields on the pcp structure. We can now use the per zone nature of the
    zoned VM counters to realize these.

    This will shrink the size of the pcp structure on NUMA systems. We will
    have some room to add additional per zone counters that will all still fit
    in the same cacheline.

    Bits   Prior pcp size          Size after patch        We can add
    ------------------------------------------------------------------
    64     128 bytes (16 words)    80 bytes (10 words)     48
    32      76 bytes (19 words)    56 bytes (14 words)      8 (64 byte cacheline)
                                                            72 (128 byte cacheline)

    Remove the special statistics for numa and replace them with zoned vm
    counters. This has the side effect that global sums of these events now
    show up in /proc/vmstat.

    Also take the opportunity to move the zone_statistics() function from
    page_alloc.c into vmstat.c.

    Discussions:
    V2 http://marc.theaimsgroup.com/?t=115048227000002&r=1&w=2

    Signed-off-by: Christoph Lameter
    Acked-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
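
    A minimal sketch of zone_statistics() once it operates on the zoned VM
    counters (exact form assumed; z is the zone the page was taken from):

        void zone_statistics(struct zonelist *zonelist, struct zone *z)
        {
                /* Hit or miss relative to the preferred (first) zone's node */
                if (z->zone_pgdat == zonelist->zones[0]->zone_pgdat) {
                        __inc_zone_state(z, NUMA_HIT);
                } else {
                        __inc_zone_state(z, NUMA_MISS);
                        __inc_zone_state(zonelist->zones[0], NUMA_FOREIGN);
                }
                /* Local or remote relative to the allocating CPU's node */
                if (z->zone_pgdat == NODE_DATA(numa_node_id()))
                        __inc_zone_state(z, NUMA_LOCAL);
                else
                        __inc_zone_state(z, NUMA_OTHER);
        }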