01 Jul, 2006

40 commits

  • - Add lower-level functions that handle various parts of the initialization
    done by the xxx_probe1() functions. Some of the xxx_probe1() functions are
    much too long and complicated (see "Chapter 5: Functions" in
    Documentation/CodingStyle).

    - Cleanup of probe1() functions in EDAC

    Signed-off-by: Doug Thompson
    Cc: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Doug Thompson
     
  • Remove add_mc_to_global_list(). In the next patch, this function will be
    reimplemented with different semantics.

    1. Reimplement add_mc_to_global_list() with semantics that allow the caller to
    determine the ID number for a mem_ctl_info structure. Then modify
    edac_mc_add_mc() so that the caller specifies the ID number for the new
    mem_ctl_info structure. Platform-specific code should be able to assign the
    ID numbers in a platform-specific manner. For instance, on Opteron it makes
    sense to have the ID of the mem_ctl_info structure match the ID of the node
    that the memory controller belongs to.

    2. Modify callers of edac_mc_add_mc() so they use the new semantics (sketched below).
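
    A minimal sketch of the new calling convention, with edac_mc_add_mc()
    taking an explicit index; the exact signature and the helper names are
    assumptions based on the description above:

        /* Hedged sketch: the platform driver now chooses the mem_ctl_info
         * ID itself; on Opteron the memory controller's node ID is the
         * natural choice. */
        mci = edac_mc_alloc(sizeof(*pvt), nr_csrows, nr_chans);
        if (mci == NULL)
                return -ENOMEM;
        /* ... platform-specific initialization of mci ... */
        if (edac_mc_add_mc(mci, node_id)) {     /* caller-chosen ID */
                edac_mc_free(mci);
                return -ENODEV;
        }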

    Signed-off-by: Doug Thompson
    Cc: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Doug Thompson
     
  • Change MC drivers from using CVS revision strings for their version
    numbers; each driver now has its own local version string.

    Remove some PCI dependencies from the core EDAC module, making the code
    'struct device' centric instead of 'struct pci_dev' centric. Most of the
    code changes here are from a patch by Dave Jiang. It may be best to
    eventually move the PCI-specific code into a separate source file.

    Signed-off-by: Doug Thompson
    Cc: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Doug Thompson
     
  • Fix two typos in Documentation/IPMI.

    Acked-by: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt LaPlante
     
  • Add __acquire annotations to rcu_read_lock and rcu_read_lock_bh, and add
    __release annotations to rcu_read_unlock and rcu_read_unlock_bh. This
    allows sparse to detect improperly paired calls to these functions.
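
    As a sketch of the annotation pattern (the macro bodies here are
    illustrative, not the exact patch):

        /* __acquire()/__release() tell sparse that a lock-like context is
         * entered and left, so "make C=1" flags unbalanced usage. */
        #define rcu_read_lock() \
                do { \
                        preempt_disable(); \
                        __acquire(RCU); /* sparse: context acquired */ \
                } while (0)

        #define rcu_read_unlock() \
                do { \
                        __release(RCU); /* sparse: context released */ \
                        preempt_enable(); \
                } while (0)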

    Signed-off-by: Josh Triplett
    Acked-by: Paul E. McKenney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josh Triplett
     
  • - make __cm206_init() __init (required since it calls
    the __init cm206_init())
    - make the needlessly global bcdbin() static
    - remove a comment with an obsolete compile command

    Signed-off-by: Adrian Bunk
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • Don't show a menu that can't be entered due to lack of contents on arm (the
    options are only available on arm26).

    Signed-off-by: Adrian Bunk
    Acked-by: Ian Molton
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • This corrects the comments describing the 'enabled' and 'pending' flags in
    struct rtc_wkalrm of include/linux/rtc.h.

    Signed-off-by: Andrew Victor
    Cc: Alessandro Zummo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Victor
     
  • Fix a bug identified by Zou Nan hai:

    If the system is in state SYSTEM_BOOTING, and need_resched() is true,
    cond_resched() returns true even though it didn't reschedule. Consequently
    need_resched() remains true and JBD locks up.

    Fix that by teaching cond_resched() to only return true if it really did call
    schedule().

    cond_resched_lock() and cond_resched_softirq() have a problem too. If we're
    in SYSTEM_BOOTING state and need_resched() is true, these functions will drop
    the lock and will then try to call schedule(), but the SYSTEM_BOOTING state
    will prevent schedule() from being called. So on return, need_resched() will
    still be true, but cond_resched_lock() has to return 1 to tell the caller that
    the lock was dropped. The caller will probably lock up.

    Bottom line: if these functions dropped the lock, they _must_ call schedule()
    to clear need_resched(). Make it so.

    Also, uninline __cond_resched(). It's largeish, and slowpath.
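
    A minimal sketch of the corrected contract; the system_state test and
    the exact structure are assumed from the description:

        /* Return 1 only if schedule() really ran; keep rescheduling until
         * need_resched() is clear, so callers never see a stale flag. */
        int cond_resched(void)
        {
                if (need_resched() && system_state == SYSTEM_RUNNING) {
                        do {
                                schedule();
                        } while (need_resched());
                        return 1;
                }
                return 0;       /* did not reschedule */
        }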

    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • The x86_64 build requires a definition for __raw_writeq.
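
    One plausible definition, assuming the usual raw MMIO semantics (no
    barriers, no byte swapping); the exact UML placement is an assumption:

        static inline void __raw_writeq(u64 val, volatile void __iomem *addr)
        {
                *(volatile u64 __force *)addr = val;
        }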

    Signed-off-by: Jeff Dike
    Cc: Paolo 'Blaisorblade' Giarrusso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Dike
     
  • I run an x86_64 kernel with i386 userspace (Ubuntu Dapper) and decided to try
    out UML today. I found that UML wasn't quite aware of biarch compilers (which
    Ubuntu i386 ships). A fix similar to what was done for x86_64 should probably
    be committed (see
    http://marc.theaimsgroup.com/?l=linux-kernel&m=113425940204010&w=2). Without
    the FLAGS changes, the build will fail at a number of places and without the
    LINK change, the final link will fail.

    Signed-off-by: Nishanth Aravamudan
    Cc: Paolo 'Blaisorblade' Giarrusso
    Cc: Sam Ravnborg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Dike
     
  • Forgot to remove arch/um/kernel/time.c when it was mostly moved to
    arch/um/os-Linux.

    Signed-off-by: Jeff Dike
    Cc: Paolo 'Blaisorblade' Giarrusso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Dike
     
  • Remove um_time() and um_stime() syscalls since they are identical to
    system-wide ones.

    Signed-off-by: Paolo 'Blaisorblade' Giarrusso
    Cc: Paolo 'Blaisorblade' Giarrusso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Dike
     
  • do_timer must be called with xtime_lock held. I'm not sure
    boot_timer_handler needs this; however, I don't think it hurts: it simply
    disables irqs and takes a spinlock.
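
    A sketch of the locking rule with the 2.6-era seqlock API; the handler
    name and surrounding code are illustrative, not the exact patch:

        static irqreturn_t um_timer(int irq, void *dev, struct pt_regs *regs)
        {
                unsigned long flags;

                write_seqlock_irqsave(&xtime_lock, flags);
                do_timer(regs);         /* updates jiffies; needs xtime_lock */
                write_sequnlock_irqrestore(&xtime_lock, flags);
                return IRQ_HANDLED;
        }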

    Signed-off-by: Paolo 'Blaisorblade' Giarrusso
    Cc: Paolo 'Blaisorblade' Giarrusso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Dike
     
  • -mm in combination with an FC5 init started dying with 'stderr=1' because init
    didn't like the lack of /dev/console and exited. The problem was that the
    stderr console, which is intended to dump printk output to the terminal before
    the regular console is initialized, isn't a tty, and so can't make
    /dev/console operational.

    However, since it is registered first, the normal console, when it is
    registered, doesn't become the preferred console, and isn't attached to
    /dev/console. Thus, /dev/console is never operational.

    This patch makes the stderr console unregister itself in an initcall, which is
    late enough that the normal console is registered. When that happens, the
    normal console will become the preferred console and will be able to run
    /dev/console.
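
    A sketch of such an initcall (the console variable name is assumed):

        static int __init unregister_stderr_console(void)
        {
                /* By initcall time the real console is registered, so it
                 * becomes preferred once the early console goes away. */
                unregister_console(&stderr_console);
                return 0;
        }
        __initcall(unregister_stderr_console);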

    Signed-off-by: Jeff Dike
    Cc: Paolo 'Blaisorblade' Giarrusso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Dike
     
  • Fix an off-by-one bug in temp file creation. Seeking to the desired length
    and writing a byte resulted in the file being one byte longer than expected.
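
    The corrected pattern, sketched (the surrounding helper and its error
    labels are assumed): to make a file exactly "len" bytes long, seek to
    len - 1 before writing the single byte; seeking to len yields len + 1.

        if (lseek(fd, len - 1, SEEK_SET) < 0)
                goto out_close;
        if (write(fd, "\0", 1) != 1)    /* file is now exactly len bytes */
                goto out_close;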

    Signed-off-by: Jeff Dike
    Cc: Paolo 'Blaisorblade' Giarrusso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Dike
     
  • When parsing /proc/mounts looking for a tmpfs mount on /dev/shm, if a
    string that we are looking for is split across reads, then it won't be
    recognized.

    Fix this by refilling the buffer whenever we advance the cursor.
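
    A sketch of the refill-on-advance idea; the buffer layout and the helper
    name are assumptions, not the exact patch:

        /* After consuming bytes from the front, slide the remainder down
         * and read more, so a token can never straddle two reads. */
        static int refill(int fd, char *buf, int size, int consumed)
        {
                int remain = size - consumed;
                int n;

                memmove(buf, buf + consumed, remain);
                n = read(fd, buf + remain, consumed);
                if (n < 0)
                        return -errno;
                buf[remain + n] = '\0';    /* buf has size + 1 capacity */
                return n;
        }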

    Signed-off-by: Jeff Dike
    Cc: Paolo 'Blaisorblade' Giarrusso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Dike
     
  • Fix the INIT_ENV_ARG_LIMIT dependencies to what seems to have been
    intended.

    Spotted by Jean-Luc Leger.

    Signed-off-by: Adrian Bunk
    Acked-by: Paolo 'Blaisorblade' Giarrusso
    Cc: Jeff Dike
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • Presently, smp_processor_id() isn't necessarily set up until setup_arch().
    But it's used in boot_cpu_init() and printk() and perhaps in other places,
    prior to setup_arch() being called.

    So provide a new smp_setup_processor_id() which is called before anything
    else, and wire it up for Voyager (which boots on a CPU other than #0, and
    broke).
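
    A sketch of the weak-default pattern described above:

        /* Default stub, run first thing in start_kernel(); Voyager
         * overrides it to establish the real boot CPU number before
         * boot_cpu_init() and printk() call smp_processor_id(). */
        void __init __attribute__((weak)) smp_setup_processor_id(void)
        {
        }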

    Cc: James Bottomley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Add a new security hook definition for the sys_ioprio_get operation. At
    present, the SELinux hook function implementation for this hook is
    identical to the getscheduler implementation but a separate hook is
    introduced to allow this check to be specialized in the future if
    necessary.

    This patch also creates a helper function get_task_ioprio which handles the
    access check in addition to retrieving the ioprio value for the task.
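
    A sketch of the helper; the hook name security_task_getioprio() is
    assumed from the description:

        static int get_task_ioprio(struct task_struct *p)
        {
                int ret;

                ret = security_task_getioprio(p);
                if (ret)
                        return ret;     /* denied by a security module */
                return p->ioprio;
        }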

    Signed-off-by: David Quigley
    Acked-by: Stephen Smalley
    Signed-off-by: James Morris
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Quigley
     
  • This patch updates the USB core to save and pass the sending task secid when
    sending signals upon AIO completion so that proper security checking can be
    applied by security modules.

    Signed-off-by: David Quigley
    Signed-off-by: James Morris
    Cc: Stephen Smalley
    Cc: Chris Wright
    Signed-off-by: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Quigley
     
  • This patch adds a call to the extended security_task_kill hook introduced by
    the prior patch to the kill_proc_info_as_uid function so that these signals
    can be properly mediated by security modules. It also updates the existing
    hook call in check_kill_permission.

    Signed-off-by: David Quigley
    Signed-off-by: James Morris
    Cc: Stephen Smalley
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Quigley
     
  • This patch extends the security_task_kill hook to handle signals sent by AIO
    completion. In this case, the secid of the task responsible for the signal
    needs to be obtained and saved earlier, so a security_task_getsecid() hook is
    added, and then this saved value is passed subsequently to the extended
    task_kill hook for use in checking.
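
    Sketched usage of the extended hook, with the secid having been saved
    earlier via the new security_task_getsecid() hook; the wrapper function
    is hypothetical and the exact signatures are assumed:

        static int kill_with_saved_secid(struct task_struct *p,
                                         struct siginfo *info, int sig,
                                         u32 secid /* saved at AIO submit */)
        {
                int ret;

                ret = security_task_kill(p, info, sig, secid);
                if (ret)
                        return ret;     /* refused by a security module */
                return send_sig_info(sig, info, p);
        }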

    Signed-off-by: David Quigley
    Signed-off-by: James Morris
    Cc: Stephen Smalley
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Quigley
     
  • Post and discussion:
    http://marc.theaimsgroup.com/?t=115074342800003&r=1&w=2

    Code in __node_shrink() duplicates code in cache_reap().

    Add a new function, drain_freelist(), that removes slabs whose objects are
    all free, and use it in various places.

    This eliminates the __node_shrink() function and provides the interrupt
    holdoff reduction from slab_free to code that used to call __node_shrink.
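
    The shape of drain_freelist(), sketched against the 2.6-era slab data
    structures; the details are assumptions, not the exact patch:

        static int drain_freelist(struct kmem_cache *cache,
                                  struct kmem_list3 *l3, int tofree)
        {
                struct list_head *p;
                int nr_freed = 0;

                while (nr_freed < tofree && !list_empty(&l3->slabs_free)) {
                        spin_lock_irq(&l3->list_lock);
                        p = l3->slabs_free.prev;
                        if (p == &l3->slabs_free) {   /* raced: list emptied */
                                spin_unlock_irq(&l3->list_lock);
                                break;
                        }
                        list_del(p);
                        l3->free_objects -= cache->num;
                        /* drop the lock per slab to keep irq holdoff short */
                        spin_unlock_irq(&l3->list_lock);
                        slab_destroy(cache, list_entry(p, struct slab, list));
                        nr_freed++;
                }
                return nr_freed;
        }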

    [akpm@osdl.org: build fixes]
    Signed-off-by: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The remaining counters in page_state after the zoned VM counter patches
    have been applied are all just for show in /proc/vmstat. They have no
    essential function for the VM.

    We use a simple increment of per cpu variables. In order to avoid the most
    severe races we disable preempt. Preempt does not prevent the race between
    an increment and an interrupt handler incrementing the same statistics
    counter. However, that race is exceedingly rare; we may lose an increment
    or so, and there is no requirement (at least not in the kernel) that the
    vm event counters have to be accurate.

    In the non preempt case this results in a simple increment for each
    counter. For many architectures this will be reduced by the compiler to a
    single instruction. This single instruction is atomic for i386 and x86_64.
    And therefore even the rare race condition in an interrupt is avoided for
    both architectures in most cases.
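
    The increment path, sketched from the description above (the type and
    variable names are assumptions):

        struct vm_event_state {
                unsigned long event[NR_VM_EVENT_ITEMS];
        };

        DEFINE_PER_CPU(struct vm_event_state, vm_event_states);

        static inline void count_vm_event(enum vm_event_item item)
        {
                preempt_disable();  /* no irq disable: rare races accepted */
                __get_cpu_var(vm_event_states).event[item]++;
                preempt_enable();
        }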

    The patchset also adds an off switch for embedded systems that allows
    building Linux kernels without these counters.

    The implementation of these counters is through inline code that hopefully
    results in only a single increment instruction being emitted (i386, x86_64)
    or in the increment being hidden through instruction-level concurrency
    (EPIC architectures such as ia64 can get that done).

    Benefits:
    - VM event counter operations usually reduce to a single inline instruction
    on i386 and x86_64.
    - No interrupt disable, only preempt disable for the preempt case.
    Preempt disable can also be avoided by moving the counter into a spinlock.
    - Handling is similar to zoned VM counters.
    - Simple and easily extendable.
    - Can be omitted to reduce memory use for embedded use.

    References:

    RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=113512330605497&w=2
    RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=114988082814934&w=2
    local_t http://marc.theaimsgroup.com/?l=linux-kernel&m=114991748606690&w=2
    V2 http://marc.theaimsgroup.com/?t=115014808400007&r=1&w=2
    V3 http://marc.theaimsgroup.com/?l=linux-kernel&m=115024767022346&w=2
    V4 http://marc.theaimsgroup.com/?l=linux-kernel&m=115047968808926&w=2

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The numa statistics are really event counters. But they are per node and
    so we have had special treatment for these counters through additional
    fields on the pcp structure. We can now use the per zone nature of the
    zoned VM counters to realize these.

    This will shrink the size of the pcp structure on NUMA systems. We will
    have some room to add additional per zone counters that will all still fit
    in the same cacheline.

    Bits   Prior pcp size         Size after patch       We can add
    ----   --------------------   --------------------   -----------------------
    64     128 bytes (16 words)   80 bytes (10 words)    48 bytes
    32      76 bytes (19 words)   56 bytes (14 words)     8 bytes (64 byte cacheline)
                                                          72 bytes (128 byte cacheline)

    Remove the special statistics for numa and replace them with zoned vm
    counters. This has the side effect that global sums of these events now
    show up in /proc/vmstat.

    Also take the opportunity to move the zone_statistics() function from
    page_alloc.c into vmstat.c.

    Discussions:
    V2 http://marc.theaimsgroup.com/?t=115048227000002&r=1&w=2

    Signed-off-by: Christoph Lameter
    Acked-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • No callers.

    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Remove writeback state

    We can remove some functions now that were needed to calculate the page state
    for writeback control since these statistics are now directly available.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Conversion of nr_bounce to a per zone counter

    nr_bounce is only used for proc output. So it could be left as an event
    counter. However, the event counters may not be accurate and nr_bounce is
    categorizing types of pages in a zone. So we really need this to also be a
    per zone counter.

    [akpm@osdl.org: bugfix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Conversion of nr_unstable to a per zone counter

    We need to do some special modifications to the nfs code since there are
    multiple cases of disposition and we need to have a page ref for proper
    accounting.

    This converts the last critical page state of the VM and therefore we need to
    remove several functions that were depending on GET_PAGE_STATE_LAST in order
    to make the kernel compile again. We are only left with event type counters
    in page state.

    [akpm@osdl.org: bugfixes]
    Signed-off-by: Christoph Lameter
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Conversion of nr_writeback to per zone counter.

    This removes the last page_state counter from arch/i386/mm/pgtable.c so we
    drop the page_state from there.

    [akpm@osdl.org: bugfix]
    Signed-off-by: Christoph Lameter
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This makes nr_dirty a per zone counter. Looping over all processors is
    avoided during writeback state determination.

    The counter aggregation for nr_dirty had to be undone in the NFS layer since
    we summed up the page counts from multiple zones. Someone more familiar with
    NFS should probably review what I have done.

    [akpm@osdl.org: bugfix]
    Signed-off-by: Christoph Lameter
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Conversion of nr_page_table_pages to a per zone counter

    [akpm@osdl.org: bugfix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • - Allows reclaim to access the counter without looping over per-processor
    counts.

    - Allows accurate statistics on how many pages are used in a zone by
    the slab. This may become useful to balance slab allocations over
    various zones.

    [akpm@osdl.org: bugfix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The zone_reclaim_interval was necessary because we were not able to determine
    how many unmapped pages exist in a zone. Therefore we had to scan in
    intervals to figure out if any pages were unmapped.

    With the zoned counters and NR_ANON_PAGES we now know the number of pagecache
    pages and the number of mapped pages in a zone. So we can simply skip the
    reclaim if there is an insufficient number of unmapped pages. We use
    SWAP_CLUSTER_MAX as the boundary.
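
    The check, sketched with the counters named above; its exact placement
    in zone_reclaim() is assumed:

        /* Skip reclaim when the zone has too few unmapped pagecache pages
         * to be worth it. */
        if (zone_page_state(zone, NR_FILE_PAGES) -
            zone_page_state(zone, NR_FILE_MAPPED) <= SWAP_CLUSTER_MAX)
                return 0;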

    Drop all support for /proc/sys/vm/zone_reclaim_interval.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The current NR_FILE_MAPPED is used by zone reclaim and the dirty load
    calculation as the number of mapped pagecache pages. However, that is not
    true. NR_FILE_MAPPED includes the mapped anonymous pages. This patch
    separates those and therefore allows an accurate tracking of the anonymous
    pages per zone.

    It then becomes possible to determine the number of unmapped pages per zone
    and we can avoid scanning for unmapped pages if there are none.

    Also it may now be possible to determine the mapped/unmapped ratio in
    get_dirty_limit. Isn't the number of anonymous pages irrelevant in that
    calculation?

    Note that this will change the meaning of the number of mapped pages
    reported in /proc/vmstat, /proc/meminfo, and in the per node statistics.
    This may affect user space tools that monitor these counters!
    NR_FILE_MAPPED works like NR_FILE_DIRTY: it is only valid for pagecache
    pages.

    Signed-off-by: Christoph Lameter
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • We can now access the number of pages in a mapped state in an inexpensive way
    in shrink_active_list. So drop the nr_mapped field from scan_control.

    [akpm@osdl.org: bugfix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Currently a single atomic variable is used to establish the size of the page
    cache in the whole machine. The zoned VM counters have the same method of
    implementation as the nr_pagecache code but also allow the determination of
    the pagecache size per zone.

    Remove the special implementation for nr_pagecache and make it a zoned counter
    named NR_FILE_PAGES.

    Updates of the page cache counters are always performed with interrupts off.
    We can therefore use the __ variant here.
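
    For instance, sketched against the 2.6-era mapping->tree_lock (the call
    site shown is illustrative):

        write_lock_irq(&mapping->tree_lock);
        /* ... insert the page into the radix tree ... */
        __inc_zone_page_state(page, NR_FILE_PAGES); /* irqs off: __ is safe */
        write_unlock_irq(&mapping->tree_lock);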

    Signed-off-by: Christoph Lameter
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • nr_mapped is important because it allows a determination of how many pages of
    a zone are not mapped, which would allow a more efficient means of determining
    when we need to reclaim memory in a zone.

    We take the nr_mapped field out of the page state structure and define a new
    per zone counter named NR_FILE_MAPPED (the anonymous pages will be split off
    from NR_MAPPED in the next patch).

    We replace the use of nr_mapped in various kernel locations. This avoids the
    looping over all processors in try_to_free_pages(), writeback, reclaim (swap +
    zone reclaim).

    [akpm@osdl.org: bugfix]
    Signed-off-by: Christoph Lameter
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Per zone counter infrastructure

    The counters that we currently have for the VM are split per processor.
    The processor however has not much to do with the zone these pages belong
    to. We cannot tell, for example, how many ZONE_DMA pages are dirty.

    So we are blind to potential imbalances in the usage of memory in various
    zones. For example, in a NUMA system we cannot tell how many pages are
    dirty on a particular node. If we knew then we could put measures into
    the VM to balance the use of memory between different zones and different
    nodes in a NUMA system. For example it would be possible to limit the
    dirty pages per node so that fast local memory is kept available even if
    a process is dirtying huge amounts of pages.

    Another example is zone reclaim. We do not know how many unmapped pages exist
    per zone. So we just have to try to reclaim. If it is not working then we
    pause and try again later. It would be better if we knew when it makes sense
    to reclaim unmapped pages from a zone. This patchset allows the determination
    of the number of unmapped pages per zone. We can remove the zone reclaim
    interval with the counters introduced here.

    Furthermore, the ability to have various usage statistics available will allow
    the development of new NUMA balancing algorithms that may be able to improve
    the decision making in the scheduler of when to move a process to another node
    and hopefully will also enable automatic page migration through a user space
    program that can analyse the memory load distribution and then rebalance
    memory use in order to increase performance.

    The counter framework here implements differential counters for each processor
    in struct zone. The differential counters are consolidated when a threshold
    is exceeded (as done in the current implementation for nr_pagecache), when
    slab reaping occurs or when a consolidation function is called.

    Consolidation uses atomic operations and accumulates counters per zone in the
    zone structure and also globally in the vm_stat array. VM functions can
    access the counts by simply indexing a global or zone specific array.

    The arrangement of counters in an array also simplifies processing when output
    has to be generated for /proc/*.

    Counters can be updated by calling inc/dec_zone_page_state or
    __inc/dec_zone_page_state analogous to *_page_state. The second group of
    functions can be called if it is known that interrupts are disabled.
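
    The fold-on-threshold update, sketched from the description; the per-cpu
    field and threshold names are assumptions:

        void __inc_zone_page_state(struct page *page, enum zone_stat_item item)
        {
                struct zone *zone = page_zone(page);
                s8 *p = &zone_pcp(zone, smp_processor_id())->vm_stat_diff[item];

                (*p)++;
                if (unlikely(*p > STAT_THRESHOLD)) {
                        /* fold the per-cpu delta into zone + global counts */
                        zone_page_state_add(*p, zone, item);
                        *p = 0;
                }
        }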

    Special optimized increment and decrement functions are provided. These can
    avoid certain checks and use increment or decrement instructions that an
    architecture may provide.

    We also add a new CONFIG_DMA_IS_NORMAL that signifies that an architecture can
    do DMA to all memory and therefore ZONE_NORMAL will not be populated. This is
    only currently set for IA64 SGI SN2 and currently only affects
    node_page_state(). In the best case node_page_state can be reduced to
    retrieving a single counter for the one zone on the node.

    [akpm@osdl.org: cleanups]
    [akpm@osdl.org: export vm_stat[] for filesystems]
    Signed-off-by: Christoph Lameter
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter