12 Feb, 2007

7 commits

  • Make ZONE_DMA optional in core code.

    - ifdef all code for ZONE_DMA and related definitions following the example
    for ZONE_DMA32 and ZONE_HIGHMEM.

    - Without ZONE_DMA, ZONE_HIGHMEM and ZONE_DMA32 we get to a ZONES_SHIFT of
    0.

    - Modify the VM statistics to work correctly without a DMA zone.

    - Modify slab to not create DMA slabs if there is no ZONE_DMA.
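
    A minimal sketch of the conditional-zone pattern being followed (illustrative
    only; the real enum zone_type in include/linux/mmzone.h carries more detail):

        enum zone_type {
        #ifdef CONFIG_ZONE_DMA
                ZONE_DMA,       /* only when the arch selects CONFIG_ZONE_DMA */
        #endif
        #ifdef CONFIG_ZONE_DMA32
                ZONE_DMA32,
        #endif
                ZONE_NORMAL,
        #ifdef CONFIG_HIGHMEM
                ZONE_HIGHMEM,
        #endif
                MAX_NR_ZONES
        };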

    [akpm@osdl.org: cleanup]
    [jdike@addtoit.com: build fix]
    [apw@shadowen.org: Simplify calculation of the number of bits we need for ZONES_SHIFT]
    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Cc: "Luck, Tony"
    Cc: Kyle McMartin
    Cc: Matthew Wilcox
    Cc: James Bottomley
    Cc: Paul Mundt
    Signed-off-by: Andy Whitcroft
    Signed-off-by: Jeff Dike
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Values are available via ZVC sums.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Values are readily available via ZVC per node and global sums.
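
    For illustration, a sketch of how callers read the consolidated sums (the
    item name NR_FREE_PAGES comes from this series; the wrapper functions here
    are made up):

        static unsigned long machine_free_pages(void)
        {
                return global_page_state(NR_FREE_PAGES);        /* whole machine */
        }

        static unsigned long node_free_pages(int nid)
        {
                return node_page_state(nid, NR_FREE_PAGES);     /* one node */
        }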

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The global and per zone counter sums are kept in arrays of longs. Reorder the
    ZVCs so that the most frequently used ZVCs are put into the same cacheline.
    That way calculations of the global, node and per zone vm state touch only a
    single cacheline. This is mostly important for 64 bit systems, where one
    128 byte cacheline holds only 16 longs.
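
    Roughly what the reordering looks like (an abridged sketch of enum
    zone_stat_item; the NUMA event items and the exact cacheline boundaries are
    omitted):

        enum zone_stat_item {
                /* items touched together on the hot paths come first so that
                 * their long-sized sums share a cacheline */
                NR_FREE_PAGES,
                NR_INACTIVE,
                NR_ACTIVE,
                NR_ANON_PAGES,
                NR_FILE_MAPPED,
                NR_FILE_PAGES,
                /* less frequently updated items follow */
                NR_SLAB_RECLAIMABLE,
                NR_SLAB_UNRECLAIMABLE,
                NR_PAGETABLE,
                NR_FILE_DIRTY,
                NR_WRITEBACK,
                NR_UNSTABLE_NFS,
                NR_BOUNCE,
                NR_VMSCAN_WRITE,
                NR_VM_ZONE_STAT_ITEMS
        };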

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This again simplifies some of the VM counter calculations through the use
    of the ZVC consolidated counters.

    [michal.k.k.piotrowski@gmail.com: build fix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Michal Piotrowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The determination of the dirty ratio to determine writeback behavior is
    currently based on the number of total pages on the system.

    However, not all pages in the system may be dirtied. Thus the ratio is always
    too low and can never reach 100%. The ratio may be particularly skewed if
    large hugepage allocations, slab allocations or device driver buffers make
    large sections of memory unavailable. In that case we may get into a
    situation in which e.g. the background writeback ratio of 40% cannot be
    reached anymore, which leads to undesired writeback behavior.

    This patchset fixes that issue by determining the ratio based on the actual
    pages that may potentially be dirty. These are the pages on the active and
    the inactive list plus free pages.

    The problem with those counts has so far been that they are expensive to
    calculate, because counts from multiple nodes and multiple zones have to be
    summed up. This patchset makes these counters ZVC counters. This means that a
    current sum per zone, per node and for the whole system is always available
    via global variables and is no longer expensive to calculate.

    The patchset results in some other good side effects:

    - Removal of the various functions that sum up free, active and inactive
    page counts

    - Cleanup of the functions that display information via the proc filesystem.

    This patch:

    The use of a ZVC for nr_inactive and nr_active allows a simplification of some
    counter operations. More ZVC functionality is used for sums etc in the
    following patches.
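
    A minimal sketch of the idea (the helper name and item names follow this
    series but are shown for illustration only):

        /* pages that could potentially be dirtied: free + active + inactive */
        static unsigned long determine_dirtyable_memory(void)
        {
                return global_page_state(NR_FREE_PAGES) +
                       global_page_state(NR_ACTIVE) +
                       global_page_state(NR_INACTIVE);
        }

    The dirty thresholds are then computed as a percentage of this total rather
    than of all memory in the system.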

    [akpm@osdl.org: UP build fix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This early break prevents us from displaying info for the vm stats thresholds
    if the zone doesn't have any pages in its per-cpu pagesets.

    So my 800MB i386 box says:

    Node 0, zone      DMA
      pages free     2365
            min      16
            low      20
            high     24
            active   0
            inactive 0
            scanned  0 (a: 0 i: 0)
            spanned  4096
            present  4044
        nr_anon_pages 0
        nr_mapped    1
        nr_file_pages 0
        nr_slab_reclaimable 0
        nr_slab_unreclaimable 0
        nr_page_table_pages 0
        nr_dirty     0
        nr_writeback 0
        nr_unstable  0
        nr_bounce    0
        nr_vmscan_write 0
            protection: (0, 868, 868)
      pagesets
      all_unreclaimable: 0
      prev_priority:     12
      start_pfn:         0
    Node 0, zone   Normal
      pages free     199713
            min      934
            low      1167
            high     1401
            active   10215
            inactive 4507
            scanned  0 (a: 0 i: 0)
            spanned  225280
            present  222420
        nr_anon_pages 2685
        nr_mapped    1110
        nr_file_pages 12055
        nr_slab_reclaimable 2216
        nr_slab_unreclaimable 1527
        nr_page_table_pages 213
        nr_dirty     0
        nr_writeback 0
        nr_unstable  0
        nr_bounce    0
        nr_vmscan_write 0
            protection: (0, 0, 0)
      pagesets
        cpu: 0 pcp: 0
                  count: 152
                  high:  186
                  batch: 31
        cpu: 0 pcp: 1
                  count: 13
                  high:  62
                  batch: 15
        vm stats threshold: 16
        cpu: 1 pcp: 0
                  count: 34
                  high:  186
                  batch: 31
        cpu: 1 pcp: 1
                  count: 10
                  high:  62
                  batch: 15
        vm stats threshold: 16
      all_unreclaimable: 0
      prev_priority:     12
      start_pfn:         4096

    Just nuke all that search-for-the-first-non-empty-pageset code. Dunno why it
    was there in the first place..

    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

08 Dec, 2006

2 commits

  • - move some file_operations structs into the .rodata section

    - move static strings from policy_types[] array into the .rodata section

    - fix generic seq_operations usages, so that those structs may be defined
    as "const" as well
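
    An illustrative sketch of the pattern (the ops table shown is one example
    from mm/vmstat.c of that era, not the full patch):

        /* a const ops table can be placed in .rodata by the linker */
        static const struct seq_operations fragmentation_op = {
                .start  = frag_start,
                .next   = frag_next,
                .stop   = frag_stop,
                .show   = frag_show,
        };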

    [akpm@osdl.org: couple of fixes]
    Signed-off-by: Helge Deller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Helge Deller
     
  • These patches introduced new switch statements which are indented contrary
    to the consensus in mm/*.c. Fix them up to match that consensus.

    [PATCH] node local per-cpu-pages
    [PATCH] ZVC: Scale thresholds depending on the size of the system
    commit e7c8d5c9955a4d2e88e36b640563f5d6d5aba48a
    commit df9ecaba3f152d1ea79f2a5e0b87505e03f47590
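
    The style being enforced, sketched rather than quoted from the diff: case
    labels sit at the same indentation level as the switch itself, as elsewhere
    in mm/*.c (the callees below are hypothetical):

        switch (action) {
        case CPU_UP_PREPARE:
                setup_counters_for_cpu(cpu);    /* hypothetical */
                break;
        case CPU_DEAD:
                fold_counters_from_cpu(cpu);    /* hypothetical */
                break;
        default:
                break;
        }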

    Signed-off-by: Andy Whitcroft
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     

29 Oct, 2006

1 commit

  • The temp_priority field in zone is racy, as we can walk through a reclaim
    path, and just before we copy it into prev_priority, it can be overwritten
    (say with DEF_PRIORITY) by another reclaimer.

    The same bug is contained in both try_to_free_pages and balance_pgdat, but
    it is fixed slightly differently. In balance_pgdat, we keep a separate
    priority record per zone in a local array. In try_to_free_pages there is
    no need to do this, as the priority level is the same for all zones that we
    reclaim from.

    Impact of this bug is that temp_priority is copied into prev_priority, and
    setting this artificially high causes reclaimers to set distress
    artificially low. They then fail to reclaim mapped pages, when they are,
    in fact, under severe memory pressure (their priority may be as low as 0).
    This causes the OOM killer to fire incorrectly.

    From: Andrew Morton

    __zone_reclaim() isn't modifying zone->prev_priority. But zone->prev_priority
    is used in the decision whether or not to bring mapped pages onto the inactive
    list. Hence there's a risk here that __zone_reclaim() will fail because
    zone->prev_priority is large (i.e. low urgency) and lots of mapped pages end up
    stuck on the active list.

    Fix that up by decreasing (i.e. making more urgent) zone->prev_priority as
    __zone_reclaim() scans the zone's pages.

    This bug perhaps explains why ZONE_RECLAIM_PRIORITY was created. It should be
    possible to remove that now, and to just start out at DEF_PRIORITY?

    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Bligh
     

04 Oct, 2006

1 commit


27 Sep, 2006

2 commits

  • Now that we have the node in the hot zone of struct zone we can avoid
    accessing zone_pgdat in zone_statistics.
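
    Roughly the change in zone_statistics(), sketched for illustration (helper
    and item names follow the kernel of that era):

        /* before: learn the node by dereferencing the pgdat */
        if (z->zone_pgdat == NODE_DATA(numa_node_id()))
                __inc_zone_state(z, NUMA_LOCAL);

        /* after: use the node id cached in the hot part of struct zone */
        if (zone_to_nid(z) == numa_node_id())
                __inc_zone_state(z, NUMA_LOCAL);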

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The VM is supposed to minimise the number of pages which get written off the
    LRU (for IO scheduling efficiency, and for high reclaim-success rates). But
    we don't actually have a clear way of showing how true this is.

    So add `nr_vmscan_write' to /proc/vmstat and /proc/zoneinfo - the number of
    pages which have been written by the vm scanner in this zone and globally.
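
    A sketch of the accounting point (the item name is real; the surrounding
    vmscan code is elided):

        /* in the vmscan writeout path, once a page has been handed to ->writepage() */
        inc_zone_page_state(page, NR_VMSCAN_WRITE);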

    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

26 Sep, 2006

3 commits


02 Sep, 2006

2 commits

  • The ZVC counter update threshold is currently set to a fixed value of 32.
    This patch sets up the threshold depending on the number of processors and
    the sizes of the zones in the system.

    With the current threshold of 32, I was able to observe slight contention
    when more than 130-140 processors concurrently updated the counters. The
    contention vanished when I either increased the threshold to 64 or used
    Andrew's idea of overstepping the interval (see ZVC overstep patch).

    However, we saw contention again at 220-230 processors. So we need higher
    values for larger systems.

    But the current default is already a bit of an overkill for smaller
    systems. Some systems have tiny zones where precision matters. For
    example i386 and x86_64 have 16M DMA zones and either 900M ZONE_NORMAL or
    ZONE_DMA32. These are even present on SMP and NUMA systems.

    The patch here sets up a threshold based on the number of processors in the
    system and the size of the zone that these counters are used for. The
    threshold should grow logarithmically, so we use fls() as an easy
    approximation.
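
    A sketch of the sizing helper, close to what the patch adds to mm/vmstat.c
    (treat the exact constants as illustrative):

        static int calculate_threshold(struct zone *zone)
        {
                int mem;        /* zone memory in units of 128MB */
                int threshold;

                mem = zone->present_pages >> (27 - PAGE_SHIFT);

                /* grows roughly logarithmically with cpus and zone size */
                threshold = 2 * fls(num_online_cpus()) * (1 + fls(mem));

                /* cap it so a single cpu cannot hide too many pages */
                if (threshold > 125)
                        threshold = 125;

                return threshold;
        }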

    Results of tests on a system with 1024 processors (4TB RAM)

    The following output is from a test allocating 1GB of memory concurrently
    on each processor (Forking the process. So contention on mmap_sem and the
    pte locks is not a factor):

    TYPE  CPUS   WALL(X)  WALL(MIN)        SYS     USER    TOTCPU
    fork     1     0.552      0.552      0.540    0.012     0.552
    fork     4     0.552      0.548      2.164    0.036     2.200
    fork    16     0.564      0.548      8.812    0.164     8.976
    fork   128     0.580      0.572     72.204    1.208    73.412
    fork   256     1.300      0.660    310.400    2.160   312.560
    fork   512     3.512      0.696   1526.836    4.816  1531.652
    fork  1020    20.024      0.700  17243.176    6.688 17249.863

    So a threshold of 32 is fine up to 128 processors. At 256 processors contention
    becomes a factor.

    Overstepping the counter (earlier patch) improves the numbers a bit:

    fork     4     0.552      0.548      2.164    0.040     2.204
    fork    16     0.552      0.548      8.640    0.148     8.788
    fork   128     0.556      0.548     69.676    0.956    70.632
    fork   256     0.876      0.636    212.468    2.108   214.576
    fork   512     2.276      0.672    997.324    4.260  1001.584
    fork  1020    13.564      0.680  11586.436    6.088 11592.523

    Still contention at 512 and 1020. Contention at 1020 is down by a third.
    256 still has a slight bit of contention.

    After this patch the counter threshold will be set to 125 which reduces
    contention significantly:

    fork   128     0.560      0.548     69.776    0.932    70.708
    fork   256     0.636      0.556    143.460    2.036   145.496
    fork   512     0.640      0.548    284.244    4.236   288.480
    fork  1020     1.500      0.588   1326.152    8.892  1335.044

    [akpm@osdl.org: !SMP build fix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Increments and decrements are usually grouped rather than mixed. We can
    optimize the inc and dec functions for that case.

    Increment and decrement the counters by 50% more than the threshold in
    those cases and set the differential accordingly. This decreases the need
    to update the atomic counters.

    The idea came originally from Andrew Morton. The overstepping alone was
    sufficient to address the contention issue found when updating the global
    and the per zone counters from 160 processors.

    Also remove some code in dec_zone_page_state.
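
    A sketch of the overstep on the increment side (simplified; field names
    follow the ZVC code of that era):

        void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
        {
                struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
                s8 *p = pcp->vm_stat_diff + item;

                (*p)++;

                if (unlikely(*p > pcp->stat_threshold)) {
                        int overstep = pcp->stat_threshold / 2;

                        /* fold the differential plus half a threshold into the
                         * zone/global atomics and leave a negative remainder so
                         * that a run of further increments stays local longer */
                        zone_page_state_add(*p + overstep, zone, item);
                        *p = -overstep;
                }
        }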

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

11 Jul, 2006

1 commit

  • Add missing EXPORT_SYMBOL for all_vm_events(). Git commit
    f8891e5e1f93a128c3900f82035e8541357896a7 caused this:

    Building modules, stage 2.
    MODPOST
    WARNING: "all_vm_events" [arch/s390/appldata/appldata_mem.ko] undefined!
    CC arch/s390/appldata/appldata_mem.mod.o

    Cc: Christoph Lameter
    Cc: Gerald Schaefer
    Cc: Martin Schwidefsky
    Signed-off-by: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     

01 Jul, 2006

14 commits

  • The remaining counters in page_state after the zoned VM counter patches
    have been applied are all just for show in /proc/vmstat. They have no
    essential function for the VM.

    We use a simple increment of per cpu variables. In order to avoid the most
    severe races we disable preempt. Preempt does not prevent the race between
    an increment and an interrupt handler incrementing the same statistics
    counter. However, that race is exceedingly rare; we may only lose an
    increment or so, and there is no requirement (at least not in the kernel)
    that the vm event counters have to be accurate.

    In the non preempt case this results in a simple increment for each
    counter. For many architectures this will be reduced by the compiler to a
    single instruction. This single instruction is atomic for i386 and x86_64.
    And therefore even the rare race condition in an interrupt is avoided for
    both architectures in most cases.

    The patchset also adds an off switch for embedded systems that allows a
    building of linux kernels without these counters.

    The implementation of these counters is through inline code that hopefully
    results in only a single increment instruction being emitted (i386, x86_64)
    or in the increment being hidden through instruction concurrency (EPIC
    architectures such as ia64 can get that done).
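
    A sketch of the mechanism, condensed from the vmstat.h style of that era
    (struct layout abridged):

        struct vm_event_state {
                unsigned long event[NR_VM_EVENT_ITEMS];
        };

        DECLARE_PER_CPU(struct vm_event_state, vm_event_states);

        /* caller already has preemption (or interrupts) disabled */
        static inline void __count_vm_event(enum vm_event_item item)
        {
                __get_cpu_var(vm_event_states).event[item]++;
        }

        /* disables preemption only; the residual irq race is tolerated */
        static inline void count_vm_event(enum vm_event_item item)
        {
                get_cpu_var(vm_event_states).event[item]++;
                put_cpu();
        }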

    Benefits:
    - VM event counter operations usually reduce to a single inline instruction
    on i386 and x86_64.
    - No interrupt disable, only preempt disable for the preempt case.
    Preempt disable can also be avoided by moving the counter into a spinlock.
    - Handling is similar to zoned VM counters.
    - Simple and easily extendable.
    - Can be omitted to reduce memory use for embedded use.

    References:

    RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=113512330605497&w=2
    RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=114988082814934&w=2
    local_t http://marc.theaimsgroup.com/?l=linux-kernel&m=114991748606690&w=2
    V2 http://marc.theaimsgroup.com/?t=115014808400007&r=1&w=2
    V3 http://marc.theaimsgroup.com/?l=linux-kernel&m=115024767022346&w=2
    V4 http://marc.theaimsgroup.com/?l=linux-kernel&m=115047968808926&w=2

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The numa statistics are really event counters. But they are per node and
    so we have had special treatment for these counters through additional
    fields on the pcp structure. We can now use the per zone nature of the
    zoned VM counters to realize these.

    This will shrink the size of the pcp structure on NUMA systems. We will
    have some room to add additional per zone counters that will all still fit
    in the same cacheline.

    Bits   Prior pcp size          Size after patch        We can add
    ------------------------------------------------------------------
    64     128 bytes (16 words)    80 bytes (10 words)     48
    32      76 bytes (19 words)    56 bytes (14 words)      8 (64 byte cacheline)
                                                            72 (128 byte)

    Remove the special statistics for numa and replace them with zoned vm
    counters. This has the side effect that global sums of these events now
    show up in /proc/vmstat.

    Also take the opportunity to move the zone_statistics() function from
    page_alloc.c into vmstat.c.
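
    Roughly what the converted function looks like (a sketch; the old pcp
    fields become ordinary zone_stat_items, NUMA builds only):

        void zone_statistics(struct zonelist *zonelist, struct zone *z)
        {
                if (z->zone_pgdat == zonelist->zones[0]->zone_pgdat) {
                        __inc_zone_state(z, NUMA_HIT);
                } else {
                        __inc_zone_state(z, NUMA_MISS);
                        __inc_zone_state(zonelist->zones[0], NUMA_FOREIGN);
                }

                if (z->zone_pgdat == NODE_DATA(numa_node_id()))
                        __inc_zone_state(z, NUMA_LOCAL);
                else
                        __inc_zone_state(z, NUMA_OTHER);
        }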

    Discussions:
    V2 http://marc.theaimsgroup.com/?t=115048227000002&r=1&w=2

    Signed-off-by: Christoph Lameter
    Acked-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • No callers.

    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Conversion of nr_bounce to a per zone counter

    nr_bounce is only used for proc output. So it could be left as an event
    counter. However, the event counters may not be accurate and nr_bounce is
    categorizing types of pages in a zone. So we really need this to also be a
    per zone counter.

    [akpm@osdl.org: bugfix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Conversion of nr_unstable to a per zone counter

    We need to do some special modifications to the nfs code since there are
    multiple cases of disposition and we need to have a page ref for proper
    accounting.

    This converts the last critical page state of the VM and therefore we need to
    remove several functions that were depending on GET_PAGE_STATE_LAST in order
    to make the kernel compile again. We are only left with event type counters
    in page state.

    [akpm@osdl.org: bugfixes]
    Signed-off-by: Christoph Lameter
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Conversion of nr_writeback to per zone counter.

    This removes the last page_state counter from arch/i386/mm/pgtable.c so we
    drop the page_state from there.

    [akpm@osdl.org: bugfix]
    Signed-off-by: Christoph Lameter
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This makes nr_dirty a per zone counter. Looping over all processors is
    avoided during writeback state determination.

    The counter aggregation for nr_dirty had to be undone in the NFS layer since
    we summed up the page counts from multiple zones. Someone more familiar with
    NFS should probably review what I have done.
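
    The general shape of these conversions, sketched with made-up wrapper names
    (the real patch edits the existing call sites instead):

        /* writer: account a newly dirtied pagecache page to its zone */
        static void account_dirty_sketch(struct page *page)
        {
                __inc_zone_page_state(page, NR_FILE_DIRTY);
        }

        /* reader: writeback throttling reads the consolidated sum instead of
         * looping over every cpu's page_state */
        static unsigned long total_dirty_sketch(void)
        {
                return global_page_state(NR_FILE_DIRTY);
        }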

    [akpm@osdl.org: bugfix]
    Signed-off-by: Christoph Lameter
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Conversion of nr_page_table_pages to a per zone counter

    [akpm@osdl.org: bugfix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • - Allows reclaim to access counter without looping over processor counts.

    - Allows accurate statistics on how many pages are used in a zone by
    the slab. This may become useful to balance slab allocations over
    various zones.

    [akpm@osdl.org: bugfix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The current NR_FILE_MAPPED is used by zone reclaim and the dirty load
    calculation as the number of mapped pagecache pages. However, that is not
    true. NR_FILE_MAPPED includes the mapped anonymous pages. This patch
    separates those and therefore allows an accurate tracking of the anonymous
    pages per zone.

    It then becomes possible to determine the number of unmapped pages per zone
    and we can avoid scanning for unmapped pages if there are none.

    Also it may now be possible to determine the mapped/unmapped ratio in
    get_dirty_limit. Isn't the number of anonymous pages irrelevant in that
    calculation?

    Note that this will change the meaning of the number of mapped pages reported
    in /proc/vmstat /proc/meminfo and in the per node statistics. This may affect
    user space tools that monitor these counters! NR_FILE_MAPPED works like
    NR_FILE_DIRTY. It is only valid for pagecache pages.
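
    A sketch of where the split shows up in the rmap accounting (only the
    counter bumps are shown):

        void page_add_file_rmap(struct page *page)
        {
                if (atomic_inc_and_test(&page->_mapcount))
                        __inc_zone_page_state(page, NR_FILE_MAPPED);
        }

    The anonymous side accounts NR_ANON_PAGES instead, so mapped pagecache and
    mapped anonymous memory can be told apart per zone.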

    Signed-off-by: Christoph Lameter
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Currently a single atomic variable is used to establish the size of the page
    cache in the whole machine. The zoned VM counters have the same method of
    implementation as the nr_pagecache code but also allow the determination of
    the pagecache size per zone.

    Remove the special implementation for nr_pagecache and make it a zoned counter
    named NR_FILE_PAGES.

    Updates of the page cache counters are always performed with interrupts off.
    We can therefore use the __ variant here.

    Signed-off-by: Christoph Lameter
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • nr_mapped is important because it allows a determination of how many pages of
    a zone are not mapped, which would allow a more efficient means of determining
    when we need to reclaim memory in a zone.

    We take the nr_mapped field out of the page state structure and define a new
    per zone counter named NR_FILE_MAPPED (the anonymous pages will be split off
    from NR_MAPPED in the next patch).

    We replace the use of nr_mapped in various kernel locations. This avoids the
    looping over all processors in try_to_free_pages(), writeback, reclaim (swap +
    zone reclaim).

    [akpm@osdl.org: bugfix]
    Signed-off-by: Christoph Lameter
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Per zone counter infrastructure

    The counters that we currently have for the VM are split per processor. The
    processor, however, has not much to do with the zone these pages belong to.
    We cannot tell, for example, how many ZONE_DMA pages are dirty.

    So we are blind to potential imbalances in the usage of memory in various
    zones. For example, in a NUMA system we cannot tell how many pages are dirty on a
    particular node. If we knew then we could put measures into the VM to balance
    the use of memory between different zones and different nodes in a NUMA
    system. For example it would be possible to limit the dirty pages per node so
    that fast local memory is kept available even if a process is dirtying huge
    amounts of pages.

    Another example is zone reclaim. We do not know how many unmapped pages exist
    per zone. So we just have to try to reclaim. If it is not working then we
    pause and try again later. It would be better if we knew when it makes sense
    to reclaim unmapped pages from a zone. This patchset allows the determination
    of the number of unmapped pages per zone. We can remove the zone reclaim
    interval with the counters introduced here.

    Furthermore the ability to have various usage statistics available will allow
    the development of new NUMA balancing algorithms that may be able to improve
    the decision making in the scheduler of when to move a process to another node
    and hopefully will also enable automatic page migration through a user space
    program that can analyse the memory load distribution and then rebalance
    memory use in order to increase performance.

    The counter framework here implements differential counters for each processor
    in struct zone. The differential counters are consolidated when a threshold
    is exceeded (as done in the current implementation for nr_pagecache), when
    slab reaping occurs or when a consolidation function is called.

    Consolidation uses atomic operations and accumulates counters per zone in the
    zone structure and also globally in the vm_stat array. VM functions can
    access the counts by simply indexing a global or zone specific array.

    The arrangement of counters in an array also simplifies processing when output
    has to be generated for /proc/*.

    Counters can be updated by calling inc/dec_zone_page_state or
    __inc/__dec_zone_page_state, analogous to *_page_state. The second group of
    functions can be called if it is known that interrupts are disabled.

    Special optimized increment and decrement functions are provided. These can
    avoid certain checks and use increment or decrement instructions that an
    architecture may provide.

    We also add a new CONFIG_DMA_IS_NORMAL that signifies that an architecture can
    do DMA to all memory and therefore ZONE_NORMAL will not be populated. This is
    currently only set for IA64 SGI SN2 and only affects node_page_state(). In the
    best case node_page_state() can be reduced to retrieving a single counter for
    the one zone on the node.
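
    A condensed sketch of the resulting layout and read side (SMP-only clamping
    and the NUMA variants are omitted):

        /* global sums, indexed by zone_stat_item */
        extern atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];

        static inline unsigned long global_page_state(enum zone_stat_item item)
        {
                long x = atomic_long_read(&vm_stat[item]);

                return x < 0 ? 0 : x;   /* transient negatives are possible */
        }

        static inline unsigned long zone_page_state(struct zone *zone,
                                                    enum zone_stat_item item)
        {
                long x = atomic_long_read(&zone->vm_stat[item]);

                return x < 0 ? 0 : x;
        }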

    [akpm@osdl.org: cleanups]
    [akpm@osdl.org: export vm_stat[] for filesystems]
    Signed-off-by: Christoph Lameter
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • NOTE: ZVC are *not* the lightweight event counters. ZVCs are reliable whereas
    event counters do not need to be.

    Zone based VM statistics are necessary to be able to determine what the state
    of memory in one zone is. In a NUMA system this can be helpful for local
    reclaim and other memory optimizations that may be able to shift VM load in
    order to get more balanced memory use.

    It is also useful to know how the computing load affects the memory
    allocations on various zones. This patchset allows the retrieval of that data
    from userspace.

    The patchset introduces a framework for counters that is a cross between the
    existing page_stats --which are simply global counters split per cpu-- and the
    approach of deferred incremental updates implemented for nr_pagecache.

    Small per cpu 8 bit counters are added to struct zone. If the counter exceeds
    certain thresholds then the counters are accumulated in an array of
    atomic_long in the zone and in a global array that sums up all zone values.
    The small 8 bit counters are next to the per cpu page pointers and so they
    will be hot in the cpu cache when pages are allocated and freed.

    Access to VM counter information for a zone and for the whole machine is then
    possible by simply indexing an array (thanks to Nick Piggin for pointing out
    that approach). Access to the total number of pages of the various types no
    longer requires summing up all the per cpu counters.
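
    A sketch of the per cpu side (abridged from struct per_cpu_pageset; only
    the fields relevant here are shown):

        struct per_cpu_pageset {
                struct per_cpu_pages pcp[2];    /* 0: hot, 1: cold */
        #ifdef CONFIG_SMP
                /* small signed differentials, folded into the zone and global
                 * atomic_long_t sums once a threshold is crossed */
                s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
        #endif
        } ____cacheline_aligned_in_smp;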

    Benefits of this patchset right now:

    - Ability for UP and SMP configuration to determine how memory
    is balanced between the DMA, NORMAL and HIGHMEM zones.

    - loops over all processors are avoided in writeback and
    reclaim paths. We can avoid caching the writeback information
    because the needed information is directly accessible.

    - Special handling for nr_pagecache removed.

    - zone_reclaim_interval vanishes since VM stats can now determine
    when it is worth to do local reclaim.

    - Fast inline per node page state determination.

    - Accurate counters in /sys/devices/system/node/node*/meminfo. The current
    counters simply count which processor allocated a page somewhere and
    guesstimate based on that, so they were not useful to show the actual
    distribution of page use on a specific zone.

    - The swap_prefetch patch requires per node statistics in order to
    figure out when processors of a node can prefetch. This patch provides
    some of the needed numbers.

    - Detailed VM counters available in more /proc and /sys status files.

    References to earlier discussions:
    V1 http://marc.theaimsgroup.com/?l=linux-kernel&m=113511649910826&w=2
    V2 http://marc.theaimsgroup.com/?l=linux-kernel&m=114980851924230&w=2
    V3 http://marc.theaimsgroup.com/?l=linux-kernel&m=115014697910351&w=2
    V4 http://marc.theaimsgroup.com/?l=linux-kernel&m=115024767318740&w=2

    Performance tests with AIM7 did not show any regressions. Seems to be a tad
    faster even. Tested on ia64/NUMA. Builds fine on i386, SMP / UP. Includes
    fixes for s390/arm/uml arch code.

    This patch:

    Move counter code from page_alloc.c/page-flags.h to vmstat.c/h.

    Create vmstat.c/vmstat.h by separating the counter code and the proc
    functions.

    Move the vm_stat_text array before zoneinfo_show.

    [akpm@osdl.org: s390 build fix]
    [akpm@osdl.org: HOTPLUG_CPU build fix]
    Signed-off-by: Christoph Lameter
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter