25 May, 2010

2 commits

  • Ordinarily when a high-order allocation fails, direct reclaim is entered
    to free pages to satisfy the allocation. With this patch, it is
    determined if an allocation failed due to external fragmentation instead
    of low memory and if so, the calling process will compact until a suitable
    page is freed. Compaction by moving pages in memory is considerably
    cheaper than paging out to disk and works where there are locked pages or
    no swap. If compaction fails to free a page of a suitable size, then
    reclaim will still occur.

    Direct compaction returns as soon as possible. As each block is
    compacted, it is checked if a suitable page has been freed and if so, it
    returns.
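
    As a rough illustration of the distinction drawn above, the sketch below
    decides whether a high-order request failed because memory is fragmented
    rather than short. The per-order free-block array and the function name are
    assumptions made for this example; the kernel's own test is more involved
    than a simple page count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * nr_free[o] is the number of free blocks of 2^o pages (hypothetical layout).
 * Returns true when enough pages are free in total but no single free block
 * is large enough, i.e. the failure looks like external fragmentation.
 */
bool failed_due_to_fragmentation(const unsigned long *nr_free,
                                 size_t nr_orders, size_t order)
{
    unsigned long total_pages = 0;

    assert(nr_free != NULL && order < nr_orders);

    for (size_t o = 0; o < nr_orders; o++) {
        /* A free block at or above the requested order would satisfy it. */
        if (o >= order && nr_free[o] > 0)
            return false;
        total_pages += nr_free[o] << o;
    }
    /* Enough memory overall, just not contiguous: compaction may help. */
    return total_pages >= (1UL << order);
}
```

    When this returns true, compacting is worth trying before reclaim; when it
    returns false because memory is simply low, reclaim is the only option.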

    [akpm@linux-foundation.org: Fix build errors]
    [aarcange@redhat.com: fix count_vm_event preempt in memory compaction direct reclaim]
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This patch is the core of a mechanism which compacts memory in a zone by
    relocating movable pages towards the end of the zone.

    A single compaction run involves a migration scanner and a free scanner.
    Both scanners operate on pageblock-sized areas in the zone. The migration
    scanner starts at the bottom of the zone and searches for all movable
    pages within each area, isolating them onto a private list called
    migratelist. The free scanner starts at the top of the zone and searches
    for suitable areas and consumes the free pages within, making them
    available for the migration scanner. The pages isolated for migration are
    then migrated to the newly isolated free pages.
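
    The two-scanner idea can be sketched in userspace as follows. This is a
    simplification under stated assumptions: it works on single pages rather
    than pageblock-sized areas, and the zone is modelled as a character array
    where 'M' is a movable page, '.' is a free page and any other character is
    an unmovable page. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Moves movable pages towards the end of the zone; returns pages migrated. */
size_t compact_zone_sketch(char *zone, size_t nr_pages)
{
    size_t migrate_pos = 0;      /* migration scanner, starts at the bottom */
    size_t free_pos = nr_pages;  /* free scanner, starts at the top (exclusive) */
    size_t migrated = 0;

    assert(zone != NULL);

    for (;;) {
        /* Migration scanner: find the next movable page from the bottom. */
        while (migrate_pos < free_pos && zone[migrate_pos] != 'M')
            migrate_pos++;
        /* Free scanner: find the next free page from the top. */
        while (free_pos > migrate_pos && zone[free_pos - 1] != '.')
            free_pos--;
        /* The run is over once the scanners meet. */
        if (migrate_pos >= free_pos)
            break;
        /* "Migrate" the page into the free slot near the end of the zone. */
        zone[free_pos - 1] = 'M';
        zone[migrate_pos] = '.';
        migrated++;
        migrate_pos++;
        free_pos--;
    }
    return migrated;
}
```

    Starting from "M.MU.M..", the sketch leaves "...U.MMM": free pages gather
    at the bottom of the zone, and the unmovable page stays where it was.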

    [aarcange@redhat.com: Fix unsafe optimisation]
    [mel@csn.ul.ie: do not schedule work on other CPUs for compaction]
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

05 Jan, 2010

1 commit


16 Dec, 2009

2 commits

  • If reclaim fails to make sufficient progress, the priority is raised.
    Once the priority is higher, kswapd starts waiting on congestion.
    However, if the zone is below the min watermark then kswapd needs to
    continue working without delay as there is a danger of an increased rate
    of GFP_ATOMIC allocation failure.

    This patch changes the conditions under which kswapd waits on congestion
    by only going to sleep if the min watermarks are being met.
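
    A minimal sketch of the changed condition, assuming a simplified zone
    structure with only a free page count and a min watermark. The structure,
    the function name and the "priority_raised" flag are illustrative and are
    not taken from the kernel source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced view of a zone for this example. */
struct zone_sketch {
    unsigned long free_pages;
    unsigned long min_wmark;
};

/*
 * kswapd only waits on congestion once the priority has been raised, and
 * then only if every zone it looks after still meets its min watermark.
 */
bool kswapd_may_wait_on_congestion(const struct zone_sketch *zones,
                                   size_t nr_zones, bool priority_raised)
{
    assert(zones != NULL || nr_zones == 0);

    if (!priority_raised)
        return false;

    for (size_t i = 0; i < nr_zones; i++) {
        /* Below min: keep reclaiming without delay. */
        if (zones[i].free_pages < zones[i].min_wmark)
            return false;
    }
    return true;
}
```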

    [mel@csn.ul.ie: add stats to track how relevant the logic is]
    [mel@csn.ul.ie: make kswapd only check its own zones and rename the relevant counters]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • After kswapd balances all zones in a pgdat, it goes to sleep. In the
    event of no IO congestion, kswapd can go to sleep very shortly after the
    high watermark was reached. If there is a constant stream of allocations
    from parallel processes, it can mean that kswapd went to sleep too quickly
    and the high watermark is not being maintained for a sufficient length of
    time.

    This patch makes kswapd go to sleep as a two-stage process. It first
    tries to sleep for HZ/10. If it is woken up by another process or the
    high watermark is no longer met, it's considered a premature sleep and
    kswapd continues work. Otherwise it goes fully to sleep.

    This adds more counters to distinguish between fast and slow breaches of
    watermarks. A "fast" premature sleep is one where the low watermark was
    hit in a very short time after kswapd going to sleep. A "slow" premature
    sleep indicates that the high watermark was breached after a very short
    interval.
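
    The two-stage decision can be sketched as a small classifier. It assumes
    that an early wakeup shows up as unexpired time remaining from the short
    nap and that this corresponds to the "fast" case, while a nap that ran to
    completion with the high watermark no longer met is the "slow" case. The
    enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum nap_result {
    SLEEP_FULLY,     /* nap was undisturbed and watermarks still hold */
    PREMATURE_FAST,  /* woken by another process before the nap expired */
    PREMATURE_SLOW   /* nap expired but the high watermark is not met */
};

/* remaining_jiffies: time left from the short (HZ/10) nap, 0 if it expired. */
enum nap_result classify_short_nap(long remaining_jiffies, bool high_wmark_met)
{
    assert(remaining_jiffies >= 0);

    if (remaining_jiffies > 0)
        return PREMATURE_FAST;
    if (!high_wmark_met)
        return PREMATURE_SLOW;
    return SLEEP_FULLY;
}
```

    Only SLEEP_FULLY leads to the full sleep; both premature results send
    kswapd back to work.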

    Signed-off-by: Mel Gorman
    Cc: Frans Pop
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

29 Oct, 2009

1 commit

  • Now that the return from alloc_percpu is compatible with the address
    of per-cpu vars, it makes sense to hand around the address of per-cpu
    variables. To make this sane, we remove the per_cpu__ prefix we
    created to stop people accidentally using these vars directly.

    Now we have sparse, we can use that (next patch).

    tj: * Updated to convert stuff which were missed by or added after the
    original patch.

    * Kill per_cpu_var() macro.

    Signed-off-by: Rusty Russell
    Signed-off-by: Tejun Heo
    Reviewed-by: Christoph Lameter

    Rusty Russell
     

03 Oct, 2009

1 commit


22 Sep, 2009

2 commits

  • global_lru_pages() / zone_lru_pages() can be used in two ways:
    - to estimate max reclaimable pages in determine_dirtyable_memory()
    - to calculate the slab scan ratio

    When swap is full or not present, the anon lru lists are not reclaimable
    and also won't be scanned, so the anon pages should not be counted in
    either usage scenario. Also rename the functions to _reclaimable_pages,
    since they now count only the possibly reclaimable lru pages.

    This can greatly (and correctly) increase the slab scan rate under high
    memory pressure (when most file pages have been reclaimed and swap is
    full/absent), thus reducing false OOM kills.
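
    The counting rule described above can be sketched as follows. The
    structure and function name are assumptions made for this example; the
    only logic shown is that anon pages count as reclaimable only when swap
    space is available.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-zone (or global) LRU page counts. */
struct lru_counts {
    unsigned long active_anon;
    unsigned long inactive_anon;
    unsigned long active_file;
    unsigned long inactive_file;
};

/* nr_swap_pages: free swap slots; zero when swap is full or absent. */
unsigned long reclaimable_pages_sketch(const struct lru_counts *lru,
                                       long nr_swap_pages)
{
    unsigned long nr;

    assert(lru != NULL);

    /* File-backed pages can always be written back or dropped. */
    nr = lru->active_file + lru->inactive_file;

    /* Anon pages are only reclaimable if there is somewhere to swap to. */
    if (nr_swap_pages > 0)
        nr += lru->active_anon + lru->inactive_anon;

    return nr;
}
```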

    Acked-by: Peter Zijlstra
    Reviewed-by: Rik van Riel
    Reviewed-by: Christoph Lameter
    Reviewed-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Signed-off-by: Wu Fengguang
    Acked-by: Johannes Weiner
    Reviewed-by: Minchan Kim
    Reviewed-by: Jesse Barnes
    Cc: David Howells
    Cc: "Li, Ming Chun"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • __add_zone_page_state() and __sub_zone_page_state() are unused.

    Signed-off-by: KOSAKI Motohiro
    Cc: Wu Fengguang
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Lameter
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

17 Jun, 2009

2 commits

  • On NUMA machines, the administrator can configure zone_reclaim_mode, which
    is a more targeted form of direct reclaim. On machines with large NUMA
    distances, for example, zone_reclaim_mode defaults to 1, meaning that
    clean unmapped pages will be reclaimed if the zone watermarks are not
    being met.

    There is a heuristic that determines if the scan is worthwhile but it is
    possible that the heuristic will fail and the CPU gets tied up scanning
    uselessly. Detecting the situation requires some guesswork and
    experimentation so this patch adds a counter "zreclaim_failed" to
    /proc/vmstat. If during high CPU utilisation this counter is increasing
    rapidly, then the resolution to the problem may be to set
    /proc/sys/vm/zone_reclaim_mode to 0.
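
    To watch such a counter, a monitoring tool needs to pick one "name value"
    line out of the /proc/vmstat text. The helper below is a sketch of that
    lookup over an in-memory buffer; the function name is hypothetical, and
    the counter name is passed in by the caller rather than assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Searches vmstat-style text ("name value" per line) for an exact counter
 * name. Returns true and stores the value on success.
 */
bool vmstat_lookup(const char *text, const char *name,
                   unsigned long long *value)
{
    size_t name_len;
    const char *line = text;

    assert(text != NULL && name != NULL && value != NULL);
    name_len = strlen(name);

    while (*line) {
        const char *eol = strchr(line, '\n');
        size_t len = eol ? (size_t)(eol - line) : strlen(line);

        /* Require the full name followed by a space, not just a prefix. */
        if (len > name_len && strncmp(line, name, name_len) == 0 &&
            line[name_len] == ' ' &&
            sscanf(line + name_len + 1, "%llu", value) == 1)
            return true;

        line += len;
        if (*line == '\n')
            line++;
    }
    return false;
}
```

    Sampling the counter twice and comparing the values gives the rate of
    increase that the text above suggests watching.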

    [akpm@linux-foundation.org: name things consistently]
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Cc: Wu Fengguang
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Currently, nobody wants to turn UNEVICTABLE_LRU off. Thus this
    configurability is unnecessary.

    Signed-off-by: KOSAKI Motohiro
    Cc: Johannes Weiner
    Cc: Andi Kleen
    Acked-by: Minchan Kim
    Cc: David Woodhouse
    Cc: Matt Mackall
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

23 Oct, 2008

4 commits


20 Oct, 2008

4 commits

  • Allow free of mlock()ed pages. This shouldn't happen, but during
    development, it occasionally did.

    This patch allows us to survive that condition, while keeping the
    statistics and events correct for debug.

    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Add NR_MLOCK zone page state, which provides a (conservative) count of
    mlocked pages (actually, the number of mlocked pages moved off the LRU).

    Reworked by lts to fit in with the modified mlock page support in the
    Reclaim Scalability series.

    [kosaki.motohiro@jp.fujitsu.com: fix incorrect Mlocked field of /proc/meminfo]
    [lee.schermerhorn@hp.com: mlocked-pages: add event counting with statistics]
    Signed-off-by: Nick Piggin
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Fix to unevictable-lru-page-statistics.patch

    Add unevictable lru infrastructure vm events to the statistics patch.
    Rename the "NORECL_" and "noreclaim_" symbols and text strings to
    "UNEVICTABLE_" and "unevictable_", respectively.

    Currently, both the infrastructure and the mlocked pages event are
    added by a single patch later in the series. This makes it difficult
    to add or rework the incremental patches. The events actually "belong"
    with the stats, so pull them up to here.

    Also, restore the event counting to putback_lru_page(). This was removed
    from previous patch in series where it was "misplaced". The actual events
    weren't defined that early.

    Signed-off-by: Lee Schermerhorn
    Cc: Rik van Riel
    Reviewed-by: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Split the LRU lists in two, one set for pages that are backed by real file
    systems ("file") and one for pages that are backed by memory and swap
    ("anon"). The latter includes tmpfs.

    The advantage of doing this is that the VM will not have to scan over lots
    of anonymous pages (which we generally do not want to swap out), just to
    find the page cache pages that it should evict.

    This patch has the infrastructure and a basic policy to balance how much
    we scan the anon lists and how much we scan the file lists. The big
    policy changes are in separate patches.
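
    One way to picture the split is as a list index built from two properties
    of a page: whether it is swap backed and whether it is active. The enum
    and function below are an illustrative sketch, not the kernel's
    definitions; tmpfs pages would pass swap_backed = true, as described
    above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical list indices: anon lists first, then file lists. */
enum lru_list_sketch {
    INACTIVE_ANON,
    ACTIVE_ANON,
    INACTIVE_FILE,
    ACTIVE_FILE,
    NR_LISTS_SKETCH
};

enum lru_list_sketch pick_lru_list(bool swap_backed, bool active)
{
    int index = 0;

    if (!swap_backed)
        index += 2;  /* file-backed pages live on the "file" lists */
    if (active)
        index += 1;  /* active list follows the inactive one */

    assert(index < NR_LISTS_SKETCH);
    return (enum lru_list_sketch)index;
}
```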

    [lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset]
    [kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru]
    [kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page]
    [hugh@veritas.com: memcg swapbacked pages active]
    [hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED]
    [akpm@linux-foundation.org: fix /proc/vmstat units]
    [nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration]
    [kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo]
    [kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()]
    Signed-off-by: Rik van Riel
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Hugh Dickins
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

25 Jul, 2008

1 commit


28 Apr, 2008

2 commits

  • Allocating huge pages directly from the buddy allocator is not guaranteed to
    succeed. Success depends on several factors (such as the amount of physical
    memory available and the level of fragmentation). With the addition of
    dynamic hugetlb pool resizing, allocations can occur much more frequently.
    For these reasons it is desirable to keep track of huge page allocation
    successes and failures.

    Add two new vmstat entries to track huge page allocations that succeed and
    fail. The presence of the two entries is contingent upon CONFIG_HUGETLB_PAGE
    being enabled.

    [akpm@linux-foundation.org: reduced ifdeffery]
    Signed-off-by: Adam Litke
    Signed-off-by: Eric Munson
    Tested-by: Mel Gorman
    Reviewed-by: Andy Whitcroft
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
     
  • On NUMA, zone_statistics() is used to record events like numa hit, miss and
    foreign. It assumes that the first zone in a zonelist is the preferred zone.
    When multiple zonelists are replaced by one that is filtered, this is no
    longer the case.

    This patch records what the preferred zone is rather than assuming the first
    zone in the zonelist is it. This simplifies the reading of later patches in
    this set.
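
    A sketch of the accounting with the preferred node passed in explicitly
    rather than inferred from the head of a zonelist. Counters are kept per
    node here for brevity, and the structure, constant and function names are
    assumptions made for this example.

```c
#include <assert.h>

#define MAX_NODES_SKETCH 4

/* Hypothetical per-node NUMA event counters. */
struct numa_counters {
    unsigned long hit;      /* allocated on the preferred node */
    unsigned long miss;     /* allocated here although another was preferred */
    unsigned long foreign;  /* preferred here but allocated elsewhere */
};

void zone_statistics_sketch(struct numa_counters *per_node,
                            int preferred_node, int actual_node)
{
    assert(per_node != 0);
    assert(preferred_node >= 0 && preferred_node < MAX_NODES_SKETCH);
    assert(actual_node >= 0 && actual_node < MAX_NODES_SKETCH);

    if (actual_node == preferred_node) {
        per_node[actual_node].hit++;
    } else {
        per_node[actual_node].miss++;
        per_node[preferred_node].foreign++;
    }
}
```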

    Signed-off-by: Mel Gorman
    Signed-off-by: Lee Schermerhorn
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Reviewed-by: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

01 Mar, 2008

1 commit


18 Jul, 2007

1 commit

  • The following 8 patches against 2.6.20-mm2 create a zone called ZONE_MOVABLE
    that is only usable by allocations that specify both __GFP_HIGHMEM and
    __GFP_MOVABLE. This has the effect of keeping all non-movable pages within a
    single memory partition while allowing movable allocations to be satisfied
    from either partition. The patches may be applied with the list-based
    anti-fragmentation patches that group pages together based on mobility.

    The size of the zone is determined by a kernelcore= parameter specified at
    boot-time. This specifies how much memory is usable by non-movable
    allocations and the remainder is used for ZONE_MOVABLE. Any range of pages
    within ZONE_MOVABLE can be released by migrating the pages or by reclaiming.

    When selecting a zone to take pages from for ZONE_MOVABLE, there are two
    things to consider. First, only memory from the highest populated zone is
    used for ZONE_MOVABLE. On the x86, this is probably going to be ZONE_HIGHMEM
    but it would be ZONE_DMA on ppc64 or possibly ZONE_DMA32 on x86_64. Second,
    the amount of memory usable by the kernel will be spread evenly throughout
    NUMA nodes where possible. If the nodes are not of equal size, the amount of
    memory usable by the kernel on some nodes may be greater than others.
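
    The "spread evenly where possible" behaviour can be illustrated with a
    simple water-filling loop. This is not the kernel's algorithm, only a
    sketch under the assumption that each node receives an equal share of the
    kernelcore= amount until it runs out of pages, with the shortfall being
    redistributed to the remaining nodes. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * node_pages[i]:   pages present on node i
 * kernel_pages[i]: output, pages on node i left usable by the kernel
 * Whatever is not assigned here would form ZONE_MOVABLE on that node.
 */
void spread_kernelcore_sketch(const unsigned long *node_pages,
                              unsigned long *kernel_pages,
                              size_t nr_nodes, unsigned long kernelcore)
{
    unsigned long remaining = kernelcore;

    assert(node_pages != NULL && kernel_pages != NULL);

    for (size_t i = 0; i < nr_nodes; i++)
        kernel_pages[i] = 0;

    while (remaining > 0) {
        size_t open_nodes = 0;
        unsigned long share;

        /* Count nodes that can still take more kernel memory. */
        for (size_t i = 0; i < nr_nodes; i++)
            if (kernel_pages[i] < node_pages[i])
                open_nodes++;
        if (open_nodes == 0)
            break;  /* kernelcore exceeds total memory */

        share = remaining / open_nodes;
        if (share == 0)
            share = 1;

        for (size_t i = 0; i < nr_nodes && remaining > 0; i++) {
            unsigned long room = node_pages[i] - kernel_pages[i];
            unsigned long take = share < room ? share : room;

            if (take > remaining)
                take = remaining;
            kernel_pages[i] += take;
            remaining -= take;
        }
    }
}
```

    With nodes of 100 and 400 pages and a kernelcore of 300 pages, the small
    node contributes all 100 pages and the large node 200, which is the uneven
    outcome described above for nodes of unequal size.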

    By default, the zone is not as useful for hugetlb allocations because they are
    pinned and non-migratable (currently at least). A sysctl is provided that
    allows huge pages to be allocated from that zone. This means that the huge
    page pool can be resized to the size of ZONE_MOVABLE during the lifetime of
    the system assuming that pages are not mlocked. Despite huge pages being
    non-movable, we do not introduce additional external fragmentation of note as
    huge pages are always the largest contiguous block we care about.

    Credit goes to Andy Whitcroft for catching a large variety of problems during
    review of the patches.

    This patch creates an additional zone, ZONE_MOVABLE. This zone is only usable
    by allocations which specify both __GFP_HIGHMEM and __GFP_MOVABLE. Hot-added
    memory continues to be placed in its existing destination as there is no
    mechanism to redirect it to a specific zone.

    [y-goto@jp.fujitsu.com: Fix section mismatch of memory hotplug related code]
    [akpm@linux-foundation.org: various fixes]
    Signed-off-by: Mel Gorman
    Cc: Andy Whitcroft
    Signed-off-by: Yasunori Goto
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

10 May, 2007

1 commit

  • vmstat is currently using the cache reaper to periodically bring the
    statistics up to date. The cache reaper only exists in SLUB as a way to
    provide compatibility with SLAB. This patch removes the vmstat calls from the
    slab allocators and provides its own handling.

    The advantage is also that we can use a different frequency for the updates.
    Refreshing vm stats is a pretty fast job so we can run this every second and
    stagger this by only one tick. This will lead to some overlap in large
    systems. For example, a system running at 250 HZ with 1024 processors will have 4 vm
    updates occurring at once.
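
    The overlap figure can be checked with a little arithmetic. The sketch
    assumes that CPU n runs its update on tick (n mod HZ) within each second,
    which is one plausible reading of "stagger this by only one tick"; the
    function name is hypothetical.

```c
#include <assert.h>

/* Number of CPUs whose update falls on the given tick of each second. */
unsigned int cpus_updating_on_tick(unsigned int hz, unsigned int nr_cpus,
                                   unsigned int tick)
{
    assert(hz > 0 && tick < hz);

    /* Every tick gets nr_cpus / hz CPUs; the first few get one extra. */
    return nr_cpus / hz + (tick < nr_cpus % hz ? 1 : 0);
}
```

    Under that assumption, 1024 processors at 250 HZ give 4 updates on most
    ticks and 5 on the first 24, because 1024 is not a multiple of 250.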

    However, the vm stats update only accesses per node information. It is only
    necessary to stagger the vm statistics updates per processor in each node. Vm
    counter updates occurring on distant nodes will not cause cacheline
    contention.

    We could implement an alternate approach that runs the first processor on each
    node at the second and then each of the other processors on a node on a
    subsequent tick. That may be useful to keep a large amount of the second free
    of timer activity. Maybe the timer folks will have some feedback on this one?

    [jirislaby@gmail.com: add missing break]
    Cc: Arjan van de Ven
    Signed-off-by: Christoph Lameter
    Signed-off-by: Jiri Slaby
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

12 Feb, 2007

4 commits

  • - Prevent things like this:

    block/ll_rw_blk.c: In function 'submit_bio':
    block/ll_rw_blk.c:3222: warning: unused variable 'count'

    inlines are very, very preferable to macros.

    - remove unused get_cpu_vm_events() macro

    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Make ZONE_DMA optional in core code.

    - ifdef all code for ZONE_DMA and related definitions following the example
    for ZONE_DMA32 and ZONE_HIGHMEM.

    - Without ZONE_DMA, ZONE_HIGHMEM and ZONE_DMA32 we get to a ZONES_SHIFT of
    0.

    - Modify the VM statistics to work correctly without a DMA zone.

    - Modify slab to not create DMA slabs if there is no ZONE_DMA.

    [akpm@osdl.org: cleanup]
    [jdike@addtoit.com: build fix]
    [apw@shadowen.org: Simplify calculation of the number of bits we need for ZONES_SHIFT]
    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Cc: "Luck, Tony"
    Cc: Kyle McMartin
    Cc: Matthew Wilcox
    Cc: James Bottomley
    Cc: Paul Mundt
    Signed-off-by: Andy Whitcroft
    Signed-off-by: Jeff Dike
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • nr_free_pages is now a simple access to a global variable. Make it a macro
    instead of a function.

    The nr_free_pages now requires vmstat.h to be included. There is one
    occurrence in power management where we need to add the include. Directly
    refer to global_page_state() there to clarify why the #include was added.

    [akpm@osdl.org: arm build fix]
    [akpm@osdl.org: sparc64 build fix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The determination of the dirty ratio to determine writeback behavior is
    currently based on the number of total pages on the system.

    However, not all pages in the system may be dirtied. Thus the ratio is always
    too low and can never reach 100%. The ratio may be particularly skewed if
    large hugepage allocations, slab allocations or device driver buffers make
    large sections of memory not available anymore. In that case we may get into
    a situation in which, for example, the background writeback ratio of 40% cannot be
    reached anymore which leads to undesired writeback behavior.

    This patchset fixes that issue by determining the ratio based on the actual
    pages that may potentially be dirty. These are the pages on the active and
    the inactive list plus free pages.
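
    In code, the new basis for the ratio amounts to the following. This is a
    sketch only: the function name is hypothetical and the counts are passed
    in directly instead of being read from the ZVC counters.

```c
#include <assert.h>

/*
 * Returns the number of pages that may be dirty before the given ratio is
 * reached, measured against dirtyable memory rather than all memory.
 */
unsigned long dirty_limit_pages(unsigned long nr_active,
                                unsigned long nr_inactive,
                                unsigned long nr_free,
                                unsigned int ratio_percent)
{
    unsigned long dirtyable;

    assert(ratio_percent <= 100);

    /* Only pages on the LRU lists and free pages can become dirty. */
    dirtyable = nr_active + nr_inactive + nr_free;

    /* Widen before multiplying to avoid overflow on 32-bit longs. */
    return (unsigned long)((unsigned long long)dirtyable * ratio_percent / 100);
}
```

    With 4000 dirtyable pages a 40% ratio yields a limit of 1600 pages, no
    matter how much further memory is tied up in huge pages or slab.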

    The problem with those counts has so far been that it is expensive to
    calculate these because counts from multiple nodes and multiple zones will
    have to be summed up. This patchset makes these counters ZVC counters. This
    means that a current sum per zone, per node and for the whole system is always
    available via global variables and not expensive anymore to calculate.

    The patchset results in some other good side effects:

    - Removal of the various functions that sum up free, active and inactive
    page counts

    - Cleanup of the functions that display information via the proc filesystem.

    This patch:

    The use of a ZVC for nr_inactive and nr_active allows a simplification of some
    counter operations. More ZVC functionality is used for sums etc in the
    following patches.

    [akpm@osdl.org: UP build fix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

23 Dec, 2006

2 commits

  • fix vm_events_fold_cpu() build breakage

    2.6.20-rc1 does not build properly if CONFIG_VM_EVENT_COUNTERS is set
    and CONFIG_HOTPLUG is unset:

    CC init/version.o
    LD init/built-in.o
    LD .tmp_vmlinux1
    mm/built-in.o: In function `page_alloc_cpu_notify':
    page_alloc.c:(.text+0x56eb): undefined reference to `vm_events_fold_cpu'
    make: *** [.tmp_vmlinux1] Error 1

    [akpm@osdl.org: cleanup]
    Signed-off-by: Magnus Damm
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Magnus Damm
     
  • The VM event counters in /proc/vmstat, enabled by CONFIG_VM_EVENT_COUNTERS,
    have become more essential to non-EMBEDDED kernel configurations than they
    were in the past. Comments in
    the code and the Kconfig configuration explanation were stale, downplaying
    their role excessively.

    Refresh those comments to correctly reflect the current role of VM event
    counters.

    Signed-off-by: Paul Jackson
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     

30 Sep, 2006

1 commit


26 Sep, 2006

2 commits

  • eventcounters: Do not display counters for zones that are not available on an
    arch

    Do not define or display counters for the DMA32 and the HIGHMEM zone if such
    zones were not configured.

    [akpm@osdl.org: s390 fix]
    [heiko.carstens@de.ibm.com: s390 fix]
    Signed-off-by: Christoph Lameter
    Cc: Martin Schwidefsky
    Signed-off-by: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Make ZONE_DMA32 optional

    - Add #ifdefs around ZONE_DMA32 specific code and definitions.

    - Add CONFIG_ZONE_DMA32 config option and use that for x86_64
    that alone needs this zone.

    - Remove the use of CONFIG_DMA_IS_DMA32 and CONFIG_DMA_IS_NORMAL
    for ia64 and fix up the way per node ZVCs are calculated.

    - Fall back to prior GFP_ZONEMASK of 0x03 if there is no
    DMA32 zone.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

06 Aug, 2006

1 commit


11 Jul, 2006

2 commits

  • It turns out that there is a way to build a kernel with NUMA and no SMP.
    In that case we are missing one definition __inc_zone_state.

    Provide that missing __inc_zone_state.

    (akpm: NUMA && !SMP sounds odd, but I am told "But there is the concept of
    cpuless nodes. A NUMA system without SMP has a single processor but multiple
    memory nodes. This used to work before on IA64 (wasn't aware of it, never seen
    anyone with this kind of thing).")

    Acked-by: Tony Luck
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Dopey bug. Causes hopelessly-wrong numbers from vmstat(8) and several other
    counters.

    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

01 Jul, 2006

3 commits

  • The remaining counters in page_state after the zoned VM counter patches
    have been applied are all just for show in /proc/vmstat. They have no
    essential function for the VM.

    We use a simple increment of per cpu variables. In order to avoid the most
    severe races we disable preempt. Preempt does not prevent the race between
    an increment and an interrupt handler incrementing the same statistics
    counter. However, that race is exceedingly rare, we may only lose one
    increment or so and there is no requirement (at least not in kernel) that
    the vm event counters have to be accurate.

    In the non preempt case this results in a simple increment for each
    counter. For many architectures this will be reduced by the compiler to a
    single instruction. This single instruction is atomic for i386 and x86_64.
    And therefore even the rare race condition in an interrupt is avoided for
    both architectures in most cases.

    The patchset also adds an off switch for embedded systems that allows a
    building of linux kernels without these counters.

    The implementation of these counters is through inline code that hopefully
    results in only a single increment instruction being emitted
    (i386, x86_64) or in the increment being hidden through instruction
    concurrency (EPIC architectures such as ia64 can get that done).

    Benefits:
    - VM event counter operations usually reduce to a single inline instruction
    on i386 and x86_64.
    - No interrupt disable, only preempt disable for the preempt case.
    Preempt disable can also be avoided by moving the counter into a spinlock.
    - Handling is similar to zoned VM counters.
    - Simple and easily extendable.
    - Can be omitted to reduce memory use for embedded use.
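
    The scheme can be sketched in userspace with one row of counters per CPU.
    The event names, the CPU count and the function names are hypothetical,
    and the CPU is passed in explicitly because the sketch has no notion of
    the current processor or of preemption.

```c
#include <assert.h>

#define NR_CPUS_SKETCH 4

/* Hypothetical event identifiers for this example. */
enum vm_event_sketch { EV_PGFAULT, EV_PGFREE, NR_EVENTS_SKETCH };

/* One private row per CPU, so an increment touches no shared cacheline. */
static unsigned long vm_events[NR_CPUS_SKETCH][NR_EVENTS_SKETCH];

/* Plain increment of the local counter: no lock, no atomic operation. */
void count_vm_event_sketch(unsigned int cpu, enum vm_event_sketch ev)
{
    assert(cpu < NR_CPUS_SKETCH && ev < NR_EVENTS_SKETCH);
    vm_events[cpu][ev]++;
}

/* Readers (for example /proc/vmstat) fold all per-CPU rows together. */
unsigned long sum_vm_event_sketch(enum vm_event_sketch ev)
{
    unsigned long total = 0;

    assert(ev < NR_EVENTS_SKETCH);
    for (unsigned int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
        total += vm_events[cpu][ev];
    return total;
}
```

    The cost is moved to the reader: incrementing is a single memory update,
    while producing a total requires a pass over every CPU's row.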

    References:

    RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=113512330605497&w=2
    RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=114988082814934&w=2
    local_t http://marc.theaimsgroup.com/?l=linux-kernel&m=114991748606690&w=2
    V2 http://marc.theaimsgroup.com/?t=115014808400007&r=1&w=2
    V3 http://marc.theaimsgroup.com/?l=linux-kernel&m=115024767022346&w=2
    V4 http://marc.theaimsgroup.com/?l=linux-kernel&m=115047968808926&w=2

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The numa statistics are really event counters. But they are per node and
    so we have had special treatment for these counters through additional
    fields on the pcp structure. We can now use the per zone nature of the
    zoned VM counters to realize these.

    This will shrink the size of the pcp structure on NUMA systems. We will
    have some room to add additional per zone counters that will all still fit
    in the same cacheline.

    Bits  Prior pcp size        Size after patch     We can add
    ------------------------------------------------------------------------
    64    128 bytes (16 words)  80 bytes (10 words)  48
    32    76 bytes (19 words)   56 bytes (14 words)  8 (64 byte cacheline)
                                                     72 (128 byte)

    Remove the special statistics for numa and replace them with zoned vm
    counters. This has the side effect that global sums of these events now
    show up in /proc/vmstat.

    Also take the opportunity to move the zone_statistics() function from
    page_alloc.c into vmstat.c.

    Discussions:
    V2 http://marc.theaimsgroup.com/?t=115048227000002&r=1&w=2

    Signed-off-by: Christoph Lameter
    Acked-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • No callers.

    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton