01 Jan, 2009

1 commit

  • Impact: Use new API

    Convert kernel mm functions to use struct cpumask.

    We skip include/linux/percpu.h and mm/allocpercpu.c, which are in flux.

    Signed-off-by: Rusty Russell
    Signed-off-by: Mike Travis
    Reviewed-by: Christoph Lameter

    Rusty Russell
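
    A minimal sketch of the flavour of conversion this implies, assuming the new
    struct cpumask API (do_something() is a hypothetical callee; the patch's
    actual hunks differ):

        #include <linux/cpumask.h>

        extern void do_something(int cpu);

        /* Old style: cpumask_t copied by value; large when NR_CPUS is big. */
        static void walk_node_cpus_old(int node)
        {
                cpumask_t mask = node_to_cpumask(node);
                int cpu;

                for_each_cpu_mask(cpu, mask)
                        do_something(cpu);
        }

        /* New style: operate on a const struct cpumask * without copying it. */
        static void walk_node_cpus_new(int node)
        {
                const struct cpumask *mask = cpumask_of_node(node);
                int cpu;

                for_each_cpu(cpu, mask)
                        do_something(cpu);
        }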
     

20 Oct, 2008

7 commits

  • Allow free of mlock()ed pages. This shouldn't happen, but during
    development, it occasionally did.

    This patch allows us to survive that condition, while keeping the
    statistics and events correct for debug.

    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Add NR_MLOCK zone page state, which provides a (conservative) count of
    mlocked pages (actually, the number of mlocked pages moved off the LRU).

    Reworked by lts to fit in with the modified mlock page support in the
    Reclaim Scalability series.

    [kosaki.motohiro@jp.fujitsu.com: fix incorrect Mlocked field of /proc/meminfo]
    [lee.schermerhorn@hp.com: mlocked-pages: add event counting with statistics]
    Signed-off-by: Nick Piggin
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
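
    A minimal sketch of how such a conservative counter can be maintained with
    the zone page state API (illustrative call sites, not the patch's exact ones;
    the caller is assumed to hold zone->lru_lock so the __-prefixed variant is
    safe):

        #include <linux/mm.h>
        #include <linux/vmstat.h>

        /* The page is being moved off the LRU as mlocked: count it... */
        static void account_mlocked(struct page *page)
        {
                __mod_zone_page_state(page_zone(page), NR_MLOCK, 1);
        }

        /* ...and un-count it when it is munlocked or put back on the LRU. */
        static void unaccount_mlocked(struct page *page)
        {
                __mod_zone_page_state(page_zone(page), NR_MLOCK, -1);
        }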
     
  • Report unevictable pages per zone and system wide.

    Kosaki Motohiro added support for memory controller unevictable
    statistics.

    [riel@redhat.com: fix printk in show_free_areas()]
    [akpm@linux-foundation.org: fix units in /proc/vmstats]
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Debugged-by: Hiroshi Shimamoto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Fix to unevictable-lru-page-statistics.patch

    Add unevictable lru infrastructure vm events to the statistics patch.
    Rename the "NORECL_" and "noreclaim_" symbols and text strings to
    "UNEVICTABLE_" and "unevictable_", respectively.

    Currently, both the infrastructure and the mlocked pages event are
    added by a single patch later in the series. This makes it difficult
    to add or rework the incremental patches. The events actually "belong"
    with the stats, so pull them up to here.

    Also, restore the event counting to putback_lru_page(). This was removed
    from a previous patch in the series, where it was "misplaced"; the actual
    events weren't defined that early.

    Signed-off-by: Lee Schermerhorn
    Cc: Rik van Riel
    Reviewed-by: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • We avoid evicting and scanning anonymous pages for the most part, but
    under some workloads we can end up with most of memory filled with
    anonymous pages. At that point, we suddenly need to clear the referenced
    bits on all of memory, which can take ages on very large memory systems.

    We can reduce the maximum number of pages that need to be scanned by not
    taking the referenced state into account when deactivating an anonymous
    page. After all, every anonymous page starts out referenced, so why
    check?

    If an anonymous page gets referenced again before it reaches the end of
    the inactive list, we move it back to the active list.

    To keep the maximum amount of necessary work reasonable, we scale the
    active to inactive ratio with the size of memory, using the formula
    active:inactive ratio = sqrt(memory in GB * 10).

    Kswapd CPU use now seems to scale with the amount of pageout bandwidth,
    instead of with the amount of memory present in the system.

    [kamezawa.hiroyu@jp.fujitsu.com: fix OOM with memcg]
    [kamezawa.hiroyu@jp.fujitsu.com: memcg: lru scan fix]
    Signed-off-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
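
    A rough worked version of the formula above (int_sqrt() is the kernel's
    integer square root; the pages-to-gigabytes conversion is schematic):

        /*
         * active:inactive ratio = sqrt(memory in GB * 10)
         *
         *    1 GB -> sqrt(10)    ~   3
         *   10 GB -> sqrt(100)   =  10
         *  100 GB -> sqrt(1000)  ~  31
         *    1 TB -> sqrt(10000) = 100
         */
        static unsigned int inactive_anon_ratio(unsigned long pages)
        {
                unsigned long gb = pages >> (30 - PAGE_SHIFT);  /* pages -> GB */
                unsigned int ratio = int_sqrt(10 * gb);

                return ratio ? ratio : 1;       /* never let it reach zero */
        }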
     
  • Split the LRU lists in two, one set for pages that are backed by real file
    systems ("file") and one for pages that are backed by memory and swap
    ("anon"). The latter includes tmpfs.

    The advantage of doing this is that the VM will not have to scan over lots
    of anonymous pages (which we generally do not want to swap out), just to
    find the page cache pages that it should evict.

    This patch has the infrastructure and a basic policy to balance how much
    we scan the anon lists and how much we scan the file lists. The big
    policy changes are in separate patches.

    [lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset]
    [kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru]
    [kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page]
    [hugh@veritas.com: memcg swapbacked pages active]
    [hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED]
    [akpm@linux-foundation.org: fix /proc/vmstat units]
    [nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration]
    [kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo]
    [kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()]
    Signed-off-by: Rik van Riel
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Hugh Dickins
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Currently we are defining explicit variables for the inactive and active
    list. An indexed array can be more generic and avoid repeating similar
    code in several places in the reclaim code.

    We are saving a few bytes in terms of code size:

    Before:

    text data bss dec hex filename
    4097753 573120 4092484 8763357 85b7dd vmlinux

    After:

    text data bss dec hex filename
    4097729 573120 4092484 8763333 85b7c5 vmlinux

    Having an easy way to add new lru lists may ease future work on the
    reclaim code.

    Signed-off-by: Rik van Riel
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Christoph Lameter
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
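
    A simplified sketch of the indexed-array idea (field names are approximate;
    the split-LRU patches above extend the index with separate anon/file entries):

        enum lru_list {
                LRU_INACTIVE,
                LRU_ACTIVE,
                NR_LRU_LISTS
        };

        struct zone {
                /* ... */
                struct {
                        struct list_head list;
                        unsigned long nr_scan;
                } lru[NR_LRU_LISTS];    /* replaces active_list/inactive_list */
                /* ... */
        };

        #define for_each_lru(l) for (l = 0; l < NR_LRU_LISTS; l++)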
     

28 Aug, 2008

1 commit

  • Ordinarily, memory holes in flatmem still have a valid memmap and are safe
    to use. However, an architecture (ARM) frees up the memmap backing memory
    holes on the assumption it is never used. /proc/pagetypeinfo reads the
    whole range of pages in a zone believing that the memmap is valid and that
    pfn_valid will return false if it is not. On ARM, freeing the memmap breaks
    the page->zone linkages even though pfn_valid() returns true and the kernel
    can oops shortly afterwards due to accessing a bogus struct zone *.

    This patch lets architectures say when FLATMEM can have holes in the
    memmap. Rather than an expensive check for valid memory, /proc/pagetypeinfo
    will confirm that the page linkages are still valid by checking page->zone
    is still the expected zone. The lookup of page_zone is safe as there is a
    limited range of memory that is accessed when calling page_zone. Even if
    page_zone happens to return the expected zone for a bogus page, so that the
    check passes spuriously, the impact is only that the counters in
    /proc/pagetypeinfo are slightly off, and fragmentation monitoring is unlikely
    to be relevant on an embedded system anyway.

    Reported-by: H Hartley Sweeten
    Signed-off-by: Mel Gorman
    Tested-by: H Hartley Sweeten
    Signed-off-by: Russell King

    Mel Gorman
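
    A minimal sketch of the defensive check described above (simplified; the real
    walker also steps by pageblock and accounts the migratetype of each block):

        for (pfn = zone->zone_start_pfn; pfn < end_pfn; pfn++) {
                struct page *page;

                if (!pfn_valid(pfn))
                        continue;

                page = pfn_to_page(pfn);

                /* Catch memmap holes (e.g. ARM FLATMEM): a freed memmap entry
                 * will normally not link back to the zone being walked. */
                if (page_zone(page) != zone)
                        continue;

                /* ... count this page for /proc/pagetypeinfo ... */
        }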
     

13 May, 2008

1 commit

  • When accessing cpu_online_map, we should prevent it from changing
    dynamically by calling get_online_cpus().

    Unfortunately, all_vm_events() doesn't do that.

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Christoph Lameter
    Cc: Gautham R Shenoy
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
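
    A minimal sketch of the fix described above (the per-cpu summation helper is
    assumed; cpu_online_map cannot change while the hotplug lock is held):

        void all_vm_events(unsigned long *ret)
        {
                get_online_cpus();                      /* pin cpu_online_map */
                sum_vm_events(ret, &cpu_online_map);
                put_online_cpus();
        }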
     

30 Apr, 2008

2 commits

  • Fuse will use temporary buffers to write back dirty data from memory mappings
    (normal writes are done synchronously). This is needed, because there cannot
    be any guarantee about the time in which a write will complete.

    By using temporary buffers, from the MM's point of view the page is written
    back immediately. If the writeout was due to memory pressure, this
    effectively migrates data from a full zone to a less full zone.

    This patch adds a new counter (NR_WRITEBACK_TEMP) for the number of pages used
    as temporary buffers.

    [Lee.Schermerhorn@hp.com: add vmstat_text for NR_WRITEBACK_TEMP]
    Signed-off-by: Miklos Szeredi
    Cc: Christoph Lameter
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
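
    A minimal sketch of how a filesystem using temporary writeback buffers might
    account them with the new counter (function names here are illustrative, not
    fuse's actual code):

        static void temp_writeback_begin(struct page *tmp_page)
        {
                inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
        }

        static void temp_writeback_end(struct page *tmp_page)
        {
                dec_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
        }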
     
  • On a memoryless node, /proc/pagetypeinfo displays slightly funny output.
    This patch fixes it.

    Output example (the headers are printed, but no data follows them):
    --------------------------------------------------------------
    Page block order: 14
    Pages per block: 16384

    Free pages count per migrate type at order 0 1 2 3 4 5 \
    6 7 8 9 10 11 12 13 14 15 16

    Number of blocks type Unmovable Reclaimable Movable Reserve Isolate
    Page block order: 14
    Pages per block: 16384

    Free pages count per migrate type at order 0 1 2 3 4 5 \
    6 7 8 9 10 11 12 13 14 15 16

    Signed-off-by: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

28 Apr, 2008

3 commits

  • We've found that it can take quite a bit of time (hundreds of microseconds)
    to get through the zone loop in refresh_cpu_vm_stats().

    Add a cond_resched() to allow other threads to run in the non-preemptive
    case.

    Signed-off-by: Dimitri Sivanich
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dimitri Sivanich
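
    A minimal sketch of where the reschedule point goes (the per-counter folding
    inside the loop is elided):

        for_each_zone(zone) {
                /* ... fold this cpu's per-zone diffs into the zone counters ... */
                cond_resched();         /* the full loop can take 100s of usec */
        }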
     
  • Allocating huge pages directly from the buddy allocator is not guaranteed to
    succeed. Success depends on several factors (such as the amount of physical
    memory available and the level of fragmentation). With the addition of
    dynamic hugetlb pool resizing, allocations can occur much more frequently.
    For these reasons it is desirable to keep track of huge page allocation
    successes and failures.

    Add two new vmstat entries to track huge page allocations that succeed and
    fail. The presence of the two entries is contingent upon CONFIG_HUGETLB_PAGE
    being enabled.

    [akpm@linux-foundation.org: reduced ifdeffery]
    Signed-off-by: Adam Litke
    Signed-off-by: Eric Munson
    Tested-by: Mel Gorman
    Reviewed-by: Andy Whitcroft
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
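
    A minimal sketch of the accounting described above (the surrounding hugetlb
    allocation path and gfp_mask setup are elided; the event names follow the
    patch description):

        struct page *page = alloc_pages(gfp_mask | __GFP_COMP | __GFP_NOWARN,
                                        HUGETLB_PAGE_ORDER);

        if (page)
                count_vm_event(HTLB_BUDDY_PGALLOC);
        else
                count_vm_event(HTLB_BUDDY_PGALLOC_FAIL);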
     
  • On NUMA, zone_statistics() is used to record events like numa hit, miss and
    foreign. It assumes that the first zone in a zonelist is the preferred zone.
    When multiple zonelists are replaced by one that is filtered, this is no
    longer the case.

    This patch records what the preferred zone is rather than assuming the first
    zone in the zonelist is it. This simplifies the reading of later patches in
    this set.

    Signed-off-by: Mel Gorman
    Signed-off-by: Lee Schermerhorn
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Reviewed-by: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
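
    A simplified sketch of the idea: the preferred zone is passed in explicitly
    instead of being read off the head of the zonelist (details of the real
    function may differ):

        void zone_statistics(struct zone *preferred_zone, struct zone *z)
        {
                if (z->zone_pgdat == preferred_zone->zone_pgdat) {
                        __inc_zone_state(z, NUMA_HIT);
                } else {
                        __inc_zone_state(z, NUMA_MISS);
                        __inc_zone_state(preferred_zone, NUMA_FOREIGN);
                }

                if (z->node == numa_node_id())
                        __inc_zone_state(z, NUMA_LOCAL);
                else
                        __inc_zone_state(z, NUMA_OTHER);
        }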
     

16 Apr, 2008

1 commit

  • Commit a5d76b54a3f3a40385d7f76069a2feac9f1bad63 ("memory unplug: page isolation"
    by KAMEZAWA Hiroyuki) added the "isolate" migratetype, but unfortunately it
    didn't update the /proc/pagetypeinfo display logic.

    This patch adds "Isolate" to the pagetype name field.

    /proc/pagetypeinfo
    before:
    ------------------------------------------------------------------------------------------------------------------------
    Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10
    Node 0, zone DMA, type Unmovable 1 2 2 2 1 2 2 1 1 0 0
    Node 0, zone DMA, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone DMA, type Movable 2 3 3 1 3 3 2 0 0 0 0
    Node 0, zone DMA, type Reserve 0 0 0 0 0 0 0 0 0 0 1
    Node 0, zone DMA, type 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone Normal, type Unmovable 1 9 7 4 1 1 1 1 0 0 0
    Node 0, zone Normal, type Reclaimable 5 2 0 0 1 1 0 0 0 1 0
    Node 0, zone Normal, type Movable 0 1 1 0 0 0 1 0 0 1 60
    Node 0, zone Normal, type Reserve 0 0 0 0 0 0 0 0 0 0 1
    Node 0, zone Normal, type 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone HighMem, type Unmovable 0 0 1 1 1 0 1 1 2 2 0
    Node 0, zone HighMem, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone HighMem, type Movable 236 62 6 2 2 1 1 0 1 1 16
    Node 0, zone HighMem, type Reserve 0 0 0 0 0 0 0 0 0 0 1
    Node 0, zone HighMem, type 0 0 0 0 0 0 0 0 0 0 0

    Number of blocks type Unmovable Reclaimable Movable Reserve
    Node 0, zone DMA 1 0 2 1 0
    Node 0, zone Normal 10 40 169 1 0
    Node 0, zone HighMem 2 0 283 1 0

    after:
    ------------------------------------------------------------------------------------------------------------------------
    Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10
    Node 0, zone DMA, type Unmovable 1 2 2 2 1 2 2 1 1 0 0
    Node 0, zone DMA, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone DMA, type Movable 2 3 3 1 3 3 2 0 0 0 0
    Node 0, zone DMA, type Reserve 0 0 0 0 0 0 0 0 0 0 1
    Node 0, zone DMA, type Isolate 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone Normal, type Unmovable 0 2 1 1 0 1 0 0 0 0 0
    Node 0, zone Normal, type Reclaimable 1 1 1 1 1 0 1 1 1 0 0
    Node 0, zone Normal, type Movable 0 1 1 1 0 1 0 1 0 0 196
    Node 0, zone Normal, type Reserve 0 0 0 0 0 0 0 0 0 0 1
    Node 0, zone Normal, type Isolate 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone HighMem, type Unmovable 0 1 0 0 0 1 1 1 2 2 0
    Node 0, zone HighMem, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone HighMem, type Movable 1 0 1 1 0 0 0 0 1 0 200
    Node 0, zone HighMem, type Reserve 0 0 0 0 0 0 0 0 0 0 1
    Node 0, zone HighMem, type Isolate 0 0 0 0 0 0 0 0 0 0 0

    Number of blocks type Unmovable Reclaimable Movable Reserve Isolate
    Node 0, zone DMA 1 0 2 1 0
    Node 0, zone Normal 8 4 207 1 0
    Node 0, zone HighMem 2 0 283 1 0

    Signed-off-by: KOSAKI Motohiro
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
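
    A minimal sketch of the fix: the name table used by /proc/pagetypeinfo simply
    gains the missing entry (ordering follows the migratetype enum):

        static char * const migratetype_names[MIGRATE_TYPES] = {
                "Unmovable",
                "Reclaimable",
                "Movable",
                "Reserve",
                "Isolate",      /* previously missing, so the row had no label */
        };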
     

06 Feb, 2008

3 commits

  • Remove the prefetch logic in order to avoid touching impossible per cpu
    areas.

    Signed-off-by: Christoph Lameter
    Cc: Mike Travis
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • We have repeatedly discussed if the cold pages still have a point. There is
    one way to join the two lists: Use a single list and put the cold pages at the
    end and the hot pages at the beginning. That way a single list can serve for
    both types of allocations.

    The discussion of the RFC for this and Mel's measurements indicate that
    there may not be too much of a point left to having separate lists for
    hot and cold pages (see http://marc.info/?t=119492914200001&r=1&w=2).

    Signed-off-by: Christoph Lameter
    Cc: Mel Gorman
    Cc: Martin Bligh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
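
    A minimal sketch of the single-list idea (pcp is the per-cpu pageset; hot
    pages go to the head, cold pages to the tail, so one list serves both):

        if (cold)
                list_add_tail(&page->lru, &pcp->list);  /* cold: consumed last */
        else
                list_add(&page->lru, &pcp->list);       /* hot: consumed first */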
     
  • 1. Add comments explaining how the function can be called.

    2. Collect global diffs in a local array and only spill
    them once into the global counters when the zone scan
    is finished. This means that we only touch each global
    counter once instead of each time we fold cpu counters
    into zone counters.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
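
    A simplified sketch of point 2 above (interrupt disabling around the per-cpu
    diff access and the NUMA draining logic are elided):

        void refresh_cpu_vm_stats(int cpu)
        {
                struct zone *zone;
                int i;
                int global_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };

                for_each_zone(zone) {
                        struct per_cpu_pageset *p = zone_pcp(zone, cpu);

                        for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
                                if (p->vm_stat_diff[i]) {
                                        int v = p->vm_stat_diff[i];

                                        p->vm_stat_diff[i] = 0;
                                        zone_page_state_add(v, zone, i);
                                        global_diff[i] += v;    /* defer */
                                }
                }

                /* touch each global counter at most once per pass */
                for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
                        if (global_diff[i])
                                atomic_long_add(global_diff[i], &vm_stat[i]);
        }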
     

15 Nov, 2007

1 commit

  • Mark start_cpu_timer() as __cpuinit instead of __devinit.
    Fixes this section warning:

    WARNING: vmlinux.o(.text+0x60e53): Section mismatch: reference to .init.text:start_cpu_timer (between 'vmstat_cpuup_callback' and 'vmstat_show')

    Signed-off-by: Randy Dunlap
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

17 Oct, 2007

3 commits

  • Convert the int all_unreclaimable member of struct zone to unsigned long
    flags. This can now be used to specify several different zone flags such as
    all_unreclaimable and reclaim_in_progress, which can now be removed and
    converted to a per-zone flag.

    Flags are set and cleared as follows:

    zone_set_flag(struct zone *zone, zone_flags_t flag)
    zone_clear_flag(struct zone *zone, zone_flags_t flag)

    Defines the first zone flags, ZONE_ALL_UNRECLAIMABLE and ZONE_RECLAIM_LOCKED,
    which have the same semantics as the old zone->all_unreclaimable and
    zone->reclaim_in_progress, respectively. Also converts all current users that
    set or clear either flag to use the new interface.

    Helper functions are defined to test the flags:

    int zone_is_all_unreclaimable(const struct zone *zone)
    int zone_is_reclaim_locked(const struct zone *zone)

    All flag operators are of the atomic variety because there are currently
    readers that do not take zone->lock.

    [akpm@linux-foundation.org: add needed include]
    Cc: Andrea Arcangeli
    Acked-by: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
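
    A minimal sketch of the interface described above (abbreviated; only the two
    initial flags are shown):

        typedef enum {
                ZONE_ALL_UNRECLAIMABLE, /* was zone->all_unreclaimable */
                ZONE_RECLAIM_LOCKED,    /* was zone->reclaim_in_progress */
        } zone_flags_t;

        /* struct zone gains: unsigned long flags; */

        static inline void zone_set_flag(struct zone *zone, zone_flags_t flag)
        {
                set_bit(flag, &zone->flags);    /* atomic: readers may not
                                                   hold zone->lock */
        }

        static inline void zone_clear_flag(struct zone *zone, zone_flags_t flag)
        {
                clear_bit(flag, &zone->flags);
        }

        static inline int zone_is_all_unreclaimable(const struct zone *zone)
        {
                return test_bit(ZONE_ALL_UNRECLAIMABLE, &zone->flags);
        }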
     
  • This patch contains the following cleanups:
    - make the needlessly global setup_vmstat() static
    - remove the unused refresh_vm_stats()

    Signed-off-by: Adrian Bunk
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • This patch provides fragmentation avoidance statistics via /proc/pagetypeinfo.
    The information is collected only on request so there is no runtime overhead.
    The statistics are in three parts:

    The first part prints information on the size of blocks that pages are
    being grouped on and looks like

    Page block order: 10
    Pages per block: 1024

    The second part is a more detailed version of /proc/buddyinfo and looks like

    Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10
    Node 0, zone DMA, type Unmovable 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone DMA, type Reclaimable 1 0 0 0 0 0 0 0 0 0 0
    Node 0, zone DMA, type Movable 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone DMA, type Reserve 0 4 4 0 0 0 0 1 0 1 0
    Node 0, zone Normal, type Unmovable 111 8 4 4 2 3 1 0 0 0 0
    Node 0, zone Normal, type Reclaimable 293 89 8 0 0 0 0 0 0 0 0
    Node 0, zone Normal, type Movable 1 6 13 9 7 6 3 0 0 0 0
    Node 0, zone Normal, type Reserve 0 0 0 0 0 0 0 0 0 0 4

    The third part looks like

    Number of blocks type Unmovable Reclaimable Movable Reserve
    Node 0, zone DMA 0 1 2 1
    Node 0, zone Normal 3 17 94 4

    To walk the zones within a node with interrupts disabled, walk_zones_in_node()
    is introduced and shared between /proc/buddyinfo, /proc/zoneinfo and
    /proc/pagetypeinfo to reduce code duplication. It seems specific to what
    vmstat.c requires but could be broken out as a general utility function in
    mmzone.c if there were other potential users.

    Signed-off-by: Mel Gorman
    Acked-by: Andy Whitcroft
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
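
    A sketch of the shared walker described above (close to what the patch adds;
    exact details may differ):

        static void walk_zones_in_node(struct seq_file *m, pg_data_t *pgdat,
                void (*print)(struct seq_file *m, pg_data_t *, struct zone *))
        {
                struct zone *node_zones = pgdat->node_zones;
                struct zone *zone;
                unsigned long flags;

                for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) {
                        if (!populated_zone(zone))
                                continue;

                        spin_lock_irqsave(&zone->lock, flags);
                        print(m, pgdat, zone);
                        spin_unlock_irqrestore(&zone->lock, flags);
                }
        }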
     

30 Jul, 2007

1 commit

  • Remove fs.h from mm.h. For this,
    1) Uninline vma_wants_writenotify(). It's pretty huge anyway.
    2) Add back fs.h or less bloated headers (err.h) to files that need it.

    As a result, on x86_64 allyesconfig, the number of files rebuilt due to an
    fs.h dependency is cut from 3929 down to 3444 (-12.3%).

    Cross-compile tested without regressions on my two usual configs and (sigh):

    alpha arm-mx1ads mips-bigsur powerpc-ebony
    alpha-allnoconfig arm-neponset mips-capcella powerpc-g5
    alpha-defconfig arm-netwinder mips-cobalt powerpc-holly
    alpha-up arm-netx mips-db1000 powerpc-iseries
    arm arm-ns9xxx mips-db1100 powerpc-linkstation
    arm-assabet arm-omap_h2_1610 mips-db1200 powerpc-lite5200
    arm-at91rm9200dk arm-onearm mips-db1500 powerpc-maple
    arm-at91rm9200ek arm-picotux200 mips-db1550 powerpc-mpc7448_hpc2
    arm-at91sam9260ek arm-pleb mips-ddb5477 powerpc-mpc8272_ads
    arm-at91sam9261ek arm-pnx4008 mips-decstation powerpc-mpc8313_rdb
    arm-at91sam9263ek arm-pxa255-idp mips-e55 powerpc-mpc832x_mds
    arm-at91sam9rlek arm-realview mips-emma2rh powerpc-mpc832x_rdb
    arm-ateb9200 arm-realview-smp mips-excite powerpc-mpc834x_itx
    arm-badge4 arm-rpc mips-fulong powerpc-mpc834x_itxgp
    arm-carmeva arm-s3c2410 mips-ip22 powerpc-mpc834x_mds
    arm-cerfcube arm-shannon mips-ip27 powerpc-mpc836x_mds
    arm-clps7500 arm-shark mips-ip32 powerpc-mpc8540_ads
    arm-collie arm-simpad mips-jazz powerpc-mpc8544_ds
    arm-corgi arm-spitz mips-jmr3927 powerpc-mpc8560_ads
    arm-csb337 arm-trizeps4 mips-malta powerpc-mpc8568mds
    arm-csb637 arm-versatile mips-mipssim powerpc-mpc85xx_cds
    arm-ebsa110 i386 mips-mpc30x powerpc-mpc8641_hpcn
    arm-edb7211 i386-allnoconfig mips-msp71xx powerpc-mpc866_ads
    arm-em_x270 i386-defconfig mips-ocelot powerpc-mpc885_ads
    arm-ep93xx i386-up mips-pb1100 powerpc-pasemi
    arm-footbridge ia64 mips-pb1500 powerpc-pmac32
    arm-fortunet ia64-allnoconfig mips-pb1550 powerpc-ppc64
    arm-h3600 ia64-bigsur mips-pnx8550-jbs powerpc-prpmc2800
    arm-h7201 ia64-defconfig mips-pnx8550-stb810 powerpc-ps3
    arm-h7202 ia64-gensparse mips-qemu powerpc-pseries
    arm-hackkit ia64-sim mips-rbhma4200 powerpc-up
    arm-integrator ia64-sn2 mips-rbhma4500 s390
    arm-iop13xx ia64-tiger mips-rm200 s390-allnoconfig
    arm-iop32x ia64-up mips-sb1250-swarm s390-defconfig
    arm-iop33x ia64-zx1 mips-sead s390-up
    arm-ixp2000 m68k mips-tb0219 sparc
    arm-ixp23xx m68k-amiga mips-tb0226 sparc-allnoconfig
    arm-ixp4xx m68k-apollo mips-tb0287 sparc-defconfig
    arm-jornada720 m68k-atari mips-workpad sparc-up
    arm-kafa m68k-bvme6000 mips-wrppmc sparc64
    arm-kb9202 m68k-hp300 mips-yosemite sparc64-allnoconfig
    arm-ks8695 m68k-mac parisc sparc64-defconfig
    arm-lart m68k-mvme147 parisc-allnoconfig sparc64-up
    arm-lpd270 m68k-mvme16x parisc-defconfig um-x86_64
    arm-lpd7a400 m68k-q40 parisc-up x86_64
    arm-lpd7a404 m68k-sun3 powerpc x86_64-allnoconfig
    arm-lubbock m68k-sun3x powerpc-cell x86_64-defconfig
    arm-lusl7200 mips powerpc-celleb x86_64-up
    arm-mainstone mips-atlas powerpc-chrp32

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

18 Jul, 2007

1 commit

  • The following 8 patches against 2.6.20-mm2 create a zone called ZONE_MOVABLE
    that is only usable by allocations that specify both __GFP_HIGHMEM and
    __GFP_MOVABLE. This has the effect of keeping all non-movable pages within a
    single memory partition while allowing movable allocations to be satisfied
    from either partition. The patches may be applied with the list-based
    anti-fragmentation patches that groups pages together based on mobility.

    The size of the zone is determined by a kernelcore= parameter specified at
    boot-time. This specifies how much memory is usable by non-movable
    allocations and the remainder is used for ZONE_MOVABLE. Any range of pages
    within ZONE_MOVABLE can be released by migrating the pages or by reclaiming.

    When selecting a zone to take pages from for ZONE_MOVABLE, there are two
    things to consider. First, only memory from the highest populated zone is
    used for ZONE_MOVABLE. On the x86, this is probably going to be ZONE_HIGHMEM
    but it would be ZONE_DMA on ppc64 or possibly ZONE_DMA32 on x86_64. Second,
    the amount of memory usable by the kernel will be spread evenly throughout
    NUMA nodes where possible. If the nodes are not of equal size, the amount of
    memory usable by the kernel on some nodes may be greater than others.

    By default, the zone is not as useful for hugetlb allocations because they are
    pinned and non-migratable (currently at least). A sysctl is provided that
    allows huge pages to be allocated from that zone. This means that the huge
    page pool can be resized to the size of ZONE_MOVABLE during the lifetime of
    the system assuming that pages are not mlocked. Despite huge pages being
    non-movable, we do not introduce additional external fragmentation of note as
    huge pages are always the largest contiguous block we care about.

    Credit goes to Andy Whitcroft for catching a large variety of problems during
    review of the patches.

    This patch creates an additional zone, ZONE_MOVABLE. This zone is only usable
    by allocations which specify both __GFP_HIGHMEM and __GFP_MOVABLE. Hot-added
    memory continues to be placed in its existing destination as there is no
    mechanism to redirect it to a specific zone.

    [y-goto@jp.fujitsu.com: Fix section mismatch of memory hotplug related code]
    [akpm@linux-foundation.org: various fixes]
    Signed-off-by: Mel Gorman
    Cc: Andy Whitcroft
    Signed-off-by: Yasunori Goto
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

07 Jul, 2007

1 commit

  • Line up the vmstat_text with zone_stat_item:

    enum zone_stat_item {
            /* First 128 byte cacheline (assuming 64 bit words) */
            NR_FREE_PAGES,
            NR_INACTIVE,
            NR_ACTIVE,

    We currently have nr_active and nr_inactive reversed.

    [ "OK with patch, though using initializers can be handy to prevent such
    things in future:

    static const char * const vmstat_text[] = {
            [NR_FREE_PAGES] = "nr_free_pages",
            ..."
    - Alexey ]

    Signed-off-by: Peter Zijlstra
    Acked-by: Alexey Dobriyan
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

22 May, 2007

1 commit

  • The first thing mm.h does is include sched.h, solely for the can_do_mlock()
    inline function, which dereferences "current". By dealing with can_do_mlock(),
    mm.h can be detached from sched.h, which is good. See below for why.

    This patch
    a) removes unconditional inclusion of sched.h from mm.h
    b) makes can_do_mlock() a normal function in mm/mlock.c
    c) exports can_do_mlock() to not break compilation
    d) adds sched.h inclusions back to files that were getting it indirectly.
    e) adds less bloated headers to some files (asm/signal.h, jiffies.h) that were
    getting them indirectly

    Net result is:
    a) mm.h users would get less code to open, read, preprocess, parse, ... if
    they don't need sched.h
    b) sched.h stops being a dependency for a significant number of files:
    on x86_64 allmodconfig touching sched.h results in recompile of 4083 files,
    after patch it's only 3744 (-8.3%).

    Cross-compile tested on

    all arm defconfigs, all mips defconfigs, all powerpc defconfigs,
    alpha alpha-up
    arm
    i386 i386-up i386-defconfig i386-allnoconfig
    ia64 ia64-up
    m68k
    mips
    parisc parisc-up
    powerpc powerpc-up
    s390 s390-up
    sparc sparc-up
    sparc64 sparc64-up
    um-x86_64
    x86_64 x86_64-up x86_64-defconfig x86_64-allnoconfig

    as well as my two usual configs.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

11 May, 2007

1 commit

  • VM statistics updates do not matter if the kernel is in idle powersaving
    mode. So allow the timer to be deferred.

    It would be better though if we could switch the timer between deferrable
    and nondeferrable based on differentials present. The timer would start
    out nondeferrable and if we find that there were no updates in the last
    statistics interval then we would switch the timer to deferrable. If the
    timer later finds again that there are differentials then go to
    nondeferrable again.

    And yet another way would be to run the timer shortly before going to idle?

    The solution here means that the VM counters may be slightly off during
    idle since differentials may be still pending while the timer is deferred.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
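
    A minimal sketch of the change, assuming the per-cpu vmstat work item
    introduced below (10 May) and the deferrable delayed-work initializer:

        /* was: INIT_DELAYED_WORK(&per_cpu(vmstat_work, cpu), vmstat_update); */
        INIT_DELAYED_WORK_DEFERRABLE(&per_cpu(vmstat_work, cpu), vmstat_update);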
     

10 May, 2007

4 commits

  • Currently the slab allocators contain callbacks into the page allocator to
    perform the draining of pagesets on remote nodes. This requires SLUB to have
    a whole subsystem in order to be compatible with SLAB. Moving node draining
    out of the slab allocators avoids a section of code in SLUB.

    Move the node draining so that it is done when the vm statistics are updated.
    At that point we are already touching all the cachelines with the pagesets of
    a processor.

    Add an expire counter there. If we have to update per zone or global vm
    statistics then assume that the pageset will require subsequent draining.

    The expire counter will be decremented on each vm stats update pass until it
    reaches zero. Then we will drain one batch from the pageset. The draining
    will cause vm counter updates which will then cause another expiration until
    the pcp is empty. So we will drain a batch every 3 seconds.

    Note that remote node draining is a somewhat esoteric feature that is required
    on large NUMA systems because otherwise significant portions of system memory
    can become trapped in pcp queues. The number of pcps is determined by the
    number of processors and nodes in a system. A system with 4 processors and 2
    nodes has 8 pcps, which is okay. But a system with 1024 processors and 512
    nodes has 512k pcps, with a high potential for a large amount of memory being
    caught in them.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
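
    A simplified sketch of the expire logic described above, as it might sit in
    the per-cpu vmstat refresh loop (saw_updates is a hypothetical flag set when
    any counter for this pageset was folded during this pass):

        #ifdef CONFIG_NUMA
                if (saw_updates)
                        p->expire = 3;          /* activity seen: re-arm */

                /* Never drain zones local to this processor, and give up
                 * once the pageset is empty. */
                if (zone_to_nid(zone) == numa_node_id() || !p->pcp[0].count)
                        p->expire = 0;
                else if (p->expire && !--p->expire)
                        drain_zone_pages(zone, &p->pcp[0]); /* one batch per pass */
        #endif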
     
  • Make it configurable. The code in mm now makes the vm statistics interval
    independent from the cache reaper; use that opportunity to make it
    configurable.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • vmstat is currently using the cache reaper to periodically bring the
    statistics up to date. The cache reaper only exists in SLUB as a way to
    provide compatibility with SLAB. This patch removes the vmstat calls from the
    slab allocators and provides its own handling.

    The advantage is also that we can use a different frequency for the updates.
    Refreshing vm stats is a pretty fast job so we can run this every second and
    stagger this by only one tick. This will lead to some overlap in large
    systems. For example, a system running at 250 HZ with 1024 processors will
    have 4 vm updates occurring at once.

    However, the vm stats update only accesses per node information. It is only
    necessary to stagger the vm statistics updates per processor in each node. Vm
    counter updates occurring on distant nodes will not cause cacheline
    contention.

    We could implement an alternate approach that runs the first processor on each
    node at the second and then each of the other processor on a node on a
    subsequent tick. That may be useful to keep a large amount of the second free
    of timer activity. Maybe the timer folks will have some feedback on this one?

    [jirislaby@gmail.com: add missing break]
    Cc: Arjan van de Ven
    Signed-off-by: Christoph Lameter
    Signed-off-by: Jiri Slaby
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
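
    A minimal sketch of the per-cpu timer described above (names follow
    mm/vmstat.c of that era but are abbreviated; the interval later became the
    configurable sysctl from the entry above):

        static void vmstat_update(struct work_struct *w)
        {
                refresh_cpu_vm_stats(smp_processor_id());
                schedule_delayed_work(&__get_cpu_var(vmstat_work), HZ); /* re-arm */
        }

        static void start_cpu_timer(int cpu)
        {
                struct delayed_work *work = &per_cpu(vmstat_work, cpu);

                INIT_DELAYED_WORK(work, vmstat_update);
                schedule_delayed_work_on(cpu, work, HZ + cpu);  /* stagger by cpu */
        }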
     
  • Since nonboot CPUs are now disabled after tasks and devices have been
    frozen and the CPU hotplug infrastructure is used for this purpose, we need
    special CPU hotplug notifications that will help the CPU-hotplug-aware
    subsystems distinguish normal CPU hotplug events from CPU hotplug events
    related to a system-wide suspend or resume operation in progress. This
    patch introduces such notifications and causes them to be used during
    suspend and resume transitions. It also changes all of the
    CPU-hotplug-aware subsystems to take these notifications into consideration
    (for now they are handled in the same way as the corresponding "normal"
    ones).

    [oleg@tv-sign.ru: cleanups]
    Signed-off-by: Rafael J. Wysocki
    Cc: Gautham R Shenoy
    Cc: Pavel Machek
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
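
    A minimal sketch of the pattern, using the vmstat hotplug callback as one
    example of an affected subsystem (case list abbreviated; the _FROZEN variants
    are the new suspend/resume notifications, handled identically for now):

        static int vmstat_cpuup_callback(struct notifier_block *nfb,
                                         unsigned long action, void *hcpu)
        {
                int cpu = (long)hcpu;

                switch (action) {
                case CPU_ONLINE:
                case CPU_ONLINE_FROZEN:         /* resume path */
                        start_cpu_timer(cpu);
                        break;
                case CPU_DEAD:
                case CPU_DEAD_FROZEN:           /* suspend path */
                        /* stop this cpu's timer, fold its counters */
                        break;
                }
                return NOTIFY_OK;
        }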
     

12 Feb, 2007

2 commits

  • Make ZONE_DMA optional in core code.

    - ifdef all code for ZONE_DMA and related definitions following the example
    for ZONE_DMA32 and ZONE_HIGHMEM.

    - Without ZONE_DMA, ZONE_HIGHMEM and ZONE_DMA32 we get to a ZONES_SHIFT of
    0.

    - Modify the VM statistics to work correctly without a DMA zone.

    - Modify slab to not create DMA slabs if there is no ZONE_DMA.

    [akpm@osdl.org: cleanup]
    [jdike@addtoit.com: build fix]
    [apw@shadowen.org: Simplify calculation of the number of bits we need for ZONES_SHIFT]
    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Cc: "Luck, Tony"
    Cc: Kyle McMartin
    Cc: Matthew Wilcox
    Cc: James Bottomley
    Cc: Paul Mundt
    Signed-off-by: Andy Whitcroft
    Signed-off-by: Jeff Dike
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Values are available via ZVC sums.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter