12 Dec, 2009

1 commit


24 Sep, 2009

1 commit

  • It's unused.

    It isn't needed -- the read or write flag is already passed, and sysctl
    shouldn't care about the rest.

    It _was_ used in two places in arch/frv for some reason.

    Signed-off-by: Alexey Dobriyan
    Cc: David Howells
    Cc: "Eric W. Biederman"
    Cc: Al Viro
    Cc: Ralf Baechle
    Cc: Martin Schwidefsky
    Cc: Ingo Molnar
    Cc: "David S. Miller"
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

22 Sep, 2009

5 commits

  • The following two patches remove searching in the page allocator fast-path
    by maintaining multiple free-lists in the per-cpu structure. At the time
    the search was introduced, increasing the per-cpu structures would waste a
    lot of memory as per-cpu structures were statically allocated at
    compile-time. This is no longer the case.

    The patches are as follows. They are based on mmotm-2009-08-27.

    Patch 1 adds multiple lists to struct per_cpu_pages, one per
    migratetype that can be stored on the PCP lists.

    Patch 2 notes that the pcpu drain path checks empty lists multiple times.
    The patch reduces the number of checks by maintaining a count of free
    lists encountered. Lists containing pages will then free multiple pages
    in batch.

    The patches were tested with kernbench, netperf udp/tcp, hackbench and
    sysbench. The netperf tests were not bound to any particular CPU and were
    run enough times that, with 99% confidence, the reported results are
    within 1% of the estimated mean. sysbench was run with a postgres backend
    and read-only tests. Similar to netperf, it was run enough times that,
    with 99% confidence, its results are within 1%. The patches were tested
    on x86, x86-64 and ppc64 as follows:

    x86: Intel Pentium D 3GHz with 8G RAM (no-brand machine)
    kernbench - No significant difference, variance well within noise
    netperf-udp - 1.34% to 2.28% gain
    netperf-tcp - 0.45% to 1.22% gain
    hackbench - Small variances, very close to noise
    sysbench - Very small gains

    x86-64: AMD Phenom 9950 1.3GHz with 8G RAM (no-brand machine)
    kernbench - No significant difference, variance well within noise
    netperf-udp - 1.83% to 10.42% gains
    netperf-tcp - Not conclusive until buffer >= PAGE_SIZE
    4096 +15.83%
    8192 + 0.34% (not significant)
    16384 + 1%
    hackbench - Small gains, very close to noise
    sysbench - 0.79% to 1.6% gain

    ppc64: PPC970MP 2.5GHz with 10GB RAM (it's a terrasoft powerstation)
    kernbench - No significant difference, variance well within noise
    netperf-udp - 2-3% gain for almost all buffer sizes tested
    netperf-tcp - losses on small buffers, gains on larger buffers
    possibly indicates some bad caching effect.
    hackbench - No significant difference
    sysbench - 2-4% gain

    This patch:

    Currently the per-cpu page allocator searches the PCP list for pages of
    the correct migrate-type to reduce the possibility of pages being
    inappropriately placed from a fragmentation perspective. This search is
    potentially expensive in a fast-path and undesirable. Splitting the
    per-cpu list into multiple lists increases the size of a per-cpu structure
    and this was potentially a major problem at the time the search was
    introduced. This problem has been mitigated as now only the necessary
    number of structures is allocated for the running system.

    This patch replaces a list search in the per-cpu allocator with one list
    per migrate type. The potential snag with this approach is bulk freeing
    of pages: we round-robin free pages based on migrate type, which has
    little bearing on the cache hotness of the page and potentially checks
    empty lists repeatedly if the majority of PCP pages are of one type.
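
    As a rough sketch of the data-structure side of this change (field names
    follow the kernel of that era and may differ slightly from the final
    commit), struct per_cpu_pages ends up carrying one list per migrate type
    that can live on the PCP lists:

    struct per_cpu_pages {
            int count;      /* number of pages in the lists */
            int high;       /* high watermark, emptying needed */
            int batch;      /* chunk size for buddy add/remove */

            /* One list per migrate type stored on the PCP lists */
            struct list_head lists[MIGRATE_PCPTYPES];
    };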

    Signed-off-by: Mel Gorman
    Acked-by: Nick Piggin
    Cc: Christoph Lameter
    Cc: Minchan Kim
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1, in
    which case shrink_list() _still_ calls isolate_pages() with the much
    larger SWAP_CLUSTER_MAX. It effectively scales up the inactive list scan
    rate by up to 32 times.

    For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
    So when shrink_zone() expects to scan 4 pages in the active/inactive list,
    the active list will be scanned by 4 pages, while the inactive list will
    in effect be (over-)scanned by SWAP_CLUSTER_MAX=32 pages. That could break
    the balance between the two lists.

    It can further impact the scan of the anon active list, due to the anon
    active/inactive ratio rebalancing logic in balance_pgdat()/shrink_zone():

    inactive anon list over scanned => inactive_anon_is_low() == TRUE
    => shrink_active_list()
    => active anon list over scanned

    So the end result may be

    - anon inactive => over scanned
    - anon active => over scanned (maybe not as much)
    - file inactive => over scanned
    - file active => under scanned (relatively)

    The accesses to nr_saved_scan are not lock protected and so not 100%
    accurate, however we can tolerate small errors and the resulting small
    imbalance in scan rates between zones.

    Cc: Rik van Riel
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Balbir Singh
    Reviewed-by: Minchan Kim
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • If the system is running a heavy load of processes then concurrent reclaim
    can isolate a large number of pages from the LRU. /proc/vmstat and the
    output generated for an OOM do not show how many pages were isolated.

    This has been observed during process fork bomb testing (mstctl11 in LTP).

    This patch shows the information about isolated pages.

    Reproduced via:

    -----------------------
    % ./hackbench 140 process 1000
    => OOM occurs

    active_anon:146 inactive_anon:0 isolated_anon:49245
    active_file:79 inactive_file:18 isolated_file:113
    unevictable:0 dirty:0 writeback:0 unstable:0 buffer:39
    free:370 slab_reclaimable:309 slab_unreclaimable:5492
    mapped:53 shmem:15 pagetables:28140 bounce:0
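
    As a hedged sketch of the accounting this adds (the call sites are
    simplified here, and the snippet assumes NR_ISOLATED_ANON and
    NR_ISOLATED_FILE are adjacent vmstat items), the isolation path bumps a
    per-zone counter that the putback/free path later drops, so /proc/vmstat
    and the OOM report can show how many pages are currently off the LRU:

    /* 'file' is 0 for anon pages, 1 for file-backed pages */
    __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);
    /* ... pages are reclaimed or put back ... */
    __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);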

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Acked-by: Wu Fengguang
    Reviewed-by: Minchan Kim
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
    Recently we encountered OOM problems due to memory use of the GEM cache.
    Generally, a large amount of Shmem/Tmpfs pages tends to create a memory
    shortage problem.

    We often use the following calculation to determine the amount of shmem
    pages:

    shmem = NR_ACTIVE_ANON + NR_INACTIVE_ANON - NR_ANON_PAGES

    however the expression does not consider isolated and mlocked pages.

    This patch adds explicit accounting for pages used by shmem and tmpfs.
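
    For illustration, the old estimate can be written directly against the
    global vmstat counters; this snippet is illustrative only (the counter
    names exist in this era, the variable is hypothetical):

    /* old approximation: undercounts when anon pages are isolated or mlocked */
    unsigned long shmem_estimate =
            global_page_state(NR_ACTIVE_ANON) +
            global_page_state(NR_INACTIVE_ANON) -
            global_page_state(NR_ANON_PAGES);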

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Reviewed-by: Christoph Lameter
    Acked-by: Wu Fengguang
    Cc: David Rientjes
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • The amount of memory allocated to kernel stacks can become significant and
    cause OOM conditions. However, we do not display the amount of memory
    consumed by stacks.

    Add code to display the amount of memory used for stacks in /proc/meminfo.
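
    A sketch of an accounting helper along the lines this patch introduces
    (treat it as illustrative rather than the exact committed code): the zone
    owning each thread_info page gets a NR_KERNEL_STACK delta on fork and on
    exit.

    static void account_kernel_stack(struct thread_info *ti, int account)
    {
            struct zone *zone = page_zone(virt_to_page(ti));

            /* account = 1 when a stack is allocated, -1 when it is freed */
            mod_zone_page_state(zone, NR_KERNEL_STACK, account);
    }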

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Christoph Lameter
    Reviewed-by: Minchan Kim
    Reviewed-by: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

17 Jun, 2009

4 commits

  • Currently, nobody wants to turn UNEVICTABLE_LRU off. Thus this
    configurability is unnecessary.

    Signed-off-by: KOSAKI Motohiro
    Cc: Johannes Weiner
    Cc: Andi Kleen
    Acked-by: Minchan Kim
    Cc: David Woodhouse
    Cc: Matt Mackall
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
    The vmscan batching logic is convoluted. Move it into a standalone
    function nr_scan_try_batch() and document it. No behavior change.
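
    The extracted helper looks roughly like this -- a sketch of the batching
    idea: accumulate small scan requests until a full batch is available,
    then release it in one go.

    static unsigned long nr_scan_try_batch(unsigned long nr_to_scan,
                                           unsigned long *nr_saved_scan,
                                           unsigned long swap_cluster_max)
    {
            unsigned long nr;

            *nr_saved_scan += nr_to_scan;
            nr = *nr_saved_scan;

            if (nr >= swap_cluster_max)
                    *nr_saved_scan = 0;     /* release the accumulated batch */
            else
                    nr = 0;                 /* keep saving; scan nothing yet */

            return nr;
    }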

    Signed-off-by: Wu Fengguang
    Acked-by: Rik van Riel
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Acked-by: Peter Zijlstra
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
    ALLOC_WMARK_MIN, ALLOC_WMARK_LOW and ALLOC_WMARK_HIGH determine whether
    pages_min, pages_low or pages_high is used as the zone watermark when
    allocating the pages. Two branches in the allocator hotpath determine
    which watermark to use.

    This patch uses the flags as an index into a watermark array that is
    indexed with the WMARK_* defines and accessed via helpers. All call sites
    that use zone->pages_* are updated to use the helpers for accessing the
    values and the array offsets for setting.
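
    Sketched out, the interface amounts to an enum of array offsets plus
    trivial accessor macros (names as described above):

    enum zone_watermarks {
            WMARK_MIN,
            WMARK_LOW,
            WMARK_HIGH,
            NR_WMARK
    };

    /* these replace the old zone->pages_min/pages_low/pages_high fields */
    #define min_wmark_pages(z)      (z->watermark[WMARK_MIN])
    #define low_wmark_pages(z)      (z->watermark[WMARK_LOW])
    #define high_wmark_pages(z)     (z->watermark[WMARK_HIGH])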

    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Cc: KOSAKI Motohiro
    Cc: Pekka Enberg
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • On low-memory systems, anti-fragmentation gets disabled as there is
    nothing it can do and it would just incur overhead shuffling pages between
    lists constantly. Currently the check is made in the free page fast path
    for every page. This patch moves it to a slow path. On machines with low
    memory, there will be a small amount of additional overhead as pages get
    shuffled between lists, but it should quickly settle.
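
    A sketch of the slow-path placement of the check (helper names match the
    kernel of that era, but treat the snippet as illustrative): once the
    pageblock's migratetype is forced to MIGRATE_UNMOVABLE at set time, the
    free fast path no longer needs to test page_group_by_mobility_disabled
    for every page.

    static void set_pageblock_migratetype(struct page *page, int migratetype)
    {
            /* anti-fragmentation is disabled on low-memory systems */
            if (unlikely(page_group_by_mobility_disabled))
                    migratetype = MIGRATE_UNMOVABLE;

            set_pageblock_flags_group(page, (unsigned long)migratetype,
                                      PB_migrate, PB_migrate_end);
    }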

    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Cc: Pekka Enberg
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

18 May, 2009

1 commit

    pfn_valid() is meant to be able to tell whether a given PFN has valid
    memmap associated with it or not. In FLATMEM, it is expected that holes
    always have valid memmap as long as there are valid PFNs on either side
    of the hole. In SPARSEMEM, it is assumed that a valid section has a
    memmap for the entire section.

    However, ARM and maybe other embedded architectures in the future free
    memmap backing holes to save memory on the assumption the memmap is never
    used. The page_zone linkages are then broken even though pfn_valid()
    returns true. A walker of the full memmap must then do an additional
    check to ensure the memmap it is looking at is sane, by making sure the
    zone and PFN linkages are still valid. This is expensive, but walkers of
    the full memmap are extremely rare.

    This was caught before for FLATMEM and hacked around, but it hits again
    for SPARSEMEM because the page_zone linkages can look OK where the PFN
    linkages are totally screwed. This looks like a hatchet job, but the
    reality is that any clean solution would end up consuming all the memory
    saved by punching these unexpected holes in the memmap. For example, we
    tried marking the memmap within the section invalid, but the section size
    exceeds the size of the hole in most cases, so pfn_valid() starts
    returning false where valid memmap exists. Shrinking the size of the
    section would increase memory consumption, offsetting the gains.

    This patch identifies when an architecture is punching unexpected holes
    in the memmap that the memory model cannot automatically detect, and sets
    ARCH_HAS_HOLES_MEMORYMODEL. At the moment, this is restricted to EP93xx,
    which is the sub-architecture this has been reported on, but it may
    expand later. When set, walkers of the full memmap must call
    memmap_valid_within() for each PFN, passing in what they expect the page
    and zone to be for that PFN. If the linkages are found to be broken, the
    memmap is considered invalid for that PFN.
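
    The check itself is small; roughly:

    int memmap_valid_within(unsigned long pfn, struct page *page,
                            struct zone *zone)
    {
            if (page_to_pfn(page) != pfn)
                    return 0;       /* PFN linkage broken */

            if (page_zone(page) != zone)
                    return 0;       /* zone linkage broken */

            return 1;
    }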

    Signed-off-by: Mel Gorman
    Signed-off-by: Russell King

    Mel Gorman
     

06 Apr, 2009

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-cpumask: (36 commits)
    cpumask: remove cpumask allocation from idle_balance, fix
    numa, cpumask: move numa_node_id default implementation to topology.h, fix
    cpumask: remove cpumask allocation from idle_balance
    x86: cpumask: x86 mmio-mod.c use cpumask_var_t for downed_cpus
    x86: cpumask: update 32-bit APM not to mug current->cpus_allowed
    x86: microcode: cleanup
    x86: cpumask: use work_on_cpu in arch/x86/kernel/microcode_core.c
    cpumask: fix CONFIG_CPUMASK_OFFSTACK=y cpu hotunplug crash
    numa, cpumask: move numa_node_id default implementation to topology.h
    cpumask: convert node_to_cpumask_map[] to cpumask_var_t
    cpumask: remove x86 cpumask_t uses.
    cpumask: use cpumask_var_t in uv_flush_tlb_others.
    cpumask: remove cpumask_t assignment from vector_allocation_domain()
    cpumask: make Xen use the new operators.
    cpumask: clean up summit's send_IPI functions
    cpumask: use new cpumask functions throughout x86
    x86: unify cpu_callin_mask/cpu_callout_mask/cpu_initialized_mask/cpu_sibling_setup_mask
    cpumask: convert struct cpuinfo_x86's llc_shared_map to cpumask_var_t
    cpumask: convert node_to_cpumask_map[] to cpumask_var_t
    x86: unify 32 and 64-bit node_to_cpumask_map
    ...

    Linus Torvalds
     

01 Apr, 2009

1 commit

  • Impact: cleanup

    In almost all cases, for_each_zone() is used together with
    populated_zone(), because most callers don't need information about
    memoryless nodes. Therefore, for_each_populated_zone() can help simplify
    the code.

    This patch has no functional change.
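
    The new iterator is a thin wrapper over for_each_zone() plus the
    populated_zone() filter; roughly:

    #define for_each_populated_zone(zone)                       \
            for (zone = (first_online_pgdat())->node_zones;     \
                 zone;                                          \
                 zone = next_zone(zone))                        \
                    if (!populated_zone(zone))                  \
                            ; /* skip empty/memoryless zones */ \
                    else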

    [akpm@linux-foundation.org: small cleanup]
    Signed-off-by: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Reviewed-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

13 Mar, 2009

1 commit

  • Impact: cleanup, potential bugfix

    Not sure what changed to expose this, but clearly numa_node_id() doesn't
    belong in mmzone.h (the inline in gfp.h is probably overkill, too).

    In file included from include/linux/topology.h:34,
    from arch/x86/mm/numa.c:2:
    /home/rusty/patches-cpumask/linux-2.6/arch/x86/include/asm/topology.h:64:1: warning: "numa_node_id" redefined
    In file included from include/linux/topology.h:32,
    from arch/x86/mm/numa.c:2:
    include/linux/mmzone.h:770:1: warning: this is the location of the previous definition

    Signed-off-by: Rusty Russell
    Cc: Mike Travis
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Rusty Russell
     

19 Feb, 2009

1 commit

    Currently, early_pfn_in_nid(PFN, NID) may return false if the PFN is a
    hole, in which case memmap initialization is not done for it. This broke
    booting on sparc.

    To fix this, the PFN should be initialized and marked as PG_reserved.
    This patch changes early_pfn_in_nid() to return true if the PFN is a hole.
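
    A sketch of the fixed behaviour (the helper name __early_pfn_to_nid and
    its return-negative-for-a-hole convention are assumptions based on the
    description above):

    bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
    {
            int nid = __early_pfn_to_nid(pfn);      /* assumed: < 0 for a hole */

            if (nid >= 0 && nid != node)
                    return false;
            /* a hole: claim it so memmap init marks the page PG_reserved */
            return true;
    }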

    Signed-off-by: KAMEZAWA Hiroyuki
    Reported-by: David Miller
    Tested-by: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Heiko Carstens
    Cc: [2.6.25.x, 2.6.26.x, 2.6.27.x, 2.6.28.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

09 Jan, 2009

1 commit

    Add a zone_reclaim_stat struct for later enhancement.

    A later patch uses this. This patch doesn't make any behavior change (yet).
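
    A sketch of the new struct (the field names are my assumption about the
    rotation/scan statistics it groups together):

    struct zone_reclaim_stat {
            /*
             * Per list type (0 == anon, 1 == file): how many pages were
             * recently scanned, and how many of those were rotated back
             * onto the active list.
             */
            unsigned long recent_rotated[2];
            unsigned long recent_scanned[2];
    };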

    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

20 Oct, 2008

6 commits

    Allocate all page_cgroup structures at boot and remove the page_cgroup
    pointer from struct page. This patch adds an interface:

    struct page_cgroup *lookup_page_cgroup(struct page*)

    All of FLATMEM/DISCONTIGMEM/SPARSEMEM and MEMORY_HOTPLUG are supported.

    Removing the page_cgroup pointer reduces the amount of memory used by
    - 4 bytes per PAGE_SIZE on 32-bit, or
    - 8 bytes per PAGE_SIZE on 64-bit,
    when the memory controller is disabled (even if it is configured in).

    On a typical 8GB x86-32 server, this saves 8MB of NORMAL_ZONE memory.
    On my x86-64 server with 48GB of memory, this saves 96MB of memory.
    I think this reduction makes sense.

    By pre-allocating, the kmalloc/kfree calls in charge/uncharge are removed.
    This means
    - we no longer need to be afraid of kmalloc failure
    (this can happen because of the gfp_mask type),
    - we avoid calling kmalloc/kfree,
    - we avoid allocating tons of small objects which can become fragmented,
    - we know how much memory will be used for this extra-LRU handling.

    I added printk messages:

    "allocated %ld bytes of page_cgroup"
    "please try cgroup_disable=memory option if you don't want"

    which should be informative enough for users.
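
    For the non-sparsemem case the lookup can be a simple offset into a
    per-node array; a hedged sketch (the node_page_cgroup field name is an
    assumption here):

    struct page_cgroup *lookup_page_cgroup(struct page *page)
    {
            unsigned long pfn = page_to_pfn(page);
            pg_data_t *pgdat = NODE_DATA(page_to_nid(page));
            struct page_cgroup *base = pgdat->node_page_cgroup; /* assumed field */

            if (unlikely(!base))
                    return NULL;

            return base + (pfn - pgdat->node_start_pfn);
    }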

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Add NR_MLOCK zone page state, which provides a (conservative) count of
    mlocked pages (actually, the number of mlocked pages moved off the LRU).

    Reworked by lts to fit in with the modified mlock page support in the
    Reclaim Scalability series.

    [kosaki.motohiro@jp.fujitsu.com: fix incorrect Mlocked field of /proc/meminfo]
    [lee.schermerhorn@hp.com: mlocked-pages: add event counting with statistics]
    Signed-off-by: Nick Piggin
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • When the system contains lots of mlocked or otherwise unevictable pages,
    the pageout code (kswapd) can spend lots of time scanning over these
    pages. Worse still, the presence of lots of unevictable pages can confuse
    kswapd into thinking that more aggressive pageout modes are required,
    resulting in all kinds of bad behaviour.

    Infrastructure to manage pages excluded from reclaim--i.e., hidden from
    vmscan. Based on a patch by Larry Woodman of Red Hat. Reworked to
    maintain "unevictable" pages on a separate per-zone LRU list, to "hide"
    them from vmscan.

    Kosaki Motohiro added the support for the memory controller unevictable
    lru list.

    Pages on the unevictable list have both PG_unevictable and PG_lru set.
    Thus, PG_unevictable is analogous to and mutually exclusive with
    PG_active--it specifies which LRU list the page is on.

    The unevictable infrastructure is enabled by a new mm Kconfig option
    [CONFIG_]UNEVICTABLE_LRU.

    A new function 'page_evictable(page, vma)' in vmscan.c tests whether or
    not a page may be evictable. Subsequent patches will add the various
    !evictable tests. We'll want to keep these tests light-weight for use in
    shrink_active_list() and, possibly, the fault path.

    To avoid races between tasks putting pages [back] onto an LRU list and
    tasks that might be moving the page from non-evictable to evictable state,
    the new function 'putback_lru_page()' -- inverse to 'isolate_lru_page()'
    -- tests the "evictability" of a page after placing it on the LRU, before
    dropping the reference. If the page has become unevictable,
    putback_lru_page() will redo the 'putback', thus moving the page to the
    unevictable list. This way, we avoid "stranding" evictable pages on the
    unevictable list.
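
    For reference, a sketch of the evictability test described above
    (is_mlocked_vma() stands in for the VMA-based checks added by the
    follow-up patches):

    int page_evictable(struct page *page, struct vm_area_struct *vma)
    {
            if (mapping_unevictable(page_mapping(page)))
                    return 0;       /* e.g. an SHM_LOCK'd shmem mapping */

            if (PageMlocked(page) || (vma && is_mlocked_vma(vma, page)))
                    return 0;

            return 1;
    }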

    [akpm@linux-foundation.org: fix fallout from out-of-order merge]
    [riel@redhat.com: fix UNEVICTABLE_LRU and !PROC_PAGE_MONITOR build]
    [nishimura@mxp.nes.nec.co.jp: remove redundant mapping check]
    [kosaki.motohiro@jp.fujitsu.com: unevictable-lru-infrastructure: putback_lru_page()/unevictable page handling rework]
    [kosaki.motohiro@jp.fujitsu.com: kill unnecessary lock_page() in vmscan.c]
    [kosaki.motohiro@jp.fujitsu.com: revert migration change of unevictable lru infrastructure]
    [kosaki.motohiro@jp.fujitsu.com: revert to unevictable-lru-infrastructure-kconfig-fix.patch]
    [kosaki.motohiro@jp.fujitsu.com: restore patch failure of vmstat-unevictable-and-mlocked-pages-vm-events.patch]
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Debugged-by: Benjamin Kidwell
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • We avoid evicting and scanning anonymous pages for the most part, but
    under some workloads we can end up with most of memory filled with
    anonymous pages. At that point, we suddenly need to clear the referenced
    bits on all of memory, which can take ages on very large memory systems.

    We can reduce the maximum number of pages that need to be scanned by not
    taking the referenced state into account when deactivating an anonymous
    page. After all, every anonymous page starts out referenced, so why
    check?

    If an anonymous page gets referenced again before it reaches the end of
    the inactive list, we move it back to the active list.

    To keep the maximum amount of necessary work reasonable, we scale the
    active to inactive ratio with the size of memory, using the formula
    active:inactive ratio = sqrt(memory in GB * 10).

    Kswapd CPU use now seems to scale by the amount of pageout bandwidth,
    instead of by the amount of memory present in the system.
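
    A sketch of the ratio calculation described above (helper form as in the
    kernel of that era; treat it as illustrative):

    static void setup_per_zone_inactive_ratio(void)
    {
            struct zone *zone;

            for_each_zone(zone) {
                    unsigned int gb, ratio;

                    /* zone size in gigabytes */
                    gb = zone->present_pages >> (30 - PAGE_SHIFT);
                    ratio = int_sqrt(10 * gb);
                    if (!ratio)
                            ratio = 1;

                    zone->inactive_ratio = ratio;
            }
    }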

    [kamezawa.hiroyu@jp.fujitsu.com: fix OOM with memcg]
    [kamezawa.hiroyu@jp.fujitsu.com: memcg: lru scan fix]
    Signed-off-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Split the LRU lists in two, one set for pages that are backed by real file
    systems ("file") and one for pages that are backed by memory and swap
    ("anon"). The latter includes tmpfs.

    The advantage of doing this is that the VM will not have to scan over lots
    of anonymous pages (which we generally do not want to swap out), just to
    find the page cache pages that it should evict.

    This patch has the infrastructure and a basic policy to balance how much
    we scan the anon lists and how much we scan the file lists. The big
    policy changes are in separate patches.
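
    The anon/file distinction hinges on a small predicate; roughly (a sketch
    with the return value simplified to a boolean):

    static inline int page_is_file_cache(struct page *page)
    {
            /* swap-backed pages (anon and tmpfs/shmem) go on the anon lists */
            return !PageSwapBacked(page);
    }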

    [lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset]
    [kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru]
    [kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page]
    [hugh@veritas.com: memcg swapbacked pages active]
    [hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED]
    [akpm@linux-foundation.org: fix /proc/vmstat units]
    [nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration]
    [kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo]
    [kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()]
    Signed-off-by: Rik van Riel
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Hugh Dickins
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Currently we are defining explicit variables for the inactive and active
    list. An indexed array can be more generic and avoid repeating similar
    code in several places in the reclaim code.

    We are saving a few bytes in terms of code size:

    Before:

    text data bss dec hex filename
    4097753 573120 4092484 8763357 85b7dd vmlinux

    After:

    text data bss dec hex filename
    4097729 573120 4092484 8763333 85b7c5 vmlinux

    Having an easy way to add new lru lists may ease future work on the
    reclaim code.
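
    A sketch of the indexed layout this introduces, shown here before the
    anon/file split extends the enum:

    enum lru_list {
            LRU_INACTIVE,
            LRU_ACTIVE,
            NR_LRU_LISTS
    };

    #define for_each_lru(l) for (l = 0; l < NR_LRU_LISTS; l++)

    /* member of struct zone: one list plus scan counter per LRU type */
    struct {
            struct list_head list;
            unsigned long nr_scan;
    } lru[NR_LRU_LISTS];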

    Signed-off-by: Rik van Riel
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Christoph Lameter
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

14 Sep, 2008

1 commit

    The iterator for_each_zone_zonelist() uses a struct zoneref *z cursor when
    scanning zonelists to keep track of where in the zonelist it is. The
    zoneref that is returned corresponds to the next zone that is to be
    scanned, not the current one. It was intended to be treated as an opaque
    list.

    When the page allocator is scanning a zonelist, it marks elements in the
    zonelist corresponding to zones that are temporarily full. As the
    zonelist is being updated, it uses the cursor here:

    if (NUMA_BUILD)
    zlc_mark_zone_full(zonelist, z);

    This is intended to prevent rescanning in the near future, but the zoneref
    cursor does not correspond to the zone that has been found to be full.
    This is an easy misunderstanding to make, so this patch corrects the
    problem by changing the zoneref cursor to be the current zone being
    scanned instead of the next one.
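
    After the fix, the cursor handed back by for_each_zone_zonelist() refers
    to the zone just examined, so marking it full does what the caller
    expects. A simplified sketch of the allocator loop:

    for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
            if (!zone_watermark_ok(zone, order, mark,
                                   classzone_idx, alloc_flags)) {
                    if (NUMA_BUILD)
                            /* z now corresponds to 'zone', not its successor */
                            zlc_mark_zone_full(zonelist, z);
                    continue;
            }
            /* try to allocate from 'zone' ... */
    }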

    Signed-off-by: Mel Gorman
    Cc: Andy Whitcroft
    Cc: KAMEZAWA Hiroyuki
    Cc: [2.6.26.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

25 May, 2008

1 commit


30 Apr, 2008

2 commits

  • Remove the "#ifdef __KERNEL__" tests from unexported header files in
    linux/include whose entire contents are wrapped in that preprocessor
    test.

    Signed-off-by: Robert P. J. Day
    Cc: David Woodhouse
    Cc: Sam Ravnborg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robert P. J. Day
     
  • Fuse will use temporary buffers to write back dirty data from memory mappings
    (normal writes are done synchronously). This is needed, because there cannot
    be any guarantee about the time in which a write will complete.

    By using temporary buffers, from the MM's point of view the page is written
    back immediately. If the writeout was due to memory pressure, this
    effectively migrates data from a full zone to a less full zone.

    This patch adds a new counter (NR_WRITEBACK_TEMP) for the number of pages used
    as temporary buffers.

    [Lee.Schermerhorn@hp.com: add vmstat_text for NR_WRITEBACK_TEMP]
    Signed-off-by: Miklos Szeredi
    Cc: Christoph Lameter
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     

28 Apr, 2008

8 commits

    This patch set frees pages which are allocated by bootmem for memory
    hot-remove. Some memory management structures are allocated by bootmem,
    e.g. the memmap.

    To remove memory physically, some of them must be freed according to
    circumstance. This patch set lays the basis for freeing those pages, and
    frees memmaps.

    The basic idea is to use the remaining members of struct page to remember
    information about the user of the bootmem page (section number or node
    id). When the section is being removed, the kernel can check this. With
    this information, the following issues can be solved.

    1) When the memmap of a section being removed is allocated in another
    section by bootmem, it should/can be freed.
    2) When the memmap of a section being removed is allocated in the
    same section, it shouldn't be freed. The section has to have been
    logically offlined already and all pages must be isolated from the
    page allocator. If it were freed, the page allocator might use it,
    even though it will be removed physically soon.
    3) When a section being removed holds another section's memmap,
    the kernel will be able to easily show the user which section should
    be removed before it. (Not implemented yet)
    4) In case 2) above, page isolation will be able to check and skip the
    memmap's pages during logical memory offline (offline_pages()).
    The current page isolation code fails in this case because the page is
    just a reserved page and it can't tell whether the page can be
    removed or not. But it will be able to do so with this patch.
    (Not implemented yet.)
    5) Node information such as pgdat has similar issues, but this
    will be solvable too by the same approach.
    (Not implemented yet, but by remembering the node id in the pages.)

    Fortunately, the current bootmem allocator just keeps the PageReserved
    flag and doesn't use any other members of struct page. The users of
    bootmem don't use them either.

    This patch:

    This registers information, namely the node or section id, so the kernel
    can distinguish which node/section uses the pages allocated by bootmem.
    This is the basis for hot-removing sections or nodes.
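
    A heavily hedged sketch of the registration idea (the exact struct page
    members used for the type tag and the info value are assumptions on my
    part):

    /* stash "who uses this bootmem page" in otherwise-unused page fields */
    static void get_page_bootmem(unsigned long info, struct page *page,
                                 unsigned long type)
    {
            page->lru.next = (struct list_head *)type; /* e.g. section vs node info */
            SetPagePrivate(page);
            set_page_private(page, info);              /* section number or node id */
            atomic_inc(&page->_count);
    }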

    Signed-off-by: Yasunori Goto
    Cc: Badari Pulavarty
    Cc: Yinghai Lu
    Cc: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasunori Goto
     
    __ZONE_COUNT was used to compensate because MAX_NR_ZONES was not available
    to the #ifdefs. Export MAX_NR_ZONES via the new mechanism and get rid of
    __ZONE_COUNT.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
    NR_PAGEFLAGS specifies the number of page flags we are using. From that we
    can calculate the number of bits left over that can be used for the zone,
    node (and maybe the section id). There is no longer any need for
    FLAGS_RESERVED if we use NR_PAGEFLAGS.

    Use the new methods to make NR_PAGEFLAGS available via the preprocessor.
    NR_PAGEFLAGS is used to calculate field boundaries in the page flags fields.
    These field widths have to be available to the preprocessor.
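
    For illustration, this is the sort of preprocessor arithmetic it enables
    (a sketch of the pattern, not the exact committed lines):

    /* give the node id a field in page->flags only if it still fits */
    #if SECTIONS_WIDTH + ZONES_WIDTH + NODES_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
    #define NODES_WIDTH     NODES_SHIFT
    #else
    #define NODES_WIDTH     0       /* fall back to looking the node up elsewhere */
    #endif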

    Signed-off-by: Christoph Lameter
    Cc: David Miller
    Cc: Andy Whitcroft
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Jeremy Fitzhardinge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Fix this (sparc64)

    mm/sparse-vmemmap.c: In function `vmemmap_verify':
    mm/sparse-vmemmap.c:64: warning: unused variable `pfn'

    by switching to a C function which touches its arg.

    (reason 3,555 why macros are bad)

    Also, the `nid' arg was misnamed.
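
    The underlying pattern, with hypothetical names for illustration: a macro
    that ignores its argument leaves the caller's local variable unused,
    while a static inline function always "uses" it.

    /* before: pfn is never evaluated, so callers get "unused variable" warnings */
    #define pfn_to_nid_stub(pfn)    (0)

    /* after: the argument is consumed by the call itself */
    static inline int pfn_to_nid_fn(unsigned long pfn)
    {
            return 0;
    }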

    Reviewed-by: Christoph Lameter
    Acked-by: Andy Whitcroft
    Cc: Mel Gorman
    Cc: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • The MPOL_BIND policy creates a zonelist that is used for allocations
    controlled by that mempolicy. As the per-node zonelist is already being
    filtered based on a zone id, this patch adds a version of __alloc_pages() that
    takes a nodemask for further filtering. This eliminates the need for
    MPOL_BIND to create a custom zonelist.

    A positive benefit of this is that allocations using MPOL_BIND now use the
    local node's distance-ordered zonelist instead of a custom node-id-ordered
    zonelist. I.e., pages will be allocated from the closest allowed node with
    available memory.
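
    To the best of my knowledge the new entry point has the following shape;
    the existing __alloc_pages() path simply passes a NULL nodemask (sketch):

    struct page *
    __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
                           struct zonelist *zonelist, nodemask_t *nodemask);

    static inline struct page *
    __alloc_pages(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist)
    {
            return __alloc_pages_nodemask(gfp_mask, order, zonelist, NULL);
    }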

    [Lee.Schermerhorn@hp.com: Mempolicy: update stale documentation and comments]
    [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask]
    [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask rework]
    Signed-off-by: Mel Gorman
    Acked-by: Christoph Lameter
    Signed-off-by: Lee Schermerhorn
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    Filtering zonelists requires very frequent use of zone_idx(). This is
    costly as it involves a lookup of another structure and a subtraction
    operation. As the zone_idx is often required, it should be quickly
    accessible. The node idx could also be stored here if accessing
    zone->node were found to be significant, which may be the case on
    workloads where nodemasks are heavily used.

    This patch introduces a struct zoneref to store a zone pointer and a zone
    index. The zonelist then consists of an array of these struct zonerefs which
    are looked up as necessary. Helpers are given for accessing the zone index as
    well as the node index.
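
    Sketched, the new structure and its accessors:

    struct zoneref {
            struct zone *zone;      /* pointer to the actual zone */
            int zone_idx;           /* cached zone_idx(zone) */
    };

    static inline struct zone *zonelist_zone(struct zoneref *zoneref)
    {
            return zoneref->zone;
    }

    static inline int zonelist_zone_idx(struct zoneref *zoneref)
    {
            return zoneref->zone_idx;       /* no pointer subtraction needed */
    }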

    [kamezawa.hiroyu@jp.fujitsu.com: Suggested struct zoneref instead of embedding information in pointers]
    [hugh@veritas.com: mm-have-zonelist: fix memcg ooms]
    [hugh@veritas.com: just return do_try_to_free_pages]
    [hugh@veritas.com: do_try_to_free_pages gfp_mask redundant]
    Signed-off-by: Mel Gorman
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Signed-off-by: Lee Schermerhorn
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Currently a node has two sets of zonelists, one for each zone type in the
    system and a second set for GFP_THISNODE allocations. Based on the zones
    allowed by a gfp mask, one of these zonelists is selected. All of these
    zonelists consume memory and occupy cache lines.

    This patch replaces the multiple zonelists per-node with two zonelists. The
    first contains all populated zones in the system, ordered by distance, for
    fallback allocations when the target/preferred node has no free pages. The
    second contains all populated zones in the node suitable for GFP_THISNODE
    allocations.

    An iterator macro called for_each_zone_zonelist() is introduced that
    iterates through each zone allowed by the GFP flags in the selected
    zonelist.
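
    Typical usage of the iterator, as a sketch:

    struct zoneref *z;
    struct zone *zone;
    enum zone_type high_zoneidx = gfp_zone(gfp_mask);

    for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
            /*
             * visits each zone allowed by gfp_mask, in distance
             * (fallback) order
             */
    }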

    Signed-off-by: Mel Gorman
    Acked-by: Christoph Lameter
    Signed-off-by: Lee Schermerhorn
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • include/linux/mmzone.h:640:22: warning: potentially expensive pointer subtraction

    Calculate the offset into the node_zones array rather than the index
    using casts to (char *) and comparing against the index * sizeof(struct zone).

    On X86_32 this saves a sar, but code size increases by one byte per
    is_highmem() use due to 32-bit cmps rather than 16 bit cmps.

    Before:
    207: 2b 80 8c 07 00 00 sub 0x78c(%eax),%eax
    20d: c1 f8 0b sar $0xb,%eax
    210: 83 f8 02 cmp $0x2,%eax
    213: 74 16 je 22b
    215: 83 f8 03 cmp $0x3,%eax
    218: 0f 85 8f 00 00 00 jne 2ad
    21e: 83 3d 00 00 00 00 02 cmpl $0x2,0x0
    225: 0f 85 82 00 00 00 jne 2ad
    22b: 64 a1 00 00 00 00 mov %fs:0x0,%eax

    After:
    207: 2b 80 8c 07 00 00 sub 0x78c(%eax),%eax
    20d: 3d 00 10 00 00 cmp $0x1000,%eax
    212: 74 18 je 22c
    214: 3d 00 18 00 00 cmp $0x1800,%eax
    219: 0f 85 8f 00 00 00 jne 2ae
    21f: 83 3d 00 00 00 00 02 cmpl $0x2,0x0
    226: 0f 85 82 00 00 00 jne 2ae
    22c: 64 a1 00 00 00 00 mov %fs:0x0,%eax
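
    The resulting helper computes a byte offset once and compares it against
    constants; roughly:

    static inline int is_highmem(struct zone *zone)
    {
            int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones;

            return zone_off == ZONE_HIGHMEM * sizeof(*zone) ||
                   (zone_off == ZONE_MOVABLE * sizeof(*zone) &&
                    zone_movable_is_highmem());
    }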

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Harvey Harrison
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Harvey Harrison
     

22 Apr, 2008

1 commit


06 Feb, 2008

1 commit

    We have repeatedly discussed whether the cold pages still have a point.
    There is one way to join the two lists: use a single list and put the cold
    pages at the end and the hot pages at the beginning. That way a single
    list can serve both types of allocations.

    The discussion of the RFC for this and Mel's measurements indicate that
    there may not be too much of a point left to having separate lists for
    hot and cold pages (see http://marc.info/?t=119492914200001&r=1&w=2).
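
    The free path under this scheme reduces to choosing which end of the
    single list to use; a sketch:

    /* in free_hot_cold_page(): hot pages to the head, cold pages to the tail */
    if (cold)
            list_add_tail(&page->lru, &pcp->list);
    else
            list_add(&page->lru, &pcp->list);
    pcp->count++;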

    Signed-off-by: Christoph Lameter
    Cc: Mel Gorman
    Cc: Martin Bligh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

17 Oct, 2007

3 commits

  • Introduces new zone flag interface for testing and setting flags:

    int zone_test_and_set_flag(struct zone *zone, zone_flags_t flag)

    Instead of setting and clearing ZONE_RECLAIM_LOCKED each time shrink_zone()
    is called, this flag is tested and set before starting zone reclaim. Zone
    reclaim starts in __alloc_pages() when a zone's watermark fails and the
    system is in zone_reclaim_mode. If the zone is already in reclaim, there's
    no need to start again, so it is simply considered full for that allocation
    attempt.

    There is a change of behavior with regard to concurrent zone shrinking. It is
    now possible for try_to_free_pages() or kswapd to already be shrinking a
    particular zone when __alloc_pages() starts zone reclaim. In this case, it is
    possible for two concurrent threads to invoke shrink_zone() for a single zone.

    This change forbids a zone from being in zone reclaim twice, which was
    always the behavior, but allows concurrent try_to_free_pages() or kswapd
    shrinking when zone reclaim is starting.
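
    The new helper is an atomic bit operation on zone->flags; roughly:

    static inline int zone_test_and_set_flag(struct zone *zone, zone_flags_t flag)
    {
            /* non-zero return means the flag (e.g. ZONE_RECLAIM_LOCKED)
             * was already set */
            return test_and_set_bit(flag, &zone->flags);
    }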

    Cc: Andrea Arcangeli
    Cc: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • OOM killer synchronization should be done with zone granularity so that memory
    policy and cpuset allocations may have their corresponding zones locked and
    allow parallel kills for other OOM conditions that may exist elsewhere in the
    system. DMA allocations can be targeted at the zone level, which would not be
    possible if locking was done in nodes or globally.

    Synchronization shall be done with a variation of "trylocks." The goal is to
    put the current task to sleep and restart the failed allocation attempt later
    if the trylock fails. Otherwise, the OOM killer is invoked.

    Each zone in the zonelist that __alloc_pages() was called with is checked for
    the newly-introduced ZONE_OOM_LOCKED flag. If any zone has this flag present,
    the "trylock" to serialize the OOM killer fails and returns zero. Otherwise,
    all the zones have ZONE_OOM_LOCKED set and the try_set_zone_oom() function
    returns non-zero.

    Cc: Andrea Arcangeli
    Cc: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Convert the int all_unreclaimable member of struct zone to unsigned long
    flags. This can now be used to specify several different zone flags such as
    all_unreclaimable and reclaim_in_progress, which can now be removed and
    converted to a per-zone flag.

    Flags are set and cleared as follows:

    zone_set_flag(struct zone *zone, zone_flags_t flag)
    zone_clear_flag(struct zone *zone, zone_flags_t flag)

    Defines the first zone flags, ZONE_ALL_UNRECLAIMABLE and ZONE_RECLAIM_LOCKED,
    which have the same semantics as the old zone->all_unreclaimable and
    zone->reclaim_in_progress, respectively. Also converts all current users that
    set or clear either flag to use the new interface.

    Helper functions are defined to test the flags:

    int zone_is_all_unreclaimable(const struct zone *zone)
    int zone_is_reclaim_locked(const struct zone *zone)

    All flag operators are of the atomic variety because there are currently
    readers that are implemented that do not take zone->lock.
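
    Sketched, the interface amounts to atomic bit operations over the new
    flags word:

    static inline void zone_set_flag(struct zone *zone, zone_flags_t flag)
    {
            set_bit(flag, &zone->flags);
    }

    static inline void zone_clear_flag(struct zone *zone, zone_flags_t flag)
    {
            clear_bit(flag, &zone->flags);
    }

    static inline int zone_is_all_unreclaimable(const struct zone *zone)
    {
            return test_bit(ZONE_ALL_UNRECLAIMABLE, &zone->flags);
    }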

    [akpm@linux-foundation.org: add needed include]
    Cc: Andrea Arcangeli
    Acked-by: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes