22 Sep, 2009

2 commits

  • page_is_file_cache() has been used for both boolean checks and LRU
    arithmetic, which was always a bit weird.

    Now that page_lru_base_type() exists for LRU arithmetic, make
    page_is_file_cache() a real predicate function and adjust the
    boolean-using callsites to drop those pesky double negations.

    Signed-off-by: Johannes Weiner
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Instead of abusing page_is_file_cache() for LRU list index arithmetic, add
    another helper with a more appropriate name and convert the non-boolean
    users of page_is_file_cache() accordingly.

    This new helper gives the LRU base type a page is supposed to live on,
    inactive anon or inactive file.

    [hugh.dickins@tiscali.co.uk: convert del_page_from_lru() also]
    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Cc: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

09 Jan, 2009

2 commits

  • The inactive_anon_is_low() is called only vmscan. Then it can move to
    vmscan.c

    This patch doesn't have any functional change.

    Reviewd-by: KAMEZAWA Hiroyuki
    Acked-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • A big patch for changing memcg's LRU semantics.

    Now,
    - page_cgroup is linked to mem_cgroup's its own LRU (per zone).

    - LRU of page_cgroup is not synchronous with global LRU.

    - page and page_cgroup is one-to-one and statically allocated.

    - To find page_cgroup is on what LRU, you have to check pc->mem_cgroup as
    - lru = page_cgroup_zoneinfo(pc, nid_of_pc, zid_of_pc);

    - SwapCache is handled.

    And, when we handle LRU list of page_cgroup, we do following.

    pc = lookup_page_cgroup(page);
    lock_page_cgroup(pc); .....................(1)
    mz = page_cgroup_zoneinfo(pc);
    spin_lock(&mz->lru_lock);
    .....add to LRU
    spin_unlock(&mz->lru_lock);
    unlock_page_cgroup(pc);

    But (1) is spin_lock and we have to be afraid of dead-lock with zone->lru_lock.
    So, trylock() is used at (1), now. Without (1), we can't trust "mz" is correct.

    This is a trial to remove this dirty nesting of locks.
    This patch changes mz->lru_lock to be zone->lru_lock.
    Then, above sequence will be written as

    spin_lock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU
    mem_cgroup_add/remove/etc_lru() {
    pc = lookup_page_cgroup(page);
    mz = page_cgroup_zoneinfo(pc);
    if (PageCgroupUsed(pc)) {
    ....add to LRU
    }
    spin_lock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU

    This is much simpler.
    (*) We're safe even if we don't take lock_page_cgroup(pc). Because..
    1. When pc->mem_cgroup can be modified.
    - at charge.
    - at account_move().
    2. at charge
    the PCG_USED bit is not set before pc->mem_cgroup is fixed.
    3. at account_move()
    the page is isolated and not on LRU.

    Pros.
    - easy for maintenance.
    - memcg can make use of laziness of pagevec.
    - we don't have to duplicated LRU/Active/Unevictable bit in page_cgroup.
    - LRU status of memcg will be synchronized with global LRU's one.
    - # of locks are reduced.
    - account_move() is simplified very much.
    Cons.
    - may increase cost of LRU rotation.
    (no impact if memcg is not configured.)

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

20 Oct, 2008

6 commits

  • Several LRU manupuration function are not used now. So they can be
    removed.

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • When the system contains lots of mlocked or otherwise unevictable pages,
    the pageout code (kswapd) can spend lots of time scanning over these
    pages. Worse still, the presence of lots of unevictable pages can confuse
    kswapd into thinking that more aggressive pageout modes are required,
    resulting in all kinds of bad behaviour.

    Infrastructure to manage pages excluded from reclaim--i.e., hidden from
    vmscan. Based on a patch by Larry Woodman of Red Hat. Reworked to
    maintain "unevictable" pages on a separate per-zone LRU list, to "hide"
    them from vmscan.

    Kosaki Motohiro added the support for the memory controller unevictable
    lru list.

    Pages on the unevictable list have both PG_unevictable and PG_lru set.
    Thus, PG_unevictable is analogous to and mutually exclusive with
    PG_active--it specifies which LRU list the page is on.

    The unevictable infrastructure is enabled by a new mm Kconfig option
    [CONFIG_]UNEVICTABLE_LRU.

    A new function 'page_evictable(page, vma)' in vmscan.c tests whether or
    not a page may be evictable. Subsequent patches will add the various
    !evictable tests. We'll want to keep these tests light-weight for use in
    shrink_active_list() and, possibly, the fault path.

    To avoid races between tasks putting pages [back] onto an LRU list and
    tasks that might be moving the page from non-evictable to evictable state,
    the new function 'putback_lru_page()' -- inverse to 'isolate_lru_page()'
    -- tests the "evictability" of a page after placing it on the LRU, before
    dropping the reference. If the page has become unevictable,
    putback_lru_page() will redo the 'putback', thus moving the page to the
    unevictable list. This way, we avoid "stranding" evictable pages on the
    unevictable list.

    [akpm@linux-foundation.org: fix fallout from out-of-order merge]
    [riel@redhat.com: fix UNEVICTABLE_LRU and !PROC_PAGE_MONITOR build]
    [nishimura@mxp.nes.nec.co.jp: remove redundant mapping check]
    [kosaki.motohiro@jp.fujitsu.com: unevictable-lru-infrastructure: putback_lru_page()/unevictable page handling rework]
    [kosaki.motohiro@jp.fujitsu.com: kill unnecessary lock_page() in vmscan.c]
    [kosaki.motohiro@jp.fujitsu.com: revert migration change of unevictable lru infrastructure]
    [kosaki.motohiro@jp.fujitsu.com: revert to unevictable-lru-infrastructure-kconfig-fix.patch]
    [kosaki.motohiro@jp.fujitsu.com: restore patch failure of vmstat-unevictable-and-mlocked-pages-vm-events.patch]
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Debugged-by: Benjamin Kidwell
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • We avoid evicting and scanning anonymous pages for the most part, but
    under some workloads we can end up with most of memory filled with
    anonymous pages. At that point, we suddenly need to clear the referenced
    bits on all of memory, which can take ages on very large memory systems.

    We can reduce the maximum number of pages that need to be scanned by not
    taking the referenced state into account when deactivating an anonymous
    page. After all, every anonymous page starts out referenced, so why
    check?

    If an anonymous page gets referenced again before it reaches the end of
    the inactive list, we move it back to the active list.

    To keep the maximum amount of necessary work reasonable, we scale the
    active to inactive ratio with the size of memory, using the formula
    active:inactive ratio = sqrt(memory in GB * 10).

    Kswapd CPU use now seems to scale by the amount of pageout bandwidth,
    instead of by the amount of memory present in the system.

    [kamezawa.hiroyu@jp.fujitsu.com: fix OOM with memcg]
    [kamezawa.hiroyu@jp.fujitsu.com: memcg: lru scan fix]
    Signed-off-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Split the LRU lists in two, one set for pages that are backed by real file
    systems ("file") and one for pages that are backed by memory and swap
    ("anon"). The latter includes tmpfs.

    The advantage of doing this is that the VM will not have to scan over lots
    of anonymous pages (which we generally do not want to swap out), just to
    find the page cache pages that it should evict.

    This patch has the infrastructure and a basic policy to balance how much
    we scan the anon lists and how much we scan the file lists. The big
    policy changes are in separate patches.

    [lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset]
    [kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru]
    [kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page]
    [hugh@veritas.com: memcg swapbacked pages active]
    [hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED]
    [akpm@linux-foundation.org: fix /proc/vmstat units]
    [nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration]
    [kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo]
    [kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()]
    Signed-off-by: Rik van Riel
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Hugh Dickins
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Define page_file_cache() function to answer the question:
    is page backed by a file?

    Originally part of Rik van Riel's split-lru patch. Extracted to make
    available for other, independent reclaim patches.

    Moved inline function to linux/mm_inline.h where it will be needed by
    subsequent "split LRU" and "noreclaim" patches.

    Unfortunately this needs to use a page flag, since the PG_swapbacked state
    needs to be preserved all the way to the point where the page is last
    removed from the LRU. Trying to derive the status from other info in the
    page resulted in wrong VM statistics in earlier split VM patchsets.

    The total number of page flags in use on a 32 bit machine after this patch
    is 19.

    [akpm@linux-foundation.org: fix up out-of-order merge fallout]
    [hugh@veritas.com: splitlru: shmem_getpage SetPageSwapBacked sooner[
    Signed-off-by: Rik van Riel
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: MinChan Kim
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Currently we are defining explicit variables for the inactive and active
    list. An indexed array can be more generic and avoid repeating similar
    code in several places in the reclaim code.

    We are saving a few bytes in terms of code size:

    Before:

    text data bss dec hex filename
    4097753 573120 4092484 8763357 85b7dd vmlinux

    After:

    text data bss dec hex filename
    4097729 573120 4092484 8763333 85b7c5 vmlinux

    Having an easy way to add new lru lists may ease future work on the
    reclaim code.

    Signed-off-by: Rik van Riel
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Christoph Lameter
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

12 Feb, 2007

1 commit

  • The determination of the dirty ratio to determine writeback behavior is
    currently based on the number of total pages on the system.

    However, not all pages in the system may be dirtied. Thus the ratio is always
    too low and can never reach 100%. The ratio may be particularly skewed if
    large hugepage allocations, slab allocations or device driver buffers make
    large sections of memory not available anymore. In that case we may get into
    a situation in which f.e. the background writeback ratio of 40% cannot be
    reached anymore which leads to undesired writeback behavior.

    This patchset fixes that issue by determining the ratio based on the actual
    pages that may potentially be dirty. These are the pages on the active and
    the inactive list plus free pages.

    The problem with those counts has so far been that it is expensive to
    calculate these because counts from multiple nodes and multiple zones will
    have to be summed up. This patchset makes these counters ZVC counters. This
    means that a current sum per zone, per node and for the whole system is always
    available via global variables and not expensive anymore to calculate.

    The patchset results in some other good side effects:

    - Removal of the various functions that sum up free, active and inactive
    page counts

    - Cleanup of the functions that display information via the proc filesystem.

    This patch:

    The use of a ZVC for nr_inactive and nr_active allows a simplification of some
    counter operations. More ZVC functionality is used for sums etc in the
    following patches.

    [akpm@osdl.org: UP build fix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

22 Mar, 2006

1 commit

  • In the page release paths, we can be sure that nobody will mess with our
    page->flags because the refcount has dropped to 0. So no need for atomic
    operations here.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

19 Jan, 2006

1 commit

  • Migration code currently does not take a reference to target page
    properly, so between unlocking the pte and trying to take a new
    reference to the page with isolate_lru_page, anything could happen to
    it.

    Fix this by holding the pte lock until we get a chance to elevate the
    refcount.

    Other small cleanups while we're here.

    Signed-off-by: Nick Piggin
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

09 Jan, 2006

1 commit

  • This is the start of the `swap migration' patch series.

    Swap migration allows the moving of the physical location of pages between
    nodes in a numa system while the process is running. This means that the
    virtual addresses that the process sees do not change. However, the system
    rearranges the physical location of those pages.

    The main intent of page migration patches here is to reduce the latency of
    memory access by moving pages near to the processor where the process
    accessing that memory is running.

    The patchset allows a process to manually relocate the node on which its
    pages are located through the MF_MOVE and MF_MOVE_ALL options while
    setting a new memory policy.

    The pages of process can also be relocated from another process using the
    sys_migrate_pages() function call. Requires CAP_SYS_ADMIN. The migrate_pages
    function call takes two sets of nodes and moves pages of a process that are
    located on the from nodes to the destination nodes.

    Manual migration is very useful if for example the scheduler has relocated a
    process to a processor on a distant node. A batch scheduler or an
    administrator can detect the situation and move the pages of the process
    nearer to the new processor.

    sys_migrate_pages() could be used on non-numa machines as well, to force all
    of a particualr process's pages out to swap, if someone thinks that's useful.

    Larger installations usually partition the system using cpusets into sections
    of nodes. Paul has equipped cpusets with the ability to move pages when a
    task is moved to another cpuset. This allows automatic control over locality
    of a process. If a task is moved to a new cpuset then also all its pages are
    moved with it so that the performance of the process does not sink
    dramatically (as is the case today).

    Swap migration works by simply evicting the page. The pages must be faulted
    back in. The pages are then typically reallocated by the system near the node
    where the process is executing.

    For swap migration the destination of the move is controlled by the allocation
    policy. Cpusets set the allocation policy before calling sys_migrate_pages()
    in order to move the pages as intended.

    No allocation policy changes are performed for sys_migrate_pages(). This
    means that the pages may not faulted in to the specified nodes if no
    allocation policy was set by other means. The pages will just end up near the
    node where the fault occurred.

    There's another patch series in the pipeline which implements "direct
    migration".

    The direct migration patchset extends the migration functionality to avoid
    going through swap. The destination node of the relation is controllable
    during the actual moving of pages. The crutch of using the allocation policy
    to relocate is not necessary and the pages are moved directly to the target.
    Its also faster since swap is not used.

    And sys_migrate_pages() can then move pages directly to the specified node.
    Implement functions to isolate pages from the LRU and put them back later.

    This patch:

    An earlier implementation was provided by Hirokazu Takahashi
    and IWAMOTO Toshihiro for the
    memory hotplug project.

    From: Magnus

    This breaks out isolate_lru_page() and putpack_lru_page(). Needed for swap
    migration.

    Signed-off-by: Magnus Damm
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

17 Apr, 2005

1 commit

  • Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.

    Let it rip!

    Linus Torvalds