09 Jan, 2009

16 commits

  • Now, you can see the following even when swap accounting is enabled:

    1. Create groups 01 and 02.
    2. Allocate a "file" on tmpfs from a task under 01.
    3. Swap out the "file" (by memory pressure).
    4. Read the "file" from a task in group 02.
    5. The charge for the "file" is moved to group 02.

    This is not ideal behavior. It happens because SwapCache which was
    loaded by read-ahead is not taken into account.

    This is a patch to fix shmem's swapcache behavior (sketched below):
    - remove mem_cgroup_cache_charge_swapin().
    - add a SwapCache handler routine to mem_cgroup_cache_charge().
      By this, shmem's file cache is charged at add_to_page_cache()
      with GFP_NOWAIT.
    - pass the swapcache page to shrink_mem_cgroup.
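
    For illustration (not part of the patch text), the SwapCache branch in
    mem_cgroup_cache_charge() takes roughly this shape;
    mem_cgroup_charge_swapcache() is a hypothetical helper name here:

        int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
                                    gfp_t gfp_mask)
        {
                if (mem_cgroup_disabled())
                        return 0;
                /* SwapCache pages (e.g. read-ahead on swapped-out shmem) get
                 * their own handler instead of the removed
                 * mem_cgroup_cache_charge_swapin(). */
                if (PageSwapCache(page))
                        return mem_cgroup_charge_swapcache(page, mm, gfp_mask);
                return mem_cgroup_charge_common(page, mm, gfp_mask,
                                                MEM_CGROUP_CHARGE_TYPE_CACHE, NULL);
        }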

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Paul Menage
    Cc: Li Zefan
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • After the previous patch, mem_cgroup_try_charge() is no longer used by
    anyone, so we can remove it.

    Signed-off-by: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • Currently, memcg's inactive_ratio is calculated when the limit is set,
    because page_alloc.c does so and the current implementation is a
    straightforward port.

    However, memcg recently introduced the hierarchy feature. Under a
    hierarchy, the memory limit is decided not only by the current cgroup's
    memory.limit_in_bytes, but also by the parent's limit and the siblings'
    memory usage.

    The optimal inactive_ratio therefore changes frequently, so calculating
    it every time is better.
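
    A minimal sketch of the on-the-fly calculation (illustrative;
    mem_cgroup_nr_lru_pages() stands in for whatever helper sums the
    per-zone LRU counts, and the formula mirrors the global heuristic):

        static int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
        {
                unsigned long active, inactive, gb, inactive_ratio;

                active   = mem_cgroup_nr_lru_pages(memcg, LRU_ACTIVE_ANON);
                inactive = mem_cgroup_nr_lru_pages(memcg, LRU_INACTIVE_ANON);

                /* roughly sqrt(10 * size-in-GB), recomputed on every call
                 * instead of being cached at limit-set time */
                gb = (inactive + active) >> (30 - PAGE_SHIFT);
                inactive_ratio = gb ? int_sqrt(10 * gb) : 1;

                return inactive * inactive_ratio < active;
        }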

    Tested-by: KAMEZAWA Hiroyuki
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: KOSAKI Motohiro
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Now, get_scan_ratio() returns the correct value during memcg reclaim, so
    mem_cgroup_calc_reclaim() can be removed.

    With this, memcg reclaim gets the same anon/file reclaim-balancing
    capability as global reclaim.

    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Introduce the mem_cgroup_per_zone::reclaim_stat member and its
    statistics-collecting function.

    Now, get_scan_ratio() can calculate the correct value during memcg
    reclaim.

    [hugh@veritas.com: avoid reclaim_stat oops when disabled]
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Introduce mem_cgroup_zone_nr_pages(). It is called from the
    zone_nr_pages() helper function.

    This patch doesn't change any behavior.
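
    A sketch of the helper split described above (close to, but not
    necessarily identical to, the actual patch):

        static unsigned long zone_nr_pages(struct zone *zone,
                                           struct scan_control *sc,
                                           enum lru_list lru)
        {
                if (!scan_global_lru(sc))
                        /* per-memcg page count for this zone and LRU list */
                        return mem_cgroup_zone_nr_pages(sc->mem_cgroup, zone, lru);

                return zone_page_state(zone, NR_LRU_BASE + lru);
        }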

    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Acked-by: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • inactive_anon_is_low() is a key component of active/inactive anon
    balancing on reclaim. However, the current inactive_anon_is_low() only
    considers global reclaim.

    Therefore, we need the following ugly scan_global_lru() condition:

        if (lru == LRU_ACTIVE_ANON &&
            (!scan_global_lru(sc) || inactive_anon_is_low(zone))) {
                shrink_active_list(nr_to_scan, zone, sc, priority, file);
                return 0;
        }

    It causes memcg reclaim to always deactivate pages when shrink_list() is
    called. Introduce mem_cgroup_inactive_anon_is_low() to improve
    active/inactive anon balancing for memcg, as sketched below.
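
    A sketch of the resulting dispatch (close in shape to the patch, but
    treat it as illustrative):

        static int inactive_anon_is_low(struct zone *zone, struct scan_control *sc)
        {
                int low;

                if (scan_global_lru(sc))
                        low = inactive_anon_is_low_global(zone);
                else
                        low = mem_cgroup_inactive_anon_is_low(sc->mem_cgroup);
                return low;
        }

    shrink_list() can then test inactive_anon_is_low(zone, sc) directly,
    without the open-coded scan_global_lru() special case.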

    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Cc: Cyrill Gorcunov
    Cc: "Pekka Enberg"
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • My patch, memcg-fix-gfp_mask-of-callers-of-charge.patch, changed the
    gfp_mask of callers of charge to GFP_HIGHUSER_MOVABLE to show what will
    happen at memory reclaim.

    But in recent discussion it was NACKed because it looks ugly.

    This patch reverts that and adds some cleanup to the gfp_mask of the
    callers of charge. There is no behavior change, but it needs review
    before it creates conflicting hunks deeper in the queue.

    This patch also adds an explanation of the meaning of the gfp_mask
    passed to the charge functions in memcontrol.h.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Current mmotm has a new OOM function, pagefault_out_of_memory(). It was
    added to select a bad process rather than killing current.

    When memcg hits its limit and triggers OOM at page fault, this handler
    is called and system-wide OOM handling happens (meaning the kernel
    panics if panic_on_oom is true...).

    To avoid overkill, check memcg's recent behavior before starting a
    system-wide OOM.

    This patch also fixes the code to guarantee "don't account against a
    process with TIF_MEMDIE". This is necessary for smooth OOM handling.

    [akpm@linux-foundation.org: build fix]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Badari Pulavarty
    Cc: Jan Blunck
    Cc: Hirokazu Takahashi
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • mm_match_cgroup() calls cgroup_subsys_state().

    We must use rcu_read_lock() to protect cgroup_subsys_state().
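
    The fix amounts to doing the test inside an RCU read-side critical
    section; roughly (mem_cgroup_from_task() goes through
    cgroup_subsys_state() internally):

        static inline int mm_match_cgroup(struct mm_struct *mm,
                                          struct mem_cgroup *cgroup)
        {
                struct mem_cgroup *mem;
                int match;

                rcu_read_lock();
                mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
                match = (cgroup == mem);
                rcu_read_unlock();
                return match;
        }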

    Signed-off-by: Lai Jiangshan
    Cc: Paul Menage
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Pavel Emelyanov
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     
  • We check whether mem_cgroup is disabled by testing
    mem_cgroup_subsys.disabled. I think it now has more references than
    expected.

    Replacing

        if (mem_cgroup_subsys.disabled)

    with

        if (mem_cgroup_disabled())

    gives a cleaner look, I think.
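
    The wrapper itself is a trivial inline, roughly:

        static inline bool mem_cgroup_disabled(void)
        {
                if (mem_cgroup_subsys.disabled)
                        return true;
                return false;
        }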

    [kamezawa.hiroyu@jp.fujitsu.com: fix typo]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hirokazu Takahashi
     
  • A big patch changing memcg's LRU semantics.

    Now,
    - page_cgroup is linked to mem_cgroup's own LRU (per zone).

    - the LRU of page_cgroup is not synchronized with the global LRU.

    - page and page_cgroup are one-to-one and statically allocated.

    - To find which LRU a page_cgroup is on, you have to check pc->mem_cgroup:
          lru = page_cgroup_zoneinfo(pc, nid_of_pc, zid_of_pc);

    - SwapCache is handled.

    And, when we handle the LRU list of page_cgroup, we do the following:

        pc = lookup_page_cgroup(page);
        lock_page_cgroup(pc); .....................(1)
        mz = page_cgroup_zoneinfo(pc);
        spin_lock(&mz->lru_lock);
        .....add to LRU
        spin_unlock(&mz->lru_lock);
        unlock_page_cgroup(pc);

    But (1) is a spinlock and we have to be afraid of deadlock with
    zone->lru_lock. So, trylock() is used at (1) now. Without (1), we can't
    trust that "mz" is correct.

    This is a trial to remove this dirty nesting of locks.
    This patch changes mz->lru_lock to zone->lru_lock.
    Then, the above sequence can be written as

        spin_lock(&zone->lru_lock);   # in vmscan.c or swap.c via global LRU
        mem_cgroup_add/remove/etc_lru() {
                pc = lookup_page_cgroup(page);
                mz = page_cgroup_zoneinfo(pc);
                if (PageCgroupUsed(pc)) {
                        ....add to LRU
                }
        }
        spin_unlock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU

    This is much simpler (see the sketch below).
    (*) We're safe even if we don't take lock_page_cgroup(pc), because:
    1. pc->mem_cgroup can only be modified
       - at charge.
       - at account_move().
    2. At charge,
       the PCG_USED bit is not set before pc->mem_cgroup is fixed.
    3. At account_move(),
       the page is isolated and not on any LRU.

    Pros:
    - easy to maintain.
    - memcg can make use of the laziness of pagevec.
    - we don't have to duplicate the LRU/Active/Unevictable bits in
      page_cgroup.
    - memcg's LRU status will be synchronized with the global LRU's.
    - the number of locks is reduced.
    - account_move() is simplified very much.
    Cons:
    - may increase the cost of LRU rotation.
      (no impact if memcg is not configured.)
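
    A sketch of one of the resulting hooks, called with zone->lru_lock
    already held by the global LRU code (simplified; the statistics macro
    name is illustrative):

        void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru)
        {
                struct page_cgroup *pc;
                struct mem_cgroup_per_zone *mz;

                if (mem_cgroup_disabled())
                        return;
                pc = lookup_page_cgroup(page);
                if (!PageCgroupUsed(pc))  /* safe without lock_page_cgroup(); see above */
                        return;
                mz = page_cgroup_zoneinfo(pc);
                MEM_CGROUP_ZSTAT(mz, lru) += 1;      /* per-zone, per-LRU counter */
                list_add(&pc->lru, &mz->lists[lru]);
        }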

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This patch implements a per-cgroup limit on the usage of memory+swap.
    As for SwapCache, double counting of swap-cache and swap-entry is
    avoided.

    The mem+swap controller works as follows:
    - memory usage is limited by memory.limit_in_bytes.
    - memory + swap usage is limited by memory.memsw.limit_in_bytes.

    This has the following benefits:
    - A user can limit the total resource usage of mem+swap.

    Without this, because the memory resource controller doesn't take care
    of swap usage, a process can exhaust all the swap (e.g. by a memory
    leak). We can avoid this case.

    And swap is a shared resource, but it cannot be reclaimed (returned to
    memory) until it is used again. This characteristic can be trouble when
    memory is divided into several parts by cpuset or memcg.
    Assume groups A and B. After some applications execute, the system can
    end up as:

    Group A -- very large free memory space but occupies 99% of swap.
    Group B -- under memory shortage but cannot use swap... it's nearly full.

    The ability to set an appropriate swap limit for each group is required.

    Maybe someone wonders "why not swap, but mem+swap?"

    - The global LRU (kswapd) can swap out arbitrary pages. Swap-out means
      moving the account from memory to swap... there is no change in the
      usage of mem+swap.

    In other words, when we want to limit the usage of swap without
    affecting the global LRU, a mem+swap limit is better than just limiting
    swap.

    The accounting target information is stored in swap_cgroup, a
    per-swap-entry record.

    Charging is done as follows (see the sketch after this list):

    map
    - charge page and memsw.

    unmap
    - uncharge page/memsw if not SwapCache.

    swap-out (__delete_from_swap_cache)
    - uncharge page.
    - record mem_cgroup information in swap_cgroup.

    swap-in (do_swap_page)
    - charged as page and memsw.
      The record in swap_cgroup is cleared.
      memsw accounting is decremented.

    swap-free (swap_free())
    - if the swap entry is freed, memsw is uncharged by PAGE_SIZE.
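
    Sketched against the res_counter API (the two-argument
    res_counter_charge() form and the 'res'/'memsw' member names are
    simplifying assumptions), the "map" step above charges both counters in
    order:

        struct mem_cgroup *memcg;   /* the group being charged */
        int ret;

        ret = res_counter_charge(&memcg->res, PAGE_SIZE);       /* memory.limit_in_bytes */
        if (!ret && do_swap_account) {
                ret = res_counter_charge(&memcg->memsw, PAGE_SIZE);  /* memory.memsw.limit_in_bytes */
                if (ret)  /* over the mem+swap limit: roll back the memory charge */
                        res_counter_uncharge(&memcg->res, PAGE_SIZE);
        }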

    There are people who work in never-swap environments and consider swap
    to be something bad. For such people, this mem+swap controller extension
    is just overhead, and that overhead can be avoided by a config or boot
    option. (See Kconfig; details are not in this patch.)

    TODO:
    - maybe more optimization can be done in the swap-in path (but it is not
      very safe). For now we just do simple accounting at this stage.

    [nishimura@mxp.nes.nec.co.jp: make resize limit hold mutex]
    [hugh@veritas.com: memswap controller core swapcache fixes]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Config and control variable for the mem+swap controller.

    This patch adds CONFIG_CGROUP_MEM_RES_CTLR_SWAP
    (the memory resource controller swap extension).

    For accounting swap, it's obvious that we have to use additional memory
    to remember "who uses swap". This adds more overhead, so it's better to
    offer users a choice. This patch adds two ways to enable or disable the
    swap extension, as sketched below:
    - a CONFIG option
    - a boot option
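
    A sketch of the boot-option half (the parameter name and variable are
    illustrative, not quoted from the patch):

        #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
        /* on by default when the CONFIG option is set; a boot parameter turns it off */
        int do_swap_account __read_mostly = 1;

        static int __init disable_swap_account(char *s)
        {
                do_swap_account = 0;
                return 1;
        }
        __setup("noswapaccount", disable_swap_account);
        #endif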

    Reviewed-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Now, management of the "charge" under page migration is done in the
    following manner (assume we migrate page contents from oldpage to
    newpage):

    before
    - "newpage" is charged before migration.
    at success
    - "oldpage" is uncharged somewhere (unmap, radix-tree-replace).
    at failure
    - "newpage" is uncharged.
    - "oldpage" is charged if necessary (*1)

    But (*1) is not reliable... because of GFP_ATOMIC.

    This patch changes the behavior as follows, using charge/commit/cancel ops:

    before
    - charge PAGE_SIZE (no target page).
    success
    - commit the charge against "newpage".
    failure
    - commit the charge against "oldpage".
      (the PCG_USED bit works effectively to avoid double-counting)
    - if "oldpage" is obsolete, cancel the charge of PAGE_SIZE.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • There is a small race in do_swap_page(). When the page being swapped in
    is charged, its mapcount can be greater than 0. But, at the same time,
    some process sharing it can call unmap, bringing the mapcount from 1 to
    0, and the page is uncharged.

        CPUA                          CPUB
                                      mapcount == 1.
        (1) charge if mapcount==0     zap_pte_range()
                                      (2) mapcount 1 => 0.
                                      (3) uncharge(). (success)
        (4) set page's rmap()
            mapcount 0 => 1

    Then, this swap page's account is leaked.

    To fix this, I added a new interface:
    - charge
      account to the res_counter by PAGE_SIZE and try to free pages if
      necessary.
    - commit
      register the page_cgroup and add it to the LRU if necessary.
    - cancel
      uncharge PAGE_SIZE because of do_swap_page failure.

        CPUA
        (1) charge (always)
        (2) set page's rmap (mapcount > 0)
        (3) after set_pte(), commit the charge if it turns out to be needed.

    This protocol uses the PCG_USED bit on page_cgroup to avoid
    over-accounting. The usual mem_cgroup_charge_common() does charge ->
    commit at a time.
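
    In do_swap_page() terms, the ordering becomes roughly the following
    (names simplified; the real calls are the try/commit/cancel entry
    points this patch introduces):

        struct mem_cgroup *memcg = NULL;

        /* (1) charge unconditionally, before the pte is set up */
        if (mem_cgroup_try_charge(mm, GFP_KERNEL, &memcg))
                goto out_fail;

        /* (2) set up the rmap and the pte: page_add_anon_rmap(), set_pte_at() */

        /* (3) commit: PCG_USED prevents double accounting if the page was
         * already charged; on an error path, the cancel variant gives back
         * the PAGE_SIZE taken in (1). */
        mem_cgroup_commit_charge_swapin(page, memcg);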

    And this patch also adds the following functions to clarify all charges:

    - mem_cgroup_newpage_charge() .... replacement for mem_cgroup_charge(),
      called against newly allocated anon pages.

    - mem_cgroup_charge_migrate_fixup()
      called only from remove_migration_ptes().
      We'll have to rewrite this later (this patch just keeps the old
      behavior). This function will be removed by an additional patch to
      make migration clearer.

    Good for clarifying "what we do".

    Then, we have the following 4 charge points:
    - newpage
    - swap-in
    - add-to-cache
    - migration

    [akpm@linux-foundation.org: add missing inline directives to stubs]
    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

20 Oct, 2008

4 commits

  • Allocate all page_cgroup structures at boot and remove the page_cgroup
    pointer from struct page. This patch adds an interface:

        struct page_cgroup *lookup_page_cgroup(struct page *page);

    All of FLATMEM/DISCONTIGMEM/SPARSEMEM and MEMORY_HOTPLUG are supported.
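
    For the flat-memory case the lookup is roughly an offset into a per-node
    array (a sketch; the SPARSEMEM variant goes through the mem_section
    instead):

        struct page_cgroup *lookup_page_cgroup(struct page *page)
        {
                unsigned long pfn = page_to_pfn(page);
                unsigned long offset;
                struct page_cgroup *base;

                base = NODE_DATA(page_to_nid(page))->node_page_cgroup;
                if (unlikely(!base))
                        return NULL;
                offset = pfn - NODE_DATA(page_to_nid(page))->node_start_pfn;
                return base + offset;
        }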

    Removing the page_cgroup pointer reduces memory usage by
    - 4 bytes per page (32-bit), or
    - 8 bytes per page (64-bit)
    when the memory controller is disabled (even if it is configured).

    On a usual 8GB x86-32 server, this saves 8MB of NORMAL_ZONE memory.
    On my x86-64 server with 48GB of memory, this saves 96MB of memory.
    I think this reduction makes sense.

    With pre-allocation, kmalloc/kfree in charge/uncharge are removed.
    This means
    - we don't have to be afraid of kmalloc failure.
      (this can happen because of the gfp_mask type.)
    - we can avoid calling kmalloc/kfree.
    - we can avoid allocating tons of small objects which can become
      fragmented.
    - we know what amount of memory will be used for this extra LRU handling.

    I added printk messages:

        "allocated %ld bytes of page_cgroup"
        "please try cgroup_disable=memory option if you don't want"

    which are hopefully informative enough for users.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • When the system contains lots of mlocked or otherwise unevictable pages,
    the pageout code (kswapd) can spend lots of time scanning over these
    pages. Worse still, the presence of lots of unevictable pages can confuse
    kswapd into thinking that more aggressive pageout modes are required,
    resulting in all kinds of bad behaviour.

    Infrastructure to manage pages excluded from reclaim--i.e., hidden from
    vmscan. Based on a patch by Larry Woodman of Red Hat. Reworked to
    maintain "unevictable" pages on a separate per-zone LRU list, to "hide"
    them from vmscan.

    Kosaki Motohiro added the support for the memory controller unevictable
    lru list.

    Pages on the unevictable list have both PG_unevictable and PG_lru set.
    Thus, PG_unevictable is analogous to and mutually exclusive with
    PG_active--it specifies which LRU list the page is on.

    The unevictable infrastructure is enabled by a new mm Kconfig option
    [CONFIG_]UNEVICTABLE_LRU.

    A new function 'page_evictable(page, vma)' in vmscan.c tests whether or
    not a page may be evictable. Subsequent patches will add the various
    !evictable tests. We'll want to keep these tests light-weight for use in
    shrink_active_list() and, possibly, the fault path.

    To avoid races between tasks putting pages [back] onto an LRU list and
    tasks that might be moving the page from non-evictable to evictable state,
    the new function 'putback_lru_page()' -- inverse to 'isolate_lru_page()'
    -- tests the "evictability" of a page after placing it on the LRU, before
    dropping the reference. If the page has become unevictable,
    putback_lru_page() will redo the 'putback', thus moving the page to the
    unevictable list. This way, we avoid "stranding" evictable pages on the
    unevictable list.

    [akpm@linux-foundation.org: fix fallout from out-of-order merge]
    [riel@redhat.com: fix UNEVICTABLE_LRU and !PROC_PAGE_MONITOR build]
    [nishimura@mxp.nes.nec.co.jp: remove redundant mapping check]
    [kosaki.motohiro@jp.fujitsu.com: unevictable-lru-infrastructure: putback_lru_page()/unevictable page handling rework]
    [kosaki.motohiro@jp.fujitsu.com: kill unnecessary lock_page() in vmscan.c]
    [kosaki.motohiro@jp.fujitsu.com: revert migration change of unevictable lru infrastructure]
    [kosaki.motohiro@jp.fujitsu.com: revert to unevictable-lru-infrastructure-kconfig-fix.patch]
    [kosaki.motohiro@jp.fujitsu.com: restore patch failure of vmstat-unevictable-and-mlocked-pages-vm-events.patch]
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Debugged-by: Benjamin Kidwell
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Split the LRU lists in two, one set for pages that are backed by real file
    systems ("file") and one for pages that are backed by memory and swap
    ("anon"). The latter includes tmpfs.

    The advantage of doing this is that the VM will not have to scan over lots
    of anonymous pages (which we generally do not want to swap out), just to
    find the page cache pages that it should evict.

    This patch has the infrastructure and a basic policy to balance how much
    we scan the anon lists and how much we scan the file lists. The big
    policy changes are in separate patches.
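
    The split shows up directly in the LRU indexing: which list a page goes
    to is derived from whether it is file-backed or swap-backed. A sketch,
    ignoring the unevictable list added elsewhere in this series:

        static inline enum lru_list page_lru(struct page *page)
        {
                enum lru_list lru = LRU_BASE;

                if (PageActive(page))
                        lru += LRU_ACTIVE;      /* offset to the active lists */
                if (page_is_file_cache(page))
                        lru += LRU_FILE;        /* offset to the file lists */
                return lru;
        }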

    [lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset]
    [kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru]
    [kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page]
    [hugh@veritas.com: memcg swapbacked pages active]
    [hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED]
    [akpm@linux-foundation.org: fix /proc/vmstat units]
    [nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration]
    [kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo]
    [kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()]
    Signed-off-by: Rik van Riel
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Hugh Dickins
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Currently we define explicit variables for the inactive and active
    lists. An indexed array (sketched below) can be more generic and avoids
    repeating similar code in several places in the reclaim code.

    We are saving a few bytes in terms of code size:

    Before:

    text data bss dec hex filename
    4097753 573120 4092484 8763357 85b7dd vmlinux

    After:

    text data bss dec hex filename
    4097729 573120 4092484 8763333 85b7c5 vmlinux

    Having an easy way to add new lru lists may ease future work on the
    reclaim code.
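
    A sketch of the shape of the change (the enum is later extended by the
    anon/file split and the unevictable list):

        enum lru_list {
                LRU_BASE,
                LRU_INACTIVE = LRU_BASE,        /* was zone->inactive_list */
                LRU_ACTIVE,                     /* was zone->active_list   */
                NR_LRU_LISTS,
        };

        struct zone_lru {
                struct list_head list;
                unsigned long nr_scan;
        };
        /* in struct zone:  struct zone_lru lru[NR_LRU_LISTS]; */

        #define for_each_lru(l) for (l = 0; l < NR_LRU_LISTS; l++)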

    Signed-off-by: Rik van Riel
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Christoph Lameter
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

26 Jul, 2008

3 commits

  • A new call, mem_cgroup_shrink_usage(), is added for shmem handling,
    replacing a non-standard usage of mem_cgroup_charge/uncharge.

    Now, shmem calls mem_cgroup_charge() just to reclaim some pages from a
    mem_cgroup. In general, shmem is used by some process group, not as a
    global resource (like file caches). So, it's reasonable to reclaim
    pages from the mem_cgroup where shmem is mainly used.

    [hugh@veritas.com: shmem_getpage release page sooner]
    [hugh@veritas.com: mem_cgroup_shrink_usage css_put]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Cc: Li Zefan
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Cc: David Rientjes
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • memcg: performance improvements

    Patch description:
    1/5 ... remove refcnt from page_cgroup patch (shmem handling is fixed)
    2/5 ... swapcache handling patch
    3/5 ... add helper function for shmem's memory reclaim patch
    4/5 ... optimize by likely/unlikely patch
    5/5 ... remove redundant check patch (shmem handling is fixed.)

    Unix bench result.

    == 2.6.26-rc2-mm1 + memory resource controller
    Execl Throughput 2915.4 lps (29.6 secs, 3 samples)
    C Compiler Throughput 1019.3 lpm (60.0 secs, 3 samples)
    Shell Scripts (1 concurrent) 5796.0 lpm (60.0 secs, 3 samples)
    Shell Scripts (8 concurrent) 1097.7 lpm (60.0 secs, 3 samples)
    Shell Scripts (16 concurrent) 565.3 lpm (60.0 secs, 3 samples)
    File Read 1024 bufsize 2000 maxblocks 1022128.0 KBps (30.0 secs, 3 samples)
    File Write 1024 bufsize 2000 maxblocks 544057.0 KBps (30.0 secs, 3 samples)
    File Copy 1024 bufsize 2000 maxblocks 346481.0 KBps (30.0 secs, 3 samples)
    File Read 256 bufsize 500 maxblocks 319325.0 KBps (30.0 secs, 3 samples)
    File Write 256 bufsize 500 maxblocks 148788.0 KBps (30.0 secs, 3 samples)
    File Copy 256 bufsize 500 maxblocks 99051.0 KBps (30.0 secs, 3 samples)
    File Read 4096 bufsize 8000 maxblocks 2058917.0 KBps (30.0 secs, 3 samples)
    File Write 4096 bufsize 8000 maxblocks 1606109.0 KBps (30.0 secs, 3 samples)
    File Copy 4096 bufsize 8000 maxblocks 854789.0 KBps (30.0 secs, 3 samples)
    Dc: sqrt(2) to 99 decimal places 126145.2 lpm (30.0 secs, 3 samples)

    INDEX VALUES
    TEST BASELINE RESULT INDEX

    Execl Throughput 43.0 2915.4 678.0
    File Copy 1024 bufsize 2000 maxblocks 3960.0 346481.0 875.0
    File Copy 256 bufsize 500 maxblocks 1655.0 99051.0 598.5
    File Copy 4096 bufsize 8000 maxblocks 5800.0 854789.0 1473.8
    Shell Scripts (8 concurrent) 6.0 1097.7 1829.5
    =========
    FINAL SCORE 991.3

    == 2.6.26-rc2-mm1 + this set ==
    Execl Throughput 3012.9 lps (29.9 secs, 3 samples)
    C Compiler Throughput 981.0 lpm (60.0 secs, 3 samples)
    Shell Scripts (1 concurrent) 5872.0 lpm (60.0 secs, 3 samples)
    Shell Scripts (8 concurrent) 1120.3 lpm (60.0 secs, 3 samples)
    Shell Scripts (16 concurrent) 578.0 lpm (60.0 secs, 3 samples)
    File Read 1024 bufsize 2000 maxblocks 1003993.0 KBps (30.0 secs, 3 samples)
    File Write 1024 bufsize 2000 maxblocks 550452.0 KBps (30.0 secs, 3 samples)
    File Copy 1024 bufsize 2000 maxblocks 347159.0 KBps (30.0 secs, 3 samples)
    File Read 256 bufsize 500 maxblocks 314644.0 KBps (30.0 secs, 3 samples)
    File Write 256 bufsize 500 maxblocks 151852.0 KBps (30.0 secs, 3 samples)
    File Copy 256 bufsize 500 maxblocks 101000.0 KBps (30.0 secs, 3 samples)
    File Read 4096 bufsize 8000 maxblocks 2033256.0 KBps (30.0 secs, 3 samples)
    File Write 4096 bufsize 8000 maxblocks 1611814.0 KBps (30.0 secs, 3 samples)
    File Copy 4096 bufsize 8000 maxblocks 847979.0 KBps (30.0 secs, 3 samples)
    Dc: sqrt(2) to 99 decimal places 128148.7 lpm (30.0 secs, 3 samples)

    INDEX VALUES
    TEST BASELINE RESULT INDEX

    Execl Throughput 43.0 3012.9 700.7
    File Copy 1024 bufsize 2000 maxblocks 3960.0 347159.0 876.7
    File Copy 256 bufsize 500 maxblocks 1655.0 101000.0 610.3
    File Copy 4096 bufsize 8000 maxblocks 5800.0 847979.0 1462.0
    Shell Scripts (8 concurrent) 6.0 1120.3 1867.2
    =========
    FINAL SCORE 1004.6

    This patch:

    Remove the refcnt from page_cgroup.

    After this,

    * A page is charged only when !page_mapped() && no page_cgroup is assigned.
        * Anon page is newly mapped.
        * File page is added to mapping->tree.

    * A page is uncharged only when
        * Anon page is fully unmapped.
        * File page is removed from LRU.

    There is no change in behavior from the user's point of view.

    This patch also removes unnecessary calls in rmap.c which were used only
    for refcnt management.

    [akpm@linux-foundation.org: fix warning]
    [hugh@veritas.com: fix shmem_unuse_inode charging]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Cc: Li Zefan
    Cc: Hugh Dickins
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Cc: David Rientjes
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This patch changes page migration under the memory controller to use a
    different algorithm. (Thanks to Christoph for the new idea.)

    Before:
    - page_cgroup is migrated from the old page to the new page.
    After:
    - the new page is accounted; no reuse of page_cgroup.

    Pros:

    - We can avoid complicated lock dependencies and races in migration.

    Cons:

    - a new parameter to mem_cgroup_charge_common().

    - mem_cgroup_getref() is added for handling the ref_cnt ping-pong.

    This version simplifies the complicated lock dependency in page
    migration under the memory resource controller.

    The new refcnt sequence is as follows:

    a mapped page:
        prepare_migration()  ..... +1 to NEW page
        try_to_unmap()       ..... all refs to OLD page are gone.
        move_pages()         ..... +1 to NEW page if page cache.
        remap...             ..... all refs from *map* are added to NEW one.
        end_migration()      ..... -1 to NEW page.

    The page's mapcount + (page_is_cache) refs are added to the NEW one.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Cc: Li Zefan
    Cc: YAMAMOTO Takashi
    Cc: Hugh Dickins
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

29 Apr, 2008

1 commit

  • Remove the mem_cgroup member from mm_struct and instead add an owner.

    This approach was suggested by Paul Menage. The advantage of this
    approach is that, once mm->owner is known, the cgroup can be determined
    using the subsystem id. It also allows several control groups that are
    virtually grouped by mm_struct to exist independently of the memory
    controller, i.e., without adding a mem_cgroup pointer for each
    controller to mm_struct.
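
    As an illustration of the lookup this enables (a sketch, with the memory
    controller as the example subsystem):

        static struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
        {
                return container_of(task_subsys_state(p, mem_cgroup_subsys_id),
                                    struct mem_cgroup, css);
        }

        /* any subsystem can now do, under rcu_read_lock():
         *      memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
         */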

    A new config option CONFIG_MM_OWNER is added and the memory resource
    controller selects this config option.

    This patch also adds cgroup callbacks to notify subsystems when mm->owner
    changes. The mm_cgroup_changed callback is called with the task_lock() of
    the new task held and is called just prior to changing the mm->owner.

    I am indebted to Paul Menage for the several reviews of this patchset and
    helping me make it lighter and simpler.

    This patch was tested on a powerpc box; it was compiled with the
    MM_OWNER config both turned on and off.

    After the thread group leader exits, it's moved to init_css_set by
    cgroup_exit(), thus all future charges from running threads will be
    redirected to the init_css_set's subsystem.

    Signed-off-by: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Hugh Dickins
    Cc: Sudhir Kumar
    Cc: YAMAMOTO Takashi
    Cc: Hirokazu Takahashi
    Cc: David Rientjes ,
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Pekka Enberg
    Reviewed-by: Paul Menage
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     

05 Mar, 2008

5 commits

  • Nothing uses mem_cgroup_uncharge apart from mem_cgroup_uncharge_page, (a
    trivial wrapper around it) and mem_cgroup_end_migration (which does the same
    as mem_cgroup_uncharge_page). And it often ends up having to lock just to let
    its caller unlock. Remove it (but leave the silly locking until a later
    patch).

    Moved mem_cgroup_cache_charge next to mem_cgroup_charge in memcontrol.h.

    Signed-off-by: Hugh Dickins
    Cc: David Rientjes
    Acked-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hirokazu Takahashi
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Replace free_hot_cold_page's VM_BUG_ON(page_get_page_cgroup(page)) by a "Bad
    page state" and clear: most users don't have CONFIG_DEBUG_VM on, and if it
    were set here, it'd likely cause corruption when the page is reused.

    Don't use page_assign_page_cgroup to clear it: that should be private to
    memcontrol.c, and always called with the lock taken; and memmap_init_zone
    doesn't need it either - like page->mapping and other pointers throughout the
    kernel, Linux assumes pointers in zeroed structures are NULL pointers.

    Instead use page_reset_bad_cgroup, added to memcontrol.h for this only.

    Signed-off-by: Hugh Dickins
    Cc: David Rientjes
    Acked-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hirokazu Takahashi
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Each caller of mem_cgroup_move_lists is having to use page_get_page_cgroup:
    it's more convenient if it acts upon the page itself not the page_cgroup; and
    in a later patch this becomes important to handle within memcontrol.c.

    Signed-off-by: Hugh Dickins
    Cc: David Rientjes
    Acked-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hirokazu Takahashi
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • vm_match_cgroup is a perverse name for a macro to match mm with cgroup: rename
    it mm_match_cgroup, matching mm_init_cgroup and mm_free_cgroup.

    Signed-off-by: Hugh Dickins
    Acked-by: David Rientjes
    Acked-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hirokazu Takahashi
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Rename Memory Controller to Memory Resource Controller. Reflect the same
    changes in the CONFIG definition for the Memory Resource Controller. Group
    together the config options for Resource Counters and Memory Resource
    Controller.

    Signed-off-by: Balbir Singh
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     

24 Feb, 2008

1 commit

  • Fix build failure on sparc:

    In file included from include/linux/mm.h:39,
    from include/linux/memcontrol.h:24,
    from include/linux/swap.h:8,
    from include/linux/suspend.h:7,
    from init/do_mounts.c:6:
    include/asm/pgtable.h:344: warning: parameter names (without types) in function declaration
    include/asm/pgtable.h:345: warning: parameter names (without types) in function declaration
    include/asm/pgtable.h:346: error: expected '=', ',', ';', 'asm' or '__attribute__' before '___f___swp_entry'

    viro sayeth:

    I've run allmodconfig builds on a bunch of targets, FWIW (essentially
    the same patch). Note that these includes are a recent addition caused
    by an added inline function that has since become a define. So while I
    agree with your comments in general, in _this_ case it's pretty safe.

    The commit that had done it is 3062fc67dad01b1d2a15d58c709eff946389eca4
    ("memcontrol: move mm_cgroup to header file") and the switch to #define
    is in commit 60c12b1202a60eabb1c61317e5d2678fcea9893f ("memcontrol: add
    vm_match_cgroup()") (BTW, that probably warranted mentioning in the
    changelog of the latter).

    Cc: Adrian Bunk
    Cc: Robert Reif
    Signed-off-by: David Rientjes
    Cc: "David S. Miller"
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

10 Feb, 2008

1 commit

  • mm_cgroup() is exclusively used to test whether an mm's mem_cgroup pointer
    is pointing to a specific cgroup. Instead of returning the pointer, we can
    just do the test itself in a new macro:

    vm_match_cgroup(mm, cgroup)

    returns non-zero if the mm's mem_cgroup points to cgroup. Otherwise it
    returns zero.
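
    The macro is essentially a one-liner; roughly (at this point
    mm->mem_cgroup was still a direct pointer):

        #define vm_match_cgroup(mm, cgroup) \
                ((cgroup) == rcu_dereference((mm)->mem_cgroup))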

    Signed-off-by: David Rientjes
    Cc: Balbir Singh
    Cc: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

08 Feb, 2008

9 commits

  • Based on the discussion at http://lkml.org/lkml/2007/12/20/383, it was felt
    that control_type might not be a good thing to implement right away. We
    can add this flexibility at a later point when required.

    Signed-off-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • … pages to be scanned per cgroup

    Define a function for calculating the number of scan targets on each zone/LRU.

    Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Herbert Poetzl <herbert@13thfloor.at>
    Cc: Kirill Korotaev <dev@sw.ru>
    Cc: Nick Piggin <nickpiggin@yahoo.com.au>
    Cc: Paul Menage <menage@google.com>
    Cc: Pavel Emelianov <xemul@openvz.org>
    Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    KAMEZAWA Hiroyuki
     
  • Functions to remember reclaim priority per cgroup (as zone->prev_priority)

    [akpm@linux-foundation.org: build fixes]
    [akpm@linux-foundation.org: more build fixes]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: "Eric W. Biederman"
    Cc: Balbir Singh
    Cc: David Rientjes
    Cc: Herbert Poetzl
    Cc: Kirill Korotaev
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Pavel Emelianov
    Cc: Peter Zijlstra
    Cc: Vaidyanathan Srinivasan
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • …ve imbalance per cgroup

    Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Herbert Poetzl <herbert@13thfloor.at>
    Cc: Kirill Korotaev <dev@sw.ru>
    Cc: Nick Piggin <nickpiggin@yahoo.com.au>
    Cc: Paul Menage <menage@google.com>
    Cc: Pavel Emelianov <xemul@openvz.org>
    Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    KAMEZAWA Hiroyuki
     
  • Define a function for calculating mapped_ratio in a memory cgroup.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: "Eric W. Biederman"
    Cc: Balbir Singh
    Cc: David Rientjes
    Cc: Herbert Poetzl
    Cc: Kirill Korotaev
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Pavel Emelianov
    Cc: Peter Zijlstra
    Cc: Vaidyanathan Srinivasan
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • While using the memory control cgroup, page migration under it works as
    follows:
    ==
    1. uncharge all refs at try_to_unmap.
    2. charge refs again at remove_migration_ptes().
    ==
    This is simple but has the following problems:
    ==
    The page is uncharged and charged back again if *mapped*.
    - This means that the cgroup before migration can be different from the
      one after migration.
    - If the page is not mapped but charged as page cache, the charge is
      just ignored (because it is not mapped, it will not be uncharged
      before migration). This is a memory leak.
    ==
    This patch tries to keep the memory cgroup across page migration by
    taking one extra refcnt during it. 3 functions are added, as sketched
    below.

    mem_cgroup_prepare_migration() --- increase refcnt of page->page_cgroup
    mem_cgroup_end_migration() --- decrease refcnt of page->page_cgroup
    mem_cgroup_page_migration() --- copy page->page_cgroup from the old page
                                    to the new page.

    During migration
    - the old page is under PG_locked.
    - the new page is under PG_locked, too.
    - neither the old page nor the new page is on the LRU.

    These 3 facts guarantee that page_cgroup migration has no race.

    Tested and worked well on an x86_64 fake-NUMA box.
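
    A sketch of how the three hooks bracket a migration; the middle step is
    a placeholder, and the exact success/failure call sites are simplified:

        /* 'page', 'newpage' and 'rc' come from the surrounding migration code */
        mem_cgroup_prepare_migration(page);        /* +1 refcnt on page->page_cgroup */

        rc = move_page_contents(page, newpage);    /* placeholder for the copy/remap */

        if (rc == 0)
                mem_cgroup_page_migration(page, newpage);  /* move page_cgroup to newpage */

        mem_cgroup_end_migration(page);            /* drop the extra refcnt */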

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Creates a helper function to return non-zero if a task is a member of a
    memory controller:

        int task_in_mem_cgroup(const struct task_struct *task,
                               const struct mem_cgroup *mem);

    When the OOM killer is constrained by the memory controller, the
    exclusion of tasks that are not members of that controller was
    previously misplaced and appeared in the badness scoring function. It
    should instead be done during the tasklist scan in select_bad_process().
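
    The exclusion then lives in the tasklist scan; roughly:

        /* in select_bad_process(); 'mem' is the constraining memory
         * controller, or NULL for a global OOM */
        struct task_struct *g, *p;

        do_each_thread(g, p) {
                if (mem && !task_in_mem_cgroup(p, mem))
                        continue;   /* task is outside the OOMing controller */

                /* ... normal badness() scoring of remaining candidates ... */
        } while_each_thread(g, p);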

    [akpm@linux-foundation.org: build fix]
    Cc: Christoph Lameter
    Cc: Balbir Singh
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Inline functions must precede their use, so mm_cgroup() should be
    defined in linux/memcontrol.h.

    include/linux/memcontrol.h:48: warning: 'mm_cgroup' declared inline after being called
    include/linux/memcontrol.h:48: warning: previous declaration of 'mm_cgroup' was here

    [akpm@linux-foundation.org: build fix]
    [akpm@linux-foundation.org: nuther build fix]
    Cc: Balbir Singh
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Nick Piggin pointed out that the swap cache and page cache addition
    routines could be called from non-GFP_KERNEL contexts. This patch makes
    the charging routine aware of the gfp context. Charging might fail if
    the cgroup is over its limit, in which case a suitable error is
    returned.

    This patch was tested on a PowerPC box. I am still looking at being
    able to test the path through which allocations happen in non-GFP_KERNEL
    contexts.
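
    The visible change is an extra gfp_t parameter on the charge entry
    points, e.g. (abbreviated):

        extern int mem_cgroup_charge(struct page *page, struct mm_struct *mm,
                                     gfp_t gfp_mask);
        extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
                                           gfp_t gfp_mask);

        /* if the cgroup is over its limit, charging may reclaim from it,
         * but only in the modes the caller's gfp_mask allows */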

    [kamezawa.hiroyu@jp.fujitsu.com: problem with ZONE_MOVABLE]
    Signed-off-by: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh