29 May, 2009

2 commits

  • Fix build warning, "mem_cgroup_is_obsolete defined but not used" when
    CONFIG_DEBUG_VM is not set. Also avoid checking for !mem again and again.

    Signed-off-by: Nikanth Karthikesan
    Acked-by: Pekka Enberg
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nikanth Karthikesan
     
  • mapping->tree_lock can be acquired from interrupt context. Then, the
    following deadlock can occur.

    Assume "A" as a page.

    CPU0:
    lock_page_cgroup(A)
    interrupted
    -> take mapping->tree_lock.
    CPU1:
    take mapping->tree_lock
    -> lock_page_cgroup(A)

    This patch tries to fix the above deadlock by moving memcg's hooks out of
    mapping->tree_lock. Charge/uncharge of pagecache/swapcache is protected
    by the page lock, not by tree_lock.

    After this patch, lock_page_cgroup() is not called under mapping->tree_lock.
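
    A minimal C sketch of the resulting ordering, assuming the 2.6.30-era
    page-cache helpers (__remove_from_page_cache(), mem_cgroup_uncharge_cache_page());
    it illustrates the locking rule only and is not the literal patch.

    /*
     * Sketch: the memcg uncharge hook runs after mapping->tree_lock is
     * dropped and relies on the page lock instead.
     */
    void remove_from_page_cache_sketch(struct page *page)
    {
            struct address_space *mapping = page->mapping;

            BUG_ON(!PageLocked(page)); /* charge/uncharge is serialized by the page lock */

            spin_lock_irq(&mapping->tree_lock);
            __remove_from_page_cache(page);  /* no lock_page_cgroup() in here anymore */
            spin_unlock_irq(&mapping->tree_lock);

            /* memcg hook moved outside tree_lock: cannot deadlock with the irq path */
            mem_cgroup_uncharge_cache_page(page);
    }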

    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     

03 May, 2009

2 commits

  • Current mem_cgroup_shrink_usage() has two problems.

    1. It doesn't call mem_cgroup_out_of_memory and doesn't update
    last_oom_jiffies, so pagefault_out_of_memory invokes global OOM.

    2. Considering hierarchy, shrinking has to be done from the
    mem_over_limit, not from the memcg which the page would be charged to.

    mem_cgroup_try_charge_swapin() does all of these things properly, so we
    use it and call cancel_charge_swapin when it succeeds.

    The name of "shrink_usage" is not appropriate for this behavior, so we
    change it too.
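
    A minimal sketch of the resulting call pattern, assuming the 2.6.30-era
    helpers mem_cgroup_try_charge_swapin() and mem_cgroup_cancel_charge_swapin();
    the wrapper name below is hypothetical and the body is simplified, not the
    literal patch.

    static int memcg_shrink_for_shmem_sketch(struct mm_struct *mm, struct page *page)
    {
            struct mem_cgroup *memcg = NULL;
            int ret;

            /* charges mem_over_limit's hierarchy and, unlike the old
             * shrink_usage, may call memcg OOM / update last_oom_jiffies */
            ret = mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &memcg);
            if (ret)
                    return ret;

            /* the caller only wanted the reclaim side effect: drop the charge */
            mem_cgroup_cancel_charge_swapin(memcg);
            return 0;
    }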

    Signed-off-by: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Paul Menage
    Cc: Dhaval Giani
    Cc: Daisuke Nishimura
    Cc: YAMAMOTO Takashi
    Cc: KOSAKI Motohiro
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • This is a bugfix for commit 3c776e64660028236313f0e54f3a9945764422df
    ("memcg: charge swapcache to proper memcg").

    The Used bit of swapcache is stable under the page lock, but, considering
    move_account, pc->mem_cgroup is not.

    We need lock_page_cgroup() anyway.

    Signed-off-by: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     

14 Apr, 2009

1 commit


03 Apr, 2009

10 commits

  • Current mem_cgroup_cache_charge is a bit complicated especially
    in the case of shmem's swap-in.

    This patch cleans it up by using try_charge_swapin and commit_charge_swapin.

    Signed-off-by: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • Try to use CSS IDs for records in swap_cgroup. With this, on a 64-bit
    machine, the size of swap_cgroup goes down from 8 bytes to 2 bytes.

    This means, when 2GB of swap is equipped (assuming the page size is 4096
    bytes):

    Before: size of swap_cgroup = 2G/4k * 8 = 4 Mbytes.
    After:  size of swap_cgroup = 2G/4k * 2 = 1 Mbytes.

    The reduction is large. Of course, there are trade-offs: this CSS ID lookup
    adds overhead to swap-in/swap-out/swap-free.

    But in general,
    - swap is a resource users tend to avoid using.
    - if swap is never used, the swap_cgroup area is not used.
    - traditionally, the size of swap should be proportional to the size of
    memory, and machine memory sizes keep increasing.

    I think reducing the size of swap_cgroup makes sense.
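
    A rough sketch of the record change being described; the field names are
    assumed from the mm/page_cgroup.c of that era and are not the literal
    structures.

    /* before: one mem_cgroup pointer recorded per swap entry */
    struct swap_cgroup_old {
            struct mem_cgroup *val;         /* 8 bytes on 64-bit */
    };

    /* after: a 16-bit CSS ID per swap entry, resolved under rcu_read_lock() */
    struct swap_cgroup {
            unsigned short id;              /* 2 bytes */
    };

    /* 2GB of swap / 4096-byte pages = 524288 entries:
     * before: 524288 * 8 = 4 Mbytes,  after: 524288 * 2 = 1 Mbytes */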

    Note:
    - the ID->CSS lookup routine takes no locks; it runs under the RCU read side.
    - a memcg can become obsolete at rmdir() but is not freed while a refcnt
    from swap_cgroup is outstanding.

    Changelog v4->v5:
    - reworked on to memcg-charge-swapcache-to-proper-memcg.patch
    Changelog ->v4:
    - fixed not configured case.
    - deleted unnecessary comments.
    - fixed NULL pointer bug.
    - fixed message in dmesg.

    [nishimura@mxp.nes.nec.co.jp: css_tryget can be called twice in !PageCgroupUsed case]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Paul Menage
    Cc: Hugh Dickins
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • memcg_test.txt says at 4.1:

    This swap-in is one of the most complicated cases. In do_swap_page(),
    the following events occur when the pte is unchanged.

    (1) the page (SwapCache) is looked up.
    (2) lock_page()
    (3) try_charge_swapin()
    (4) reuse_swap_page() (may call delete_swap_cache())
    (5) commit_charge_swapin()
    (6) swap_free().

    Consider the following situations, for example.

    (A) The page has not been charged before (2) and reuse_swap_page()
    doesn't call delete_from_swap_cache().
    (B) The page has not been charged before (2) and reuse_swap_page()
    calls delete_from_swap_cache().
    (C) The page has been charged before (2) and reuse_swap_page() doesn't
    call delete_from_swap_cache().
    (D) The page has been charged before (2) and reuse_swap_page() calls
    delete_from_swap_cache().

    memory.usage/memsw.usage changes to this page/swp_entry will be:

                  Case    (A)      (B)      (C)      (D)
    Event
    Before (2)            0/ 1     0/ 1     1/ 1     1/ 1
    ======================================================
    (3)                  +1/+1    +1/+1    +1/+1    +1/+1
    (4)                    -       0/ 0      -      -1/ 0
    (5)                   0/-1     0/ 0    -1/-1     0/ 0
    (6)                    -       0/-1      -       0/-1
    ======================================================
    Result                1/ 1     1/ 1     1/ 1     1/ 1

    In all cases, the charges to this page should be 1/ 1.

    In case of (D), mem_cgroup_try_get_from_swapcache() returns NULL
    (because lookup_swap_cgroup() returns NULL), so "+1/+1" at (3) means
    charges to the memcg ("foo") to which "current" belongs.
    OTOH, "-1/0" at (4) and "0/-1" at (6) mean uncharges from the memcg ("baa")
    to which the page had been charged.

    So, if "foo" and "baa" are different (for example, because of a task move),
    this charge will be moved from "baa" to "foo".

    I think this is an unexpected behavior.

    This patch fixes this by modifying mem_cgroup_try_get_from_swapcache()
    to return the memcg to which the swapcache has been charged if the
    PCG_USED bit is set.
    IIUC, checking the PCG_USED bit of swapcache is safe under the page lock.
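
    A minimal sketch of the described behavior, assuming the page_cgroup helpers
    of that era (lookup_page_cgroup, lock_page_cgroup, PageCgroupUsed, css_tryget);
    the function name is simplified and the unused-case fallback is omitted, so it
    is not the literal patch.

    static struct mem_cgroup *memcg_from_swapcache_sketch(struct page *page)
    {
            struct page_cgroup *pc = lookup_page_cgroup(page);
            struct mem_cgroup *mem = NULL;

            lock_page_cgroup(pc);   /* PCG_USED is stable while the page is locked */
            if (PageCgroupUsed(pc))
                    mem = pc->mem_cgroup;   /* case (D): charge "baa", the owning memcg */
            if (mem && !css_tryget(&mem->css))
                    mem = NULL;
            unlock_page_cgroup(pc);

            /* !PageCgroupUsed: fall back to the swap_cgroup record / current (omitted) */
            return mem;
    }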

    Signed-off-by: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • Currently, mem_cgroup_calc_mapped_ratio() is not used at all, so it can be
    removed; KAMEZAWA-san suggested this.

    Signed-off-by: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Add RSS and swap to OOM output from memcg

    Display memcg values like failcnt, usage and limit when an OOM occurs due
    to memcg.

    Thanks to Johannes Weiner, Li Zefan, David Rientjes, Kamezawa Hiroyuki,
    Daisuke Nishimura and KOSAKI Motohiro for review.

    Sample output
    -------------

    Task in /a/x killed as a result of limit of /a
    memory: usage 1048576kB, limit 1048576kB, failcnt 4183
    memory+swap: usage 1400964kB, limit 9007199254740991kB, failcnt 0

    [akpm@linux-foundation.org: compilation fix]
    [akpm@linux-foundation.org: fix kerneldoc and whitespace]
    [akpm@linux-foundation.org: add printk facility level]
    Signed-off-by: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Li Zefan
    Cc: Paul Menage
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • This patch tries to fix OOM Killer problems caused by hierarchy.
    Now, memcg itself has an OOM KILL function (in oom_kill.c) and tries to
    kill a task in the memcg.

    But when hierarchy is used, it's broken and the correct task cannot
    be killed. For example, in the following cgroup tree

    /groupA/        hierarchy=1, limit=1G
        01          nolimit
        02          nolimit

    all tasks' memory usage under /groupA, /groupA/01 and /groupA/02 is limited
    to groupA's 1 Gbytes, but the OOM Killer just kills tasks in groupA.

    This patch makes the bad process be selected from all tasks under the
    hierarchy. BTW, currently, oom_jiffies is updated only against groupA in
    the above case; oom_jiffies of the whole tree should be updated.

    To see how oom_jiffies is used, please check mem_cgroup_oom_called()
    callers.

    [akpm@linux-foundation.org: build fix]
    [akpm@linux-foundation.org: const fix]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Paul Menage
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • As pointed out, shrinking memcg's limit should return -EBUSY after
    reasonable retries. This patch tries to fix the current behavior of
    shrink_usage.

    Before looking into the "shrink should return -EBUSY" problem, we should
    fix the hierarchical reclaim code. It compares current usage and current
    limit, but that only makes sense when the kernel reclaims memory because it
    hit the limit. This is also a problem.

    What this patch does:

    1. add a new argument "shrink" to hierarchical reclaim. If "shrink==true",
    hierarchical reclaim returns immediately and the caller checks whether the
    kernel should shrink more or not.
    (When shrinking memory, usage is always smaller than the limit, so a check
    for usage < limit is useless.)

    2. To adjust to the above change, two changes in "shrink"'s retry path.
    2-a. retry_count depends on the number of children, because the kernel
    visits the children under the hierarchy one by one.
    2-b. rather than checking the return value of hierarchical_reclaim's
    progress, compare usage-before-shrink and usage-after-shrink.
    If usage-before-shrink
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Paul Menage
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Clean up memory.stat file routine and show "total" hierarchical stat.

    This patch:
    - renames get_all_zonestat to get_local_zonestat.
    - removes the old mem_cgroup_stat_desc, which is only for per-cpu stats.
    - adds mcs_stat to cover both per-cpu and per-lru stats.
    - adds "total" stats for the hierarchy (*).
    - adds a callback system to scan all memcg under a root.
    == "total" is added.
    [kamezawa@localhost ~]$ cat /opt/cgroup/xxx/memory.stat
    cache 0
    rss 0
    pgpgin 0
    pgpgout 0
    inactive_anon 0
    active_anon 0
    inactive_file 0
    active_file 0
    unevictable 0
    hierarchical_memory_limit 50331648
    hierarchical_memsw_limit 9223372036854775807
    total_cache 65536
    total_rss 192512
    total_pgpgin 218
    total_pgpgout 155
    total_inactive_anon 0
    total_active_anon 135168
    total_inactive_file 61440
    total_active_file 4096
    total_unevictable 0
    ==
    (*) maybe the user could calculate hierarchical stats with his own program
    in userland, but if it can be written in a clean way, it's worth being
    shown, I think.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Paul Menage
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Assign a CSS ID to each memcg and use css_get_next() for scanning the
    hierarchy.

    Assume the following tree.

    group_A (ID=3)
        /01 (ID=4)
            /0A (ID=7)
        /02 (ID=10)
    group_B (ID=5)

    and a task in group_A/01/0A hits the limit at group_A.

    reclaim will be done in the following order (round-robin).
    group_A(3) -> group_A/01 (4) -> group_A/01/0A (7) -> group_A/02(10)
    -> group_A -> .....

    Round robin by ID. The last visited cgroup is recorded, and reclaim
    restarts from it the next time.
    (A smarter algorithm could be implemented.)

    No cgroup_mutex or hierarchy_mutex is required.
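
    A minimal sketch of one round-robin step, assuming the css_get_next()
    interface added by this series and an integer last_scanned_child ID;
    wrap-around and termination handling are omitted, so it is not the literal
    code.

    static struct mem_cgroup *next_reclaim_target_sketch(struct mem_cgroup *root_mem)
    {
            struct cgroup_subsys_state *css;
            struct mem_cgroup *ret = NULL;
            int nextid = root_mem->last_scanned_child + 1;  /* resume after last visit */
            int found;

            rcu_read_lock();        /* the ID -> CSS lookup runs under the RCU read side */
            css = css_get_next(&mem_cgroup_subsys, nextid, &root_mem->css, &found);
            if (css && css_tryget(css))
                    ret = container_of(css, struct mem_cgroup, css);
            rcu_read_unlock();

            root_mem->last_scanned_child = found;   /* remember where to restart */
            return ret;     /* caller css_put()s when done; no cgroup_mutex needed */
    }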

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Paul Menage
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • In the following situation, with the memory subsystem,

    /groupA    use_hierarchy==1
        /01    some tasks
        /02    some tasks
        /03    some tasks
        /04    empty

    When tasks under 01/02/03 hit the limit on /groupA, hierarchical reclaim
    is triggered and the kernel walks the tree under groupA. In this case,
    rmdir /groupA/04 fails frequently with -EBUSY because of temporary
    refcounts taken by the kernel.

    In general, a cgroup can be rmdir'd if it has no child groups and no tasks.
    Frequent failures of rmdir() are not useful to users
    (and the reason for -EBUSY is unknown to users, in most cases).

    This patch tries to modify the above behavior by:
    - retrying if the css refcount is held by someone.
    - adding a return value to pre_destroy(), which allows the subsystem to
    say "we're really busy!"

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Paul Menage
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

30 Jan, 2009

2 commits

  • N_POSSIBLE doesn't mean a node has memory, so force_empty can
    visit an invalid node which has no pgdat.

    To visit all valid nodes, N_HIGH_MEMORY should be used.
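
    A tiny sketch of the difference (for_each_node_state() and the node-state
    masks are the stock nodemask API; the printk is only for illustration):

    int nid;

    /* N_POSSIBLE can include nodes that are not populated and have no pgdat;
     * N_HIGH_MEMORY walks only nodes that actually have memory */
    for_each_node_state(nid, N_HIGH_MEMORY)
            printk(KERN_DEBUG "memcg: force_empty scanning node %d\n", nid);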

    Reported-by: Li Zefan
    Signed-off-by: KAMEZAWA Hiroyuki
    Tested-by: Li Zefan
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • The lifetime of struct cgroup and struct mem_cgroup is different and
    mem_cgroup has its own reference count for handling references from
    swap_cgroup.

    This causes the strange problem that the parent mem_cgroup can die while a
    child mem_cgroup is still alive, and this causes a bug in the
    use_hierarchy==1 case because res_counter_uncharge climbs up the tree.

    This patch avoids it by getting the parent at creation and putting it at
    freeing.

    Signed-off-by: Daisuke Nishimura
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Cc: Li Zefan
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     

16 Jan, 2009

6 commits

  • (suppose: memcg->use_hierarchy == 0 and memcg->swappiness == 60)

    echo 10 > /memcg/0/swappiness   |
    mem_cgroup_swappiness_write()   |
    ...                             | echo 1 > /memcg/0/use_hierarchy
                                    | mkdir /mnt/0/1
                                    | sub_memcg->swappiness = 60;
    memcg->swappiness = 10;         |

    In the above scenario, we end up having 2 different swappiness
    values in a single hierarchy.

    We should hold cgroup_lock() when checking the cgrp->children list.
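
    A minimal sketch of that check, assuming the 2.6.29-era cgroup helpers
    (cgroup_lock(), mem_cgroup_from_cont()) and the swappiness field this series
    adds; simplified, not the literal diff.

    static int mem_cgroup_swappiness_write_sketch(struct cgroup *cgrp, u64 val)
    {
            struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);

            if (val > 100)
                    return -EINVAL;

            cgroup_lock();  /* keeps ->children stable against concurrent mkdir/rmdir */
            if (memcg->use_hierarchy && !list_empty(&cgrp->children)) {
                    cgroup_unlock();
                    return -EINVAL; /* avoid mixed swappiness inside one hierarchy */
            }
            memcg->swappiness = val;
            cgroup_unlock();
            return 0;
    }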

    Signed-off-by: Li Zefan
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Paul Menage
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • At system boot when creating the top cgroup, mem_cgroup_create() calls
    enable_swap_cgroup() which is marked as __init, so mark
    mem_cgroup_create() as __ref to avoid a false section-mismatch warning.
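
    A small sketch of the annotation; the signature follows the cgroup
    create-callback convention of that kernel and the bodies are omitted.

    /* enable_swap_cgroup() is __init: its text is discarded after boot */
    static void __init enable_swap_cgroup(void);

    /*
     * mem_cgroup_create() calls it only for the root cgroup, which is created
     * exactly once at boot, so the cross-section reference is legitimate;
     * __ref tells modpost not to warn about it.
     */
    static struct cgroup_subsys_state * __ref
    mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont);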

    Reported-by: Rakib Mullick
    Signed-off-by: Li Zefan
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • In the previous implementation, mem_cgroup_try_charge checked the return
    value of mem_cgroup_try_to_free_pages, and just retried if some pages
    had been reclaimed.
    But now, try_charge (and mem_cgroup_hierarchical_reclaim called from it)
    only checks whether the usage is less than the limit.

    This patch restores the previous behavior to cause OOM less frequently.

    Signed-off-by: Daisuke Nishimura
    Acked-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Pavel Emelyanov
    Cc: Li Zefan
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • If root_mem has no children, last_scanned_child is set to root_mem itself.
    But after some children are added to root_mem, mem_cgroup_get_next_node can
    mem_cgroup_put() root_mem even though root_mem was never mem_cgroup_get().

    This patch fixes this behavior by:

    - Set last_scanned_child to NULL if root_mem has no children or the DFS
    search has returned to root_mem itself (root_mem is not a "child" of
    root_mem). Make mem_cgroup_get_first_node return root_mem in this case.
    There are no mem_cgroup_get/put calls for root_mem.

    - Rename mem_cgroup_get_next_node to __mem_cgroup_get_next_node, and
    mem_cgroup_get_first_node to mem_cgroup_get_next_node. Make
    mem_cgroup_hierarchical_reclaim call only new mem_cgroup_get_next_node.

    Signed-off-by: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Pavel Emelyanov
    Cc: Li Zefan
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • There is a bug in the error path of mem_cgroup_move_parent.

    The extra refcnt taken by try_charge should be dropped, and the usages
    incremented by try_charge should be decremented, in both error paths:

    A: failure at get_page_unless_zero
    B: failure at isolate_lru_page

    This bug makes this parent directory unremovable.

    In case A, rmdir doesn't return, because res.usage doesn't go down to 0
    in mem_cgroup_force_empty even after all the page_cgroups on the LRU are
    removed.

    In case B, rmdir fails and returns -EBUSY, because extra refcounts remain
    even after res.usage goes down to 0.

    Signed-off-by: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Cc: Pavel Emelyanov
    Cc: Li Zefan
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • In the case of swap-in, a new page is added to the LRU before it is charged,
    so page->pc->mem_cgroup points to NULL or to the last mem_cgroup the page
    was charged to.

    In the latter case, if the mem_cgroup has already been freed by rmdir,
    the area pointed to by page->pc->mem_cgroup may contain invalid data.

    Actually, I saw general protection fault.

    general protection fault: 0000 [#1] SMP
    last sysfs file: /sys/devices/system/cpu/cpu15/cache/index1/shared_cpu_map
    CPU 4
    Modules linked in: ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp ipv6 autofs4 hidp rfcomm l2cap bluetooth sunrpc dm_mirror dm_region_hash dm_log dm_multipath dm_mod rfkill input_polldev sbs sbshc battery ac lp sg ide_cd_mod cdrom button serio_raw acpi_memhotplug parport_pc e1000 rtc_cmos parport rtc_core rtc_lib i2c_i801 i2c_core shpchp pcspkr ata_piix libata megaraid_mbox megaraid_mm sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd [last unloaded: microcode]
    Pid: 26038, comm: page01 Tainted: G W 2.6.28-rc9-mm1-mmotm-2008-12-22-16-14-f2ab3dea #1
    RIP: 0010:[] [] update_page_reclaim_stat+0x2f/0x42
    RSP: 0000:ffff8801ee457da8 EFLAGS: 00010002
    RAX: 32353438312021c8 RBX: 0000000000000000 RCX: 32353438312021c8
    RDX: 0000000000000000 RSI: ffff8800cb0b1000 RDI: ffff8801164d1d28
    RBP: ffff880110002cb8 R08: ffff88010f2eae23 R09: 0000000000000001
    R10: ffff8800bc514b00 R11: ffff880110002c00 R12: 0000000000000000
    R13: ffff88000f484100 R14: 0000000000000003 R15: 00000000001200d2
    FS: 00007f8a261726f0(0000) GS:ffff88010f2eaa80(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00007f8a25d22000 CR3: 00000001ef18c000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process page01 (pid: 26038, threadinfo ffff8801ee456000, task ffff8800b585b960)
    Stack:
    ffffe200071ee568 ffff880110001f00 0000000000000000 ffffffff8028ea17
    ffff88000f484100 0000000000000000 0000000000000020 00007f8a25d22000
    ffff8800bc514b00 ffffffff8028ec34 0000000000000000 0000000000016fd8
    Call Trace:
    [] ? ____pagevec_lru_add+0xc1/0x13c
    [] ? drain_cpu_pagevecs+0x36/0x89
    [] ? swapin_readahead+0x78/0x98
    [] ? handle_mm_fault+0x3d9/0x741
    [] ? do_page_fault+0x3ce/0x78c
    [] ? trace_hardirqs_off_thunk+0x3a/0x3c
    [] ? page_fault+0x1f/0x30
    Code: cc 55 48 8d af b8 0d 00 00 48 89 f7 53 89 d3 e8 39 85 02 00 48 63 d3 48 ff 44 d5 10 45 85 e4 74 05 48 ff 44 d5 00 48 85 c0 74 0e ff 44 d0 10 45 85 e4 74 04 48 ff 04 d0 5b 5d 41 5c c3 41 54
    RIP [] update_page_reclaim_stat+0x2f/0x42
    RSP

    Signed-off-by: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Cc: Li Zefan
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     

09 Jan, 2009

17 commits

  • Update the memory controller to use its hierarchy_mutex rather than
    calling cgroup_lock() to protect against cgroup_mkdir()/cgroup_rmdir()
    occurring in its hierarchy.

    Signed-off-by: Paul Menage
    Tested-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • Now, you can see the following even when swap accounting is enabled.

    1. Create groups 01 and 02.
    2. Allocate a "file" on tmpfs by a task under 01.
    3. Swap out the "file" (by memory pressure).
    4. Read the "file" from a task in group 02.
    5. The charge of the "file" is moved to group 02.

    This is not ideal behavior. It happens because SwapCache loaded by
    read-ahead is not taken into account.

    This is a patch to fix shmem's swapcache behavior.
    - remove mem_cgroup_cache_charge_swapin().
    - add a SwapCache handler routine to mem_cgroup_cache_charge().
    With this, shmem's file cache is charged at add_to_page_cache()
    with GFP_NOWAIT.
    - pass the swapcache page to shrink_mem_cgroup.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Paul Menage
    Cc: Li Zefan
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Now, a page can be deleted from SwapCache during do_swap_page().
    memcg-fix-swap-accounting-leak-v3.patch handles that, but LRU handling is
    still broken (the above behavior broke an assumption of the
    memcg-synchronized-lru patch).

    This patch is a fix for LRU handling (especially for per-zone counters).
    When charging SwapCache:
    - remove the page_cgroup from the LRU if it's not used.
    - add the page_cgroup to the LRU if it's not linked yet.

    Reported-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Paul Menage
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • From: KAMEZAWA Hiroyuki

    css_tryget() is newly added; with it we can know whether a css is alive and
    take a refcount on it in a safe way. ("alive" here means "rmdir/destroy" has
    not been called.)

    This patch replaces css_get() with css_tryget() where I cannot explain
    why css_get() is safe, and removes the memcg->obsolete flag.
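
    A minimal sketch of the pattern the code moves to; the helper name is
    hypothetical, while css_tryget()/css_put() are the cgroup core primitives.

    static struct mem_cgroup *get_memcg_if_alive_sketch(struct mem_cgroup *mem)
    {
            /* css_tryget() fails once rmdir/destroy has begun, so callers learn
             * the memcg is dead without consulting an "obsolete" flag */
            if (!mem || !css_tryget(&mem->css))
                    return NULL;

            return mem;     /* caller drops the reference with css_put(&mem->css) */
    }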

    Reviewed-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Paul Menage
    Cc: Daisuke Nishimura
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • 1. Fix a double-free BUG in the error path of mem_cgroup_create():
    mem_cgroup_free() itself frees the per-zone info.
    2. Make the refcnt of memcg simple:
    add 1 refcnt at creation and free when the refcnt goes down to 0.
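
    A sketch of the simplified refcount rule in point 2; the helper names follow
    the mem_cgroup_get/put convention of that era and the free routine is
    assumed.

    static void mem_cgroup_get(struct mem_cgroup *mem)
    {
            atomic_inc(&mem->refcnt);
    }

    static void mem_cgroup_put(struct mem_cgroup *mem)
    {
            /* the refcnt starts at 1 when the memcg is created, so the last
             * put frees everything, including the per-zone info, exactly once */
            if (atomic_dec_and_test(&mem->refcnt))
                    __mem_cgroup_free(mem);
    }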

    Reviewed-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Paul Menage
    Cc: Daisuke Nishimura
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Fix swapin charge operation of memcg.

    Now, memcg has hooks into the swap-out operation and checks whether the
    SwapCache is really unused or not. That check depends on the contents of
    struct page, i.e. if PageAnon(page) && page_mapped(page), the page is
    recognized as still in use.

    Now, reuse_swap_page() calls delete_from_swap_cache() before any rmap is
    established. Then, in the following sequence

    (Page fault with WRITE)
    try_charge() (charge += PAGESIZE)
    commit_charge() (Check page_cgroup is used or not..)
    reuse_swap_page()
    -> delete_from_swapcache()
    -> mem_cgroup_uncharge_swapcache() (charge -= PAGESIZE)
    ......
    the new charge is uncharged soon....
    To avoid this, move commit_charge() to after page_mapcount() goes up to 1.
    With this,

    try_charge() (usage += PAGESIZE)
    reuse_swap_page() (may usage -= PAGESIZE if PCG_USED is set)
    commit_charge() (If page_cgroup is not marked as PCG_USED,
    add new charge.)
    Accounting will be correct.

    Changelog (v2) -> (v3)
    - fixed invalid charge to swp_entry==0.
    - updated documentation.
    Changelog (v1) -> (v2)
    - fixed comment.

    [nishimura@mxp.nes.nec.co.jp: swap accounting leak doc fix]
    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Tested-by: Balbir Singh
    Cc: Hugh Dickins
    Cc: Daisuke Nishimura
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • mem_cgroup_hierarchical_reclaim() works properly even when !use_hierarchy
    now (by memcg-hierarchy-avoid-unnecessary-reclaim.patch), so it should be
    used in many cases instead of try_to_free_mem_cgroup_pages().

    The only exception is force_empty. The group has no children in this
    case.

    Signed-off-by: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • mpol_rebind_mm(), which can be called from cpuset_attach(), does
    down_write(mm->mmap_sem). This means down_write(mm->mmap_sem) can be
    called under cgroup_mutex.

    OTOH, the page fault path does down_read(mm->mmap_sem) and calls
    mem_cgroup_try_charge_xxx(), which may eventually call
    mem_cgroup_out_of_memory(). And mem_cgroup_out_of_memory() calls
    cgroup_lock(). This means cgroup_lock() can be called under
    down_read(mm->mmap_sem).

    If those two paths race, deadlock can happen.

    This patch avoids this deadlock by:
    - removing cgroup_lock() from mem_cgroup_out_of_memory().
    - defining a new mutex (memcg_tasklist) and serializing mem_cgroup_move_task()
    (the ->attach handler of the memory cgroup) and mem_cgroup_out_of_memory().
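
    A minimal sketch of the new serialization; memcg_tasklist comes from the
    description above, and the two functions are reduced to the locking that
    matters.

    static DEFINE_MUTEX(memcg_tasklist);

    static void mem_cgroup_move_task_sketch(void)
    {
            mutex_lock(&memcg_tasklist);    /* attach path: no cgroup_lock() needed */
            /* move the task's accounting here */
            mutex_unlock(&memcg_tasklist);
    }

    static void mem_cgroup_out_of_memory_sketch(void)
    {
            mutex_lock(&memcg_tasklist);    /* replaces cgroup_lock() on the fault path */
            /* select and kill a task here */
            mutex_unlock(&memcg_tasklist);
    }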

    Signed-off-by: Daisuke Nishimura
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • After previous patch, mem_cgroup_try_charge is not used by anyone, so we
    can remove it.

    Signed-off-by: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • I think triggering OOM at mem_cgroup_prepare_migration would be just a bit
    overkill. Returning -ENOMEM would be enough for
    mem_cgroup_prepare_migration. The caller would handle the case anyway.

    Signed-off-by: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • Show "real" limit of memcg. This helps my debugging and maybe useful for
    users.

    While testing hierarchy like this

    mount -t cgroup none /cgroup -t memory
    mkdir /cgroup/A
    set use_hierarchy==1 to "A"
    mkdir /cgroup/A/01
    mkdir /cgroup/A/01/02
    mkdir /cgroup/A/01/03
    mkdir /cgroup/A/01/03/04
    mkdir /cgroup/A/08
    mkdir /cgroup/A/08/01
    ....
    and setting each its own limit, the "real" limit of each memcg is unclear.
    This patch shows the real limit by checking all ancestors.

    Changelog: (v1) -> (v2)
    - remove "if" and use "min(a,b)"

    Acked-by: Balbir Singh
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Currently, the inactive_ratio of memcg is calculated when the limit is set,
    because page_alloc.c does so and the current implementation is a
    straightforward port of it.

    However, memcg recently introduced the hierarchy feature. Under hierarchy
    restriction, the memory limit is decided not only by memory.limit_in_bytes
    of the current cgroup, but also by the parent's limit and sibling memory
    usage.

    Then, the optimal inactive_ratio changes frequently, so calculating it
    every time is better.

    Tested-by: KAMEZAWA Hiroyuki
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: KOSAKI Motohiro
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Currently, /proc/sys/vm/swappiness can change the swappiness ratio for
    global reclaim. However, memcg reclaim doesn't have a tuning parameter of
    its own.

    In general, the optimal swappiness depends on the workload (e.g. an HPC
    workload needs lower swappiness than the others).

    Then, per-cgroup swappiness improves administrator tunability.

    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: KOSAKI Motohiro
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Currently, mem_cgroup doesn't have its own lock, and most of its members
    don't need one (e.g. mem_cgroup->info is protected by the zone lock,
    mem_cgroup->stat is a per-cpu variable).

    However, there is one explicit exception: mem_cgroup->prev_priority needs
    a lock but is not protected by one. Luckily, this is NOT a bug because
    prev_priority isn't used by the current reclaim code.

    However, we plan to use prev_priority again in the future. Therefore,
    fixing it is better.

    In addition, we plan to reuse this lock for other members, so the name
    "reclaim_param_lock" is better than "prev_priority_lock".

    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: KOSAKI Motohiro
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Add the following fields to the memory.stat file:

    - inactive_ratio
    - recent_rotated_anon
    - recent_rotated_file
    - recent_scanned_anon
    - recent_scanned_file

    Acked-by: Rik van Riel
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: KOSAKI Motohiro
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Now, get_scan_ratio() returns the correct value during memcg reclaim, so
    mem_cgroup_calc_reclaim() can be removed.

    With this, memcg reclaim gets the same anon/file reclaim balancing
    capability as global reclaim.

    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Introduce the mem_cgroup_per_zone::reclaim_stat member and its
    statistics-collecting function.

    Now, get_scan_ratio() can calculate the correct value during memcg reclaim.

    [hugh@veritas.com: avoid reclaim_stat oops when disabled]
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro