12 Feb, 2009

1 commit

  • page_cgroup's page allocation at init/memory hotplug uses kmalloc() and
    vmalloc(). If kmalloc() fails, vmalloc() is used.

    This is because vmalloc() is a very limited resource on 32-bit systems,
    so we want to use kmalloc() first.

    But in this kind of call, __GFP_NOWARN should be specified.
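
    A minimal sketch of the allocation pattern described above (the wrapper
    name here is illustrative, not the actual kernel function):

    /*
     * Sketch: try kmalloc() quietly first, then fall back to vmalloc().
     * __GFP_NOWARN suppresses the allocation-failure warning, since a
     * kmalloc() failure here is expected and handled by the fallback.
     */
    static void *alloc_page_cgroup_table(size_t size, int nid)
    {
            void *base;

            base = kmalloc_node(size, GFP_KERNEL | __GFP_NOWARN, nid);
            if (!base)
                    base = vmalloc_node(size, nid);
            return base;
    }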

    Reported-by: Heiko Carstens
    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Acked-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

09 Jan, 2009

4 commits

  • We check whether mem_cgroup is disabled by testing
    mem_cgroup_subsys.disabled. I think it now has more references than
    expected.

    Replacing
    if (mem_cgroup_subsys.disabled)
    with
    if (mem_cgroup_disabled())

    gives us a cleaner look, I think.
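
    The wrapper is presumably just a trivial static inline along these lines
    (a sketch, not the exact definition in the patch):

    static inline bool mem_cgroup_disabled(void)
    {
            return mem_cgroup_subsys.disabled;
    }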

    [kamezawa.hiroyu@jp.fujitsu.com: fix typo]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hirokazu Takahashi
     
  • A big patch for changing memcg's LRU semantics.

    Now,
    - page_cgroup is linked to mem_cgroup's own LRU (per zone).

    - the LRU of page_cgroup is not synchronized with the global LRU.

    - page and page_cgroup are one-to-one and statically allocated.

    - to find which LRU a page_cgroup is on, you have to check pc->mem_cgroup,
      as in lru = page_cgroup_zoneinfo(pc, nid_of_pc, zid_of_pc);

    - SwapCache is handled.

    And when we handle the LRU list of page_cgroup, we do the following.

    pc = lookup_page_cgroup(page);
    lock_page_cgroup(pc); .....................(1)
    mz = page_cgroup_zoneinfo(pc);
    spin_lock(&mz->lru_lock);
    .....add to LRU
    spin_unlock(&mz->lru_lock);
    unlock_page_cgroup(pc);

    But (1) is a spin lock, and we have to worry about deadlock against
    zone->lru_lock, so trylock() is used at (1) now. Without (1), we can't
    trust that "mz" is correct.

    This is an attempt to remove this dirty nesting of locks.
    This patch changes mz->lru_lock to be zone->lru_lock.
    Then the above sequence can be written as

    spin_lock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU
    mem_cgroup_add/remove/etc_lru() {
        pc = lookup_page_cgroup(page);
        mz = page_cgroup_zoneinfo(pc);
        if (PageCgroupUsed(pc)) {
            ....add to LRU
        }
    }
    spin_unlock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU

    This is much simpler.
    (*) We're safe even if we don't take lock_page_cgroup(pc), because:
    1. pc->mem_cgroup can only be modified
       - at charge.
       - at account_move().
    2. At charge, the PCG_USED bit is not set before pc->mem_cgroup is fixed.
    3. At account_move(), the page is isolated and not on the LRU.

    Pros.
    - easier to maintain.
    - memcg can make use of the laziness of pagevec.
    - we don't have to duplicate the LRU/Active/Unevictable bits in page_cgroup.
    - the LRU status of memcg will be synchronized with the global LRU's.
    - the number of locks is reduced.
    - account_move() is simplified very much.
    Cons.
    - may increase the cost of LRU rotation.
      (no impact if memcg is not configured.)
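
    A rough C sketch of one of the new hooks, following the pseudocode above
    (the function and helper names follow this description; the
    mem_cgroup_per_zone fields are illustrative, not copied from the final
    source):

    /*
     * Sketch only: called with zone->lru_lock already held by the global
     * LRU code in vmscan.c/swap.c, so no extra memcg lock is taken.
     */
    void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru)
    {
            struct page_cgroup *pc;
            struct mem_cgroup_per_zone *mz;

            if (mem_cgroup_disabled())
                    return;
            pc = lookup_page_cgroup(page);
            /* Safe without lock_page_cgroup(), see (*) above. */
            if (!PageCgroupUsed(pc))
                    return;
            mz = page_cgroup_zoneinfo(pc);
            list_add(&pc->lru, &mz->lists[lru]);
    }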

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • For accounting swap, we need a record per swap entry, at least.

    This patch adds the following functions.
    - swap_cgroup_swapon() .... called from swapon
    - swap_cgroup_swapoff() ... called at the end of swapoff.

    - swap_cgroup_record() .... record information of swap entry.
    - swap_cgroup_lookup() .... lookup information of swap entry.

    This patch just implements "how to record information"; there is no
    actual method for limiting the usage of swap yet. These routines use a
    flat table for recording and lookup. A "wise" lookup system like a
    radix-tree requires memory allocation when adding new records, but
    swap-out is usually called under memory shortage (or when memcg hits its
    limit), so I used static allocation. (Dynamic allocation may not be very
    hard, but it adds an extra memory allocation in the memory-shortage
    path.)

    Note 1: Here we use a pointer to record the information, which means
    8 bytes per swap entry. I think we can reduce this once we create an
    "id of cgroup" in the range of 0-65535 or 0-255.

    Reported-by: Daisuke Nishimura
    Reviewed-by: Daisuke Nishimura
    Tested-by: Daisuke Nishimura
    Reported-by: Hugh Dickins
    Reported-by: Balbir Singh
    Reported-by: Andrew Morton
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Pavel Emelianov
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • In init_section_page_cgroup() the section a given pfn belongs to is
    calculated at the top of the function and, despite the fact that the
    pfn/section correspondence does not change, it is recalculated further
    down the same function. By computing this just once and reusing that
    value we save some bytes in the object file and do not waste CPU cycles.
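
    The change is presumably along these lines (a sketch of the shape of the
    function, not the actual diff):

    static int init_section_page_cgroup(unsigned long pfn)
    {
            /* Computed once at the top ... */
            struct mem_section *section = __pfn_to_section(pfn);

            if (section->page_cgroup)
                    return 0;
            /*
             * ... allocate and initialize the page_cgroup array for this
             * section, then store it through 'section' instead of calling
             * __pfn_to_section(pfn) a second time.
             */
            return 0;
    }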

    Signed-off-by: Fernando Luis Vazquez Cao
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fernando Luis Vazquez Cao
     

07 Jan, 2009

1 commit


11 Dec, 2008

1 commit


02 Dec, 2008

1 commit

  • Fixes for memcg/memory hotplug.

    While memory hotplug allocates/frees memmap, page_cgroup doesn't free
    page_cgroup at OFFLINE when page_cgroup was allocated via bootmem
    (because freeing bootmem requires special care).

    Then, if page_cgroup is allocated by bootmem and memmap is freed/allocated
    by memory hotplug, page_cgroup->page == page is no longer true.

    But the current MEM_ONLINE handler doesn't check this, and doesn't update
    page_cgroup->page in the case where it's not necessary to allocate
    page_cgroup. (This was not found because memmap is not freed if
    SPARSEMEM_VMEMMAP is y.)

    I also noticed that MEM_ONLINE can be called against "part of a section",
    so freeing page_cgroup at CANCEL_ONLINE will cause trouble (freeing
    page_cgroup that is still in use). Don't roll back at CANCEL.

    One more thing: the current memory hotplug notifier chain is stopped by
    slub because it sets NOTIFY_STOP_MASK in its return value, so
    page_cgroup's callback is never called (it now has lower priority than
    slub's). I think this slub behavior is not intentional (a BUG), and this
    patch fixes it.
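
    The slub-side fix is presumably along these lines: a hotplug callback
    must not set NOTIFY_STOP_MASK in its return value on success, or
    lower-priority callbacks (such as page_cgroup's) never run. A generic
    sketch, not the actual slub code; do_prepare_for_online() is a
    hypothetical helper:

    static int example_memory_callback(struct notifier_block *self,
                                       unsigned long action, void *arg)
    {
            switch (action) {
            case MEM_GOING_ONLINE:
                    if (do_prepare_for_online(arg))
                            return NOTIFY_BAD;      /* failure: stop the chain */
                    break;
            default:
                    break;
            }
            return NOTIFY_OK;       /* success: let later callbacks run */
    }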

    Another option to consider for page_cgroup allocation:
    - free page_cgroup at OFFLINE even if it came from bootmem
      and remove the special handling. But that requires more changes.

    Addresses http://bugzilla.kernel.org/show_bug.cgi?id=12041

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Tested-by: Badari Pulavarty
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

01 Dec, 2008

1 commit


13 Nov, 2008

1 commit


23 Oct, 2008

2 commits

  • page_cgroup_init() is called from mem_cgroup_init(). But at this
    point, we cannot call alloc_bootmem().
    (and this caused a panic at boot.)

    This patch moves page_cgroup_init() to init/main.c.

    The time table is as follows:
    ==
    parse_args(). # we can trust mem_cgroup_subsys.disabled bit after this.
    ....
    cgroup_init_early() # "early" init of cgroup.
    ....
    setup_arch() # memmap is allocated.
    ...
    page_cgroup_init();
    mem_init(); # we cannot call alloc_bootmem after this.
    ....
    cgroup_init() # mem_cgroup is initialized.
    ==

    Before page_cgroup_init(), mem_map must be initialized. So,
    I added page_cgroup_init() to init/main.c directly.

    (*) Maybe this is not very clean, but
    - cgroup_init_early() is too early, and
    - in cgroup_init(), we would have to use vmalloc instead of alloc_bootmem().
    The vmalloc area on x86-32 is a precious resource, and we should avoid very
    large vmalloc() allocations there. So we want to use alloc_bootmem(), and
    page_cgroup_init() was added directly to init/main.c.

    [akpm@linux-foundation.org: remove unneeded/bad mem_cgroup_subsys declaration]
    [akpm@linux-foundation.org: fix build]
    Acked-by: Balbir Singh
    Tested-by: Balbir Singh
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • mm/page_cgroup.c: In function 'init_section_page_cgroup':
    mm/page_cgroup.c:111: error: implicit declaration of function 'vmalloc_node'
    mm/page_cgroup.c:111: warning: assignment makes pointer from integer without a cast
    mm/page_cgroup.c: In function '__free_page_cgroup':
    mm/page_cgroup.c:140: error: implicit declaration of function 'vfree'
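
    These are the usual symptoms of a missing header; presumably the fix is
    simply adding the include to mm/page_cgroup.c:

    #include <linux/vmalloc.h>      /* declares vmalloc_node() and vfree() */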

    Signed-off-by: Paul Mundt
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Mundt
     

20 Oct, 2008

1 commit

  • Allocate all page_cgroup at boot and remove the page_cgroup pointer from
    struct page. This patch adds an interface:

    struct page_cgroup *lookup_page_cgroup(struct page*)

    FLATMEM/DISCONTIGMEM/SPARSEMEM and MEMORY_HOTPLUG are all supported.
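
    For the SPARSEMEM case the lookup is roughly the following (a sketch
    based on this description; the FLATMEM/DISCONTIGMEM variant indexes a
    per-node array instead):

    /* Each mem_section carries a page_cgroup array, stored so that it can
     * be indexed directly by pfn. */
    struct page_cgroup *lookup_page_cgroup(struct page *page)
    {
            unsigned long pfn = page_to_pfn(page);
            struct mem_section *section = __pfn_to_section(pfn);

            return section->page_cgroup + pfn;
    }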

    Removing the page_cgroup pointer reduces memory usage by
    - 4 bytes per PAGE_SIZE (32-bit), or
    - 8 bytes per PAGE_SIZE (64-bit)
    when the memory controller is disabled (even if it is configured).

    On a usual 8GB x86-32 server, this saves 8MB of NORMAL_ZONE memory.
    On my x86-64 server with 48GB of memory, this saves 96MB of memory.
    I think this reduction makes sense.

    By pre-allocating, the kmalloc/kfree calls in charge/uncharge are removed.
    This means
    - we no longer need to worry about kmalloc failure
      (which can happen because of the gfp_mask type).
    - we can avoid calling kmalloc/kfree.
    - we can avoid allocating tons of small objects, which can cause fragmentation.
    - we know how much memory will be used for this extra LRU handling.

    I added printk messages:

    "allocated %ld bytes of page_cgroup"
    "please try cgroup_disable=memory option if you don't want"

    which should be informative enough for users.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki