31 Mar, 2011

1 commit


24 Mar, 2011

3 commits

  • In struct page_cgroup, we have a full word for flags but only a few bits
    are reserved. Use the remaining upper bits to encode, depending on
    configuration, the node or the section, to enable page_cgroup-to-page
    lookups without a direct pointer (see the sketch after this entry).

    This saves a full word for every page in a system with memory cgroups
    enabled.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Minchan Kim
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
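
    A minimal sketch of the encoding idea above, assuming a SPARSEMEM-style
    configuration; the width macro and helper names are illustrative, not the
    exact mainline interface.

    /* The upper bits of pc->flags hold the section (or node) id; the low
     * bits remain available for the ordinary PCG_* flag bits. */
    #define PCG_ARRAYID_WIDTH  SECTIONS_SHIFT
    #define PCG_ARRAYID_SHIFT  (BITS_PER_LONG - PCG_ARRAYID_WIDTH)

    static inline void set_page_cgroup_array_id(struct page_cgroup *pc,
                                                unsigned long id)
    {
            pc->flags &= (1UL << PCG_ARRAYID_SHIFT) - 1;   /* keep flag bits   */
            pc->flags |= id << PCG_ARRAYID_SHIFT;          /* store section id */
    }

    static inline unsigned long page_cgroup_array_id(struct page_cgroup *pc)
    {
            return pc->flags >> PCG_ARRAYID_SHIFT;         /* recover section id */
    }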
     
  • It is one logical function; there is no need to have it split up.

    Also, get rid of some checks in the inner function that only ensured the
    sanity of the outer function.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Instead of passing a whole struct page_cgroup to this function, let it
    take only what it really needs from it: the struct mem_cgroup and the
    page.

    This has the advantage that reading pc->mem_cgroup is now done at the same
    place where the ordering rules for this pointer are enforced and
    explained.

    It is also in preparation for removing the pc->page backpointer (see the
    sketch after this entry).

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
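
    A sketch of the new calling convention described above, using a
    hypothetical helper name; the per-zone info lookup it delegates to is an
    assumption, only the shape of the change is meant to be accurate.

    /* Before: the helper took a struct page_cgroup and dereferenced it.
     * After: it takes only what it needs -- the memcg and the page. */
    static struct mem_cgroup_per_zone *
    memcg_page_zoneinfo(struct mem_cgroup *memcg, struct page *page)
    {
            int nid = page_to_nid(page);
            int zid = page_zonenum(page);

            return mem_cgroup_zoneinfo(memcg, nid, zid);
    }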
     

14 Jan, 2011

3 commits

  • The flags added by commit db16d5ec1f87f17511599bc77857dd1662b5a22f
    have no users now. We believe they will be used soon, but for ease of
    patch review the change itself should be folded into the incoming
    set of "dirty ratio for memcg" patches.

    So it's better to drop this change from the current mainline tree.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Greg Thelen
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Introduce a new bit spin lock, PCG_MOVE_LOCK, to synchronize the page
    accounting and migration code. This reworks the locking scheme of
    _update_stat() and _move_account() by adding the new lock bit
    PCG_MOVE_LOCK, which is always taken with IRQs disabled (see the sketch
    after this entry).

    1. If pages are being migrated from a memcg, then updates to that
    memcg's page statistics are protected by grabbing PCG_MOVE_LOCK using
    move_lock_page_cgroup(). In an upcoming commit, memcg dirty page
    accounting will update memcg page accounting (specifically: the number
    of writeback pages) from IRQ context (softirq). Avoid a deadlocking
    nested spin lock attempt by disabling IRQs on the local processor when
    grabbing PCG_MOVE_LOCK.

    2. The lock taken in update_page_stat only avoids races with
    move_account(), so lock_page_cgroup() itself does not need to be
    IRQ-aware. The race is between mem_cgroup_update_page_stat() and
    mem_cgroup_move_account_page().

    Trade-off:
    * Changing lock_page_cgroup() to always disable IRQs (or
    local_bh) has some performance impact, and disabling IRQs
    when it is not necessary is undesirable.
    * Adding a new lock makes move_account() slower. Scores are
    below.

    Performance impact: moving an 8GB anon process.

    Before:
    real 0m0.792s
    user 0m0.000s
    sys 0m0.780s

    After:
    real 0m0.854s
    user 0m0.000s
    sys 0m0.842s

    This score is not great, but planned optimization patches can reduce
    the impact.

    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Greg Thelen
    Reviewed-by: Minchan Kim
    Acked-by: Daisuke Nishimura
    Cc: Andrea Righi
    Cc: Balbir Singh
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
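
    A minimal sketch of the locking pattern described above; whether the IRQ
    handling lives inside these helpers or at their call sites, and the exact
    names, are assumptions here.

    /* PCG_MOVE_LOCK is a bit in pc->flags, taken as a bit spin lock with
     * IRQs disabled so that a statistics update from softirq context can
     * never deadlock against a move in progress on the same CPU. */
    static inline void move_lock_page_cgroup(struct page_cgroup *pc,
                                             unsigned long *flags)
    {
            local_irq_save(*flags);
            bit_spin_lock(PCG_MOVE_LOCK, &pc->flags);
    }

    static inline void move_unlock_page_cgroup(struct page_cgroup *pc,
                                               unsigned long *flags)
    {
            bit_spin_unlock(PCG_MOVE_LOCK, &pc->flags);
            local_irq_restore(*flags);
    }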
     
  • This patchset provides the ability for each cgroup to have independent
    dirty page limits.

    Limiting dirty memory is like fixing the max amount of dirty (hard to
    reclaim) page cache used by a cgroup. So, in case of multiple cgroup
    writers, they will not be able to consume more than their designated share
    of dirty pages and will be forced to perform write-out if they cross that
    limit.

    The patches are based on a series proposed by Andrea Righi in Mar 2010.

    Overview:

    - Add page_cgroup flags to record when pages are dirty, in writeback, or nfs
    unstable.

    - Extend mem_cgroup to record the total number of pages in each of the
    interesting dirty states (dirty, writeback, unstable_nfs).

    - Add dirty parameters similar to the system-wide /proc/sys/vm/dirty_*
    limits to mem_cgroup. The mem_cgroup dirty parameters are accessible
    via cgroupfs control files.

    - Consider both system and per-memcg dirty limits in page writeback when
    deciding to queue background writeback or block for foreground writeback.

    Known shortcomings:

    - When a cgroup dirty limit is exceeded, then bdi writeback is employed to
    write back dirty inodes. Bdi writeback considers inodes from any cgroup, not
    just inodes contributing dirty pages to the cgroup exceeding its limit.

    - When memory.use_hierarchy is set, then dirty limits are disabled. This is an
    implementation detail. An enhanced implementation is needed to check the
    chain of parents to ensure that no dirty limit is exceeded.

    Performance data:
    - A page fault microbenchmark workload was used to measure performance; it
    can be called in read or write mode:

          f = open(foo.$cpu)
          truncate(f, 4096)
          alarm(60)
          while (1) {
                  p = mmap(f, 4096)
                  if (write)
                          *p = 1
                  else
                          x = *p
                  munmap(p)
          }

    - The workload was called for several points in the patch series in different
    modes:
    - s_read is a single threaded reader
    - s_write is a single threaded writer
    - p_read is a 16 thread reader, each operating on a different file
    - p_write is a 16 thread writer, each operating on a different file

    - Measurements were collected on a 16 core non-numa system using "perf stat
    --repeat 3". The -a option was used for parallel (p_*) runs.

    - All numbers are page fault rate (M/sec). Higher is better.

    - To compare the performance of a kernel without memcg, compare the first and
    last rows; neither has memcg configured. The first row does not include any
    of these memcg patches.

    - To compare the performance of using memcg dirty limits, compare the baseline
    (2nd row, titled "w/ memcg") with the code and memcg enabled (2nd-to-last
    row, titled "all patches").

                          root_cgroup                      child_cgroup
                      s_read s_write p_read p_write   s_read s_write p_read p_write
    mmotm w/o memcg   0.428  0.390   0.429  0.388
    mmotm w/ memcg    0.411  0.378   0.391  0.362     0.412  0.377   0.385  0.363
    all patches       0.384  0.360   0.370  0.348     0.381  0.363   0.368  0.347
    all patches       0.431  0.402   0.427  0.395
      w/o memcg

    This patch:

    Add additional flags to page_cgroup to track dirty pages within a
    mem_cgroup (see the sketch after this entry).

    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrea Righi
    Signed-off-by: Greg Thelen
    Acked-by: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Minchan Kim
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
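
    A sketch of the per-page dirty-state tracking this patch introduces, using
    the bit-flag accessor pattern page_cgroup already relies on; the exact flag
    and accessor names here are assumptions.

    enum {
            /* existing PCG_* bits omitted */
            PCG_FILE_DIRTY,         /* page is dirty                          */
            PCG_FILE_WRITEBACK,     /* page is under writeback                */
            PCG_FILE_UNSTABLE_NFS,  /* page sent to NFS but not yet committed */
    };

    #define TESTPCGFLAG(uname, lname)                                \
    static inline int PageCgroup##uname(struct page_cgroup *pc)      \
            { return test_bit(PCG_##lname, &pc->flags); }

    TESTPCGFLAG(FileDirty, FILE_DIRTY)
    TESTPCGFLAG(FileWriteback, FILE_WRITEBACK)
    TESTPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)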
     

25 Nov, 2010

1 commit

  • Fix this:

    kernel BUG at mm/memcontrol.c:2155!
    invalid opcode: 0000 [#1]
    last sysfs file:

    Pid: 18, comm: sh Not tainted 2.6.37-rc3 #3 /Bochs
    EIP: 0060:[] EFLAGS: 00000246 CPU: 0
    EIP is at mem_cgroup_move_account+0xe2/0xf0
    EAX: 00000004 EBX: c6f931d4 ECX: c681c300 EDX: c681c000
    ESI: c681c300 EDI: ffffffea EBP: c681c000 ESP: c46f3e30
    DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
    Process sh (pid: 18, ti=c46f2000 task=c6826e60 task.ti=c46f2000)
    Stack:
    00000155 c681c000 0805f000 c46ee180 c46f3e5c c7058820 c1074d37 00000000
    08060000 c46db9a0 c46ec080 c7058820 0805f000 08060000 c46f3e98 c1074c50
    c106c75e c46f3e98 c46ec080 08060000 0805ffff c46db9a0 c46f3e98 c46e0340
    Call Trace:
    [] ? mem_cgroup_move_charge_pte_range+0xe7/0x130
    [] ? mem_cgroup_move_charge_pte_range+0x0/0x130
    [] ? walk_page_range+0xee/0x1d0
    [] ? mem_cgroup_move_task+0x66/0x90
    [] ? mem_cgroup_move_charge_pte_range+0x0/0x130
    [] ? mem_cgroup_move_task+0x0/0x90
    [] ? cgroup_attach_task+0x136/0x200
    [] ? cgroup_tasks_write+0x48/0xc0
    [] ? cgroup_file_write+0xde/0x220
    [] ? do_page_fault+0x17d/0x3f0
    [] ? alloc_fd+0x2d/0xd0
    [] ? cgroup_file_write+0x0/0x220
    [] ? vfs_write+0x92/0xc0
    [] ? sys_write+0x41/0x70
    [] ? syscall_call+0x7/0xb
    Code: 03 00 74 09 8b 44 24 04 e8 1c f1 ff ff 89 73 04 8d 86 b0 00 00 00 b9 01 00 00 00 89 da 31 ff e8 65 f5 ff ff e9 4d ff ff ff 0f 0b 0b 0f 0b 0f 0b 90 8d b4 26 00 00 00 00 83 ec 10 8b 0d f4 e3
    EIP: [] mem_cgroup_move_account+0xe2/0xf0 SS:ESP 0068:c46f3e30
    ---[ end trace 7daa1582159b6532 ]---

    lock_page_cgroup and unlock_page_cgroup are implemented using
    bit_spinlock. bit_spinlock doesn't touch the bit if we are on a non-SMP
    machine, so we can't use the bit to check whether the lock was taken.

    Let's introduce is_page_cgroup_locked(), based on bit_spin_is_locked()
    instead of PageCgroupLocked(), to fix it (see the sketch after this entry).

    [akpm@linux-foundation.org: s/is_page_cgroup_locked/page_is_cgroup_locked/]
    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
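
    A sketch of the fix described above: query the lock state through
    bit_spin_is_locked() so the check also works on !CONFIG_SMP, where the
    flag bit itself is never set. (After akpm's rename the helper is
    page_is_cgroup_locked(); the body shown here is the obvious shape, and
    PCG_LOCK follows the usual page_cgroup flag naming.)

    static inline int page_is_cgroup_locked(struct page_cgroup *pc)
    {
            /* On !SMP, bit_spin_is_locked() does not rely on the bit,
             * which bit_spin_lock() never sets in that configuration. */
            return bit_spin_is_locked(PCG_LOCK, &pc->flags);
    }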
     

28 May, 2010

1 commit

  • FILE_MAPPED per memcg of migrated file cache is not properly updated,
    because our hook in page_add_file_rmap() can't know to which memcg
    FILE_MAPPED should be counted.

    Basically, this patch is for fixing the bug, but it includes some big
    changes to fix up other messes.

    Now, when migrating a mapped file, events happen in the following sequence.

    1. allocate a new page.
    2. get the memcg of the old page.
    3. charge against the new page before migration. But at this point,
    no changes are made to the new page's page_cgroup, and the charge is not
    committed. (IOW, the PCG_USED bit is not set.)
    4. page migration replaces the radix-tree entry, old-page and new-page.
    5. page migration remaps the new page if the old page was mapped.
    6. Here, the new page is unlocked.
    7. memcg commits the charge for the new page, marking the new page's
    page_cgroup as PCG_USED.

    Because "commit" happens after page-remap, we cannot count FILE_MAPPED
    at step 5, because page_cgroup->mem_cgroup must not be trusted while the
    PCG_USED bit is unset.
    (Note: memcg's LRU removal code does that, but LRU-isolation logic is used
    to help it. When we overwrite page_cgroup->mem_cgroup, the page_cgroup is
    not on the LRU or page_cgroup->mem_cgroup is NULL.)

    We can lose file_mapped accounting information at step 5 because FILE_MAPPED
    is updated only when the mapcount changes 0->1. So we should catch it.

    BTW, historically, the above implementation comes from the migration-failure
    handling of anonymous pages. Because we charge both the old page and the new
    page with mapcount=0, we can't catch
    - the page being really freed before remap.
    - migration failing but the page being freed before remap.
    or ..... other corner cases.

    The new migration sequence with memcg is:

    1. allocate a new page.
    2. mark PageCgroupMigration on the old page.
    3. charge against the new page onto the old page's memcg. (here, the new
    page's pc is marked as PageCgroupUsed.)
    4. page migration replaces radix-tree, page table, etc...
    5. At remapping, the new page's page_cgroup is now marked as "USED".
    We can catch the 0->1 event and FILE_MAPPED will be properly updated.

    6. After this page is unlocked and freed by unmap(), its SWAPOUT event
    can also be caught.

    7. Clear PageCgroupMigration of the old page.

    So, FILE_MAPPED will be correctly updated.

    Then, what is the MIGRATION flag for?
    Without it, at migration failure, we may have to charge the old page again
    because it may have been fully unmapped. "Charging" means we may have to dive
    into memory reclaim or something similarly complicated, so it's better to
    avoid charging it again. Before this patch, __commit_charge() was working on
    both the old and the new page and fixed everything up, but this technique had
    some race conditions around FILE_MAPPED, SWAPOUT, etc.
    Now the kernel uses the MIGRATION flag and does not uncharge the old page
    until the end of migration (see the sketch after this entry).

    I hope this change makes memcg's page migration much simpler. This
    page migration has caused several troubles; it is worth adding a flag
    for simplification.

    Reviewed-by: Daisuke Nishimura
    Tested-by: Daisuke Nishimura
    Reported-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Christoph Lameter
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    akpm@linux-foundation.org
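
    A minimal sketch of the scheme above: the old page keeps its charge and
    carries a MIGRATION marker until migration ends, so a failed migration
    never needs to re-charge it. Names and structure are illustrative
    assumptions, not the exact mainline code.

    static void memcg_prepare_migration(struct page *oldpage)
    {
            struct page_cgroup *pc = lookup_page_cgroup(oldpage);

            lock_page_cgroup(pc);
            SetPageCgroupMigration(pc);     /* do not uncharge while this is set */
            unlock_page_cgroup(pc);
    }

    static void memcg_end_migration(struct page *oldpage)
    {
            struct page_cgroup *pc = lookup_page_cgroup(oldpage);

            lock_page_cgroup(pc);
            ClearPageCgroupMigration(pc);   /* the old page may be uncharged now */
            unlock_page_cgroup(pc);
    }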
     

07 Apr, 2010

1 commit

  • Presently, memcg's FILE_MAPPED accounting has the following race with
    move_account() (which happens at rmdir()).

    increment page->mapcount (rmap.c)
    mem_cgroup_update_file_mapped()       move_account()
                                          lock_page_cgroup()
                                          check page_mapped(); if
                                          page_mapped(page) > 1 {
                                              FILE_MAPPED -1 from old memcg
                                              FILE_MAPPED +1 to new memcg
                                          }
                                          .....
                                          overwrite pc->mem_cgroup
                                          unlock_page_cgroup()
    lock_page_cgroup()
    FILE_MAPPED +1 to pc->mem_cgroup
    unlock_page_cgroup()

    Then,
    old memcg (-1 file mapped)
    new memcg (+2 file mapped)

    This happens because move_account() looks at page_mapped(), which is not
    guarded by lock_page_cgroup(). This patch adds a FILE_MAPPED flag to
    page_cgroup and moves the accounting information based on it, so all
    checks are now synchronized by lock_page_cgroup() (see the sketch after
    this entry).

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Balbir Singh
    Reviewed-by: Daisuke Nishimura
    Cc: Andrea Righi
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
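
    A minimal sketch of the flag-based accounting described above; it follows
    the usual page_cgroup flag conventions, but the helper shape and names are
    assumptions.

    void mem_cgroup_update_file_mapped(struct page *page, int val)
    {
            struct page_cgroup *pc = lookup_page_cgroup(page);

            lock_page_cgroup(pc);
            if (!PageCgroupUsed(pc)) {
                    unlock_page_cgroup(pc);
                    return;
            }
            if (val > 0) {
                    SetPageCgroupFileMapped(pc);    /* remember where we charged */
                    /* FILE_MAPPED +1 on pc->mem_cgroup */
            } else {
                    ClearPageCgroupFileMapped(pc);
                    /* FILE_MAPPED -1 on pc->mem_cgroup */
            }
            unlock_page_cgroup(pc);
            /* move_account() checks the same flag under the same lock, so
             * the transfer and the update can no longer race. */
    }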
     

13 Mar, 2010

1 commit

  • This patch is another core part of this move-charge-at-task-migration
    feature. It enables moving charges of anonymous swaps.

    To move the charge of swap, we need to exchange swap_cgroup's record.

    In the current implementation, swap_cgroup's record is protected by:

    - page lock: if the entry is in the swap cache.
    - swap_lock: if the entry is not in the swap cache.

    This works well for usual swap-in/out activity.

    But this behavior makes the feature of moving swap charges have to check
    many conditions to exchange swap_cgroup's record safely.

    So I changed the modification of swap_cgroup's record (swap_cgroup_record())
    to use xchg, and defined a new function to cmpxchg swap_cgroup's record
    (see the sketch after this entry).

    This patch also enables moving the charge of swap caches that are not
    pte_present but not yet uncharged, which can exist on the swap-out path,
    by getting the target pages via find_get_page() as do_mincore() does.

    [kosaki.motohiro@jp.fujitsu.com: fix ia64 build]
    [akpm@linux-foundation.org: fix typos]
    Signed-off-by: Daisuke Nishimura
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Paul Menage
    Cc: Daisuke Nishimura
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
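
    A sketch of the cmpxchg-style helper described above, assuming the
    2-byte-id-per-swap-entry layout of swap_cgroup and a hypothetical lookup
    helper; the real code also holds the table's lock around the compare.

    /* Exchange the memcg id recorded for @ent, but only if it still holds
     * @old; returns @old on success, 0 if somebody changed it first. */
    static unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
                                              unsigned short old,
                                              unsigned short new)
    {
            struct swap_cgroup *sc = lookup_swap_cgroup(ent);  /* hypothetical */
            unsigned short retval = sc->id;

            if (retval == old)
                    sc->id = new;           /* success: record the new owner */
            else
                    retval = 0;             /* lost the race */
            return retval;
    }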
     

16 Dec, 2009

1 commit

  • mem_cgroup_move_parent() calls try_charge first and cancel_charge on
    failure. IMHO, charge/uncharge (especially charge) is a high-cost
    operation, so we should avoid it as far as possible.

    This patch tries to delay try_charge in mem_cgroup_move_parent() by
    re-ordering the checks it does.

    It also renames mem_cgroup_move_account() to
    __mem_cgroup_move_account(), changes the return value of
    __mem_cgroup_move_account() from int to void, and adds a new
    wrapper (mem_cgroup_move_account()) which checks whether a @pc is valid
    for moving the account and then calls __mem_cgroup_move_account()
    (see the sketch after this entry).

    This patch also removes the last caller of trylock_page_cgroup(), and so
    removes its definition too.

    Signed-off-by: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
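
    A sketch of the wrapper split described above; the validity checks shown
    are illustrative.

    /* Wrapper: verify that @pc is still a valid candidate under the lock,
     * then hand off to the (now void) worker. */
    static int mem_cgroup_move_account(struct page_cgroup *pc,
                                       struct mem_cgroup *from,
                                       struct mem_cgroup *to)
    {
            int ret = -EINVAL;

            lock_page_cgroup(pc);
            if (PageCgroupUsed(pc) && pc->mem_cgroup == from) {
                    __mem_cgroup_move_account(pc, from, to);
                    ret = 0;
            }
            unlock_page_cgroup(pc);
            return ret;
    }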
     

24 Sep, 2009

1 commit

  • Change the memory cgroup to remove the overhead associated with accounting
    all pages in the root cgroup. As a side-effect, we can no longer set a
    memory hard limit in the root cgroup.

    A new flag to track whether the page has been accounted or not has been
    added as well. Flags are now set atomically for page_cgroup;
    pcg_default_flags is now obsolete and has been removed.

    [akpm@linux-foundation.org: fix a few documentation glitches]
    Signed-off-by: Balbir Singh
    Signed-off-by: Daisuke Nishimura
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Li Zefan
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     

20 Sep, 2009

1 commit


12 Jun, 2009

1 commit

  • Now that SLAB is configured at a very early stage, it can be used in
    init routines.

    But replacing alloc_bootmem() in FLATMEM/DISCONTIGMEM's page_cgroup
    initialization breaks the allocation now.
    (It works well in the SPARSEMEM case, which supports MEMORY_HOTPLUG and
    where each page_cgroup allocation stays a reasonable size, i.e.
    < 1 << MAX_ORDER.)

    This patch revives FLATMEM + memory cgroup by using alloc_bootmem.

    In the future, we will either stop supporting FLATMEM (if it has no users)
    or rewrite the flatmem code completely, but that would add messier code
    and more overhead.

    Reported-by: Li Zefan
    Tested-by: Li Zefan
    Tested-by: Ingo Molnar
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Pekka Enberg

    KAMEZAWA Hiroyuki
     

03 Apr, 2009

1 commit

  • Try to use the CSS ID for records in swap_cgroup. With this, on a 64-bit
    machine, the size of swap_cgroup goes down from 8 bytes to 2 bytes.

    This means, when 2GB of swap is equipped (assuming a page size of 4096 bytes):

    Before: size of swap_cgroup = 2G/4k * 8 = 4Mbytes.
    After:  size of swap_cgroup = 2G/4k * 2 = 1Mbytes.

    The reduction is large. Of course, there are trade-offs: the CSS ID lookup
    adds overhead to swap-in/swap-out/swap-free.

    But in general,
    - swap is a resource which users tend to avoid using.
    - If swap is never used, the swap_cgroup area is not used.
    - Going by traditional manuals, the size of swap should be proportional to
    the size of memory, and machine memory sizes keep increasing.

    I think reducing the size of swap_cgroup makes sense (see the sketch after
    this entry).

    Note:
    - The ID->CSS lookup routine has no locks; it runs under an RCU read-side
    section.
    - A memcg can become obsolete at rmdir() but is not freed while a refcount
    from swap_cgroup is still held.

    Changelog v4->v5:
    - reworked on top of memcg-charge-swapcache-to-proper-memcg.patch
    Changelog ->v4:
    - fixed not-configured case.
    - deleted unnecessary comments.
    - fixed NULL pointer bug.
    - fixed message in dmesg.

    [nishimura@mxp.nes.nec.co.jp: css_tryget can be called twice in !PageCgroupUsed case]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Paul Menage
    Cc: Hugh Dickins
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
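
    A minimal sketch of the space saving described above: one 2-byte css id
    per swap entry instead of an 8-byte mem_cgroup pointer. The struct and
    helper below are illustrative assumptions.

    struct swap_cgroup {
            unsigned short id;      /* css id of the owning mem_cgroup, 0 = none */
    };

    /* Record @id for a swap entry and return the previous owner's id.
     * Callers serialize access (page lock or swap_lock). */
    static unsigned short swap_cgroup_record_id(struct swap_cgroup *sc,
                                                unsigned short id)
    {
            unsigned short old = sc->id;

            sc->id = id;
            return old;
    }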
     

09 Jan, 2009

2 commits

  • A big patch for changing memcg's LRU semantics.

    Now,
    - page_cgroup is linked to mem_cgroup's own LRU (per zone).

    - The LRU of page_cgroup is not synchronized with the global LRU.

    - page and page_cgroup are one-to-one and statically allocated.

    - To find which LRU a page_cgroup is on, you have to check pc->mem_cgroup, as in
      lru = page_cgroup_zoneinfo(pc, nid_of_pc, zid_of_pc);

    - SwapCache is handled.

    And when we handle the LRU list of page_cgroup, we do the following.

    pc = lookup_page_cgroup(page);
    lock_page_cgroup(pc); .....................(1)
    mz = page_cgroup_zoneinfo(pc);
    spin_lock(&mz->lru_lock);
    .....add to LRU
    spin_unlock(&mz->lru_lock);
    unlock_page_cgroup(pc);

    But (1) is a spin lock and we have to worry about deadlock with zone->lru_lock.
    So, trylock() is used at (1) now. Without (1), we can't trust that "mz" is correct.

    This is a trial to remove this dirty nesting of locks.
    This patch changes mz->lru_lock to be zone->lru_lock.
    Then, the above sequence will be written as

    spin_lock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU
    mem_cgroup_add/remove/etc_lru() {
    pc = lookup_page_cgroup(page);
    mz = page_cgroup_zoneinfo(pc);
    if (PageCgroupUsed(pc)) {
    ....add to LRU
    }
    }
    spin_unlock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU

    This is much simpler.
    (*) We're safe even if we don't take lock_page_cgroup(pc), because:
    1. pc->mem_cgroup can only be modified
    - at charge.
    - at account_move().
    2. At charge,
    the PCG_USED bit is not set before pc->mem_cgroup is fixed.
    3. At account_move(),
    the page is isolated and not on the LRU.

    Pros.
    - easy for maintenance.
    - memcg can make use of the laziness of pagevec.
    - we don't have to duplicate the LRU/Active/Unevictable bits in page_cgroup.
    - the LRU status of memcg will be synchronized with the global LRU's.
    - the number of locks is reduced.
    - account_move() is simplified very much.
    Cons.
    - may increase the cost of LRU rotation.
    (no impact if memcg is not configured.)

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • For accounting swap, we need a record per swap entry, at least.

    This patch adds the following functions:
    - swap_cgroup_swapon() .... called from swapon.
    - swap_cgroup_swapoff() ... called at the end of swapoff.

    - swap_cgroup_record() .... record information for a swap entry.
    - swap_cgroup_lookup() .... look up information for a swap entry.

    This patch just implements "how to record information"; no actual method
    for limiting the usage of swap is added. These routines use a flat table
    for record and lookup (see the sketch after this entry). A "wise" lookup
    system like a radix tree requires memory allocation when inserting new
    records, but swap-out is usually called under memory shortage (or when
    memcg hits its limit), so I used static allocation. (Dynamic allocation
    is probably not very hard, but it would add another memory allocation on
    the memory-shortage path.)

    Note1: In this patch, we use a pointer to record information, which means
    8 bytes per swap entry. I think we can reduce this once we
    create an "id of cgroup" in the range of 0-65535 or 0-255.

    Reported-by: Daisuke Nishimura
    Reviewed-by: Daisuke Nishimura
    Tested-by: Daisuke Nishimura
    Reported-by: Hugh Dickins
    Reported-by: Balbir Singh
    Reported-by: Andrew Morton
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Pavel Emelianov
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
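
    A minimal sketch of the flat-table record/lookup described above, assuming
    one pointer-sized slot per swap entry indexed by swap type and offset; the
    real code allocates the table in page-sized chunks per swap type at
    swapon time.

    static struct mem_cgroup **swap_record[MAX_SWAPFILES];  /* one table per swap type */

    /* Remember which memcg owns @ent; return the previous owner. */
    struct mem_cgroup *swap_cgroup_record(swp_entry_t ent, struct mem_cgroup *memcg)
    {
            struct mem_cgroup **slot =
                    &swap_record[swp_type(ent)][swp_offset(ent)];
            struct mem_cgroup *old = *slot;

            *slot = memcg;
            return old;
    }

    struct mem_cgroup *swap_cgroup_lookup(swp_entry_t ent)
    {
            return swap_record[swp_type(ent)][swp_offset(ent)];
    }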
     

01 Dec, 2008

1 commit


23 Oct, 2008

1 commit

  • page_cgroup_init() is called from mem_cgroup_init(), but at that
    point we cannot call alloc_bootmem() any more.
    (This caused a panic at boot.)

    This patch moves page_cgroup_init() to init/main.c.

    The boot-time ordering is as follows:
    ==
    parse_args(). # we can trust mem_cgroup_subsys.disabled bit after this.
    ....
    cgroup_init_early() # "early" init of cgroup.
    ....
    setup_arch() # memmap is allocated.
    ...
    page_cgroup_init();
    mem_init(); # we cannot call alloc_bootmem after this.
    ....
    cgroup_init() # mem_cgroup is initialized.
    ==

    Before page_cgroup_init(), mem_map must be initialized. So
    page_cgroup_init() was added to init/main.c directly.

    (*) Maybe this is not very clean, but
    - cgroup_init_early() is too early, and
    - in cgroup_init(), we would have to use vmalloc instead of alloc_bootmem().
    The vmalloc area on x86-32 is precious and we should avoid very large
    vmalloc() calls there. So we want to use alloc_bootmem(), and page_cgroup_init()
    was added directly to init/main.c.

    [akpm@linux-foundation.org: remove unneeded/bad mem_cgroup_subsys declaration]
    [akpm@linux-foundation.org: fix build]
    Acked-by: Balbir Singh
    Tested-by: Balbir Singh
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

20 Oct, 2008

1 commit

  • Allocate all page_cgroup at boot and remove the page_cgroup pointer from
    struct page. This patch adds an interface (sketched after this entry):

    struct page_cgroup *lookup_page_cgroup(struct page*)

    All of FLATMEM/DISCONTIGMEM/SPARSEMEM and MEMORY_HOTPLUG are supported.

    Removing the page_cgroup pointer reduces the amount of memory by
    - 4 bytes per page (32-bit), or
    - 8 bytes per page (64-bit)
    if the memory controller is disabled (even if it is configured in).

    On a usual 8GB x86-32 server, this saves 8MB of NORMAL_ZONE memory.
    On my x86-64 server with 48GB of memory, this saves 96MB of memory.
    I think this reduction makes sense.

    By pre-allocating, kmalloc/kfree in charge/uncharge are removed.
    This means
    - we no longer need to worry about kmalloc failure
    (which can happen depending on the gfp_mask used).
    - we avoid calling kmalloc/kfree.
    - we avoid allocating tons of small objects that can become fragmented.
    - we know up front how much memory will be used for this extra-LRU handling.

    I added printk messages:

    "allocated %ld bytes of page_cgroup"
    "please try cgroup_disable=memory option if you don't want"

    which should be informative enough for users.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
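
    A minimal sketch of the lookup for the FLATMEM/DISCONTIGMEM case, where a
    boot-allocated page_cgroup array hangs off each node's pglist_data and is
    indexed by pfn; the SPARSEMEM variant indexes a per-mem_section array
    instead. The field names are assumptions.

    struct page_cgroup *lookup_page_cgroup(struct page *page)
    {
            unsigned long pfn = page_to_pfn(page);
            struct pglist_data *pgdat = NODE_DATA(page_to_nid(page));

            /* One statically allocated page_cgroup per page in the node. */
            return pgdat->node_page_cgroup + (pfn - pgdat->node_start_pfn);
    }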