07 Jan, 2009

3 commits

  • If we add NOOP stubs for SetPageSwapCache() and ClearPageSwapCache(), then
    we can remove the #ifdef CONFIG_SWAPs from mm/migrate.c.
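
    A minimal sketch of the stub pattern, assuming the existing
    PAGEFLAG_FALSE / SETPAGEFLAG_NOOP / CLEARPAGEFLAG_NOOP helper macros in
    include/linux/page-flags.h:

    #ifdef CONFIG_SWAP
    PAGEFLAG(SwapCache, swapcache)
    #else
    /* With CONFIG_SWAP=n, Set/ClearPageSwapCache compile to empty inlines,
     * so mm/migrate.c can call them unconditionally. */
    PAGEFLAG_FALSE(SwapCache)
            SETPAGEFLAG_NOOP(SwapCache) CLEARPAGEFLAG_NOOP(SwapCache)
    #endif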

    Signed-off-by: Hugh Dickins
    Acked-by: Christoph Lameter
    Cc: Nick Piggin
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • pp->page is never used when not set to the right page, so there is no need
    to set it to ZERO_PAGE(0) by default.

    Signed-off-by: Brice Goglin
    Acked-by: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brice Goglin
     
  • Rework do_pages_move() to work by page-sized chunks of struct page_to_node
    that are passed to do_move_page_to_node_array(). We now only have to
    allocate a single page instead of a possibly very large vmalloc area to
    store all page_to_node entries.

    As a result, new_page_node() will now have a very small lookup, hiding
    much of the overall sys_move_pages() overhead.
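
    A rough sketch of the chunking loop (error handling and the user-space
    copies elided; names follow the text above):

    struct page_to_node *pm;
    unsigned long chunk_start, chunk_nr;

    pm = (struct page_to_node *)__get_free_page(GFP_KERNEL);

    for (chunk_start = 0; chunk_start < nr_pages; chunk_start += chunk_nr) {
            chunk_nr = min(nr_pages - chunk_start,
                           PAGE_SIZE / sizeof(struct page_to_node) - 1);
            /* fill pm[] from the user-supplied pages[]/nodes[] arrays ... */
            err = do_move_page_to_node_array(mm, pm, flags & MPOL_MF_MOVE_ALL);
            /* ... and copy the per-page status back to user space */
    }
    free_page((unsigned long)pm);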

    Signed-off-by: Brice Goglin
    Signed-off-by: Nathalie Furmento
    Acked-by: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brice Goglin
     

17 Dec, 2008

1 commit

  • Commit 80bba1290ab5122c60cdb73332b26d288dc8aedd removed one necessary
    variable initialization. As a result, the following warning appeared:

    CC mm/migrate.o
    mm/migrate.c: In function 'sys_move_pages':
    mm/migrate.c:1001: warning: 'err' may be used uninitialized in this function

    Worse, if find_vma() failed, the kernel read uninitialized memory.
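
    The fix is essentially to (re)initialize err at the top of the loop so
    every early-exit path stores a defined value; a sketch:

    for (i = 0; i < nr_pages; i++) {
            unsigned long addr = (unsigned long)pm[i].addr;

            err = -EFAULT;          /* defined even if find_vma() fails */
            vma = find_vma(mm, addr);
            if (!vma)
                    goto set_status;
            /* look up the page, check its node, migrate, etc. */
    set_status:
            pm[i].status = err;
    }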

    Signed-off-by: KOSAKI Motohiro
    CC: Brice Goglin
    Cc: Christoph Lameter
    Cc: KAMEZAWA Hiroyuki
    Cc: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

11 Dec, 2008

1 commit

  • Since commit 2f007e74bb85b9fc4eab28524052161703300f1a, do_pages_stat()
    gets the page address from user-space and puts the corresponding status
    back while holding the mmap_sem for read. There is no need to hold
    mmap_sem while page faults on those user-space buffers may occur.

    This patch adds a temporary address and status buffer so as to only
    hold mmap_sem while working on these kernel buffers. This is
    implemented by extracting do_pages_stat_array() out of do_pages_stat().
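
    A sketch of the resulting shape (the chunk-size constant is
    hypothetical):

    #define DO_PAGES_STAT_CHUNK_NR 16

    while (nr_pages) {
            const void __user *chunk_pages[DO_PAGES_STAT_CHUNK_NR];
            int chunk_status[DO_PAGES_STAT_CHUNK_NR];
            unsigned long chunk_nr = min(nr_pages,
                            (unsigned long)DO_PAGES_STAT_CHUNK_NR);

            /* may fault: mmap_sem is NOT held across this copy */
            if (copy_from_user(chunk_pages, pages,
                               chunk_nr * sizeof(*chunk_pages)))
                    return -EFAULT;

            down_read(&mm->mmap_sem);
            do_pages_stat_array(mm, chunk_nr, chunk_pages, chunk_status);
            up_read(&mm->mmap_sem);

            if (copy_to_user(status, chunk_status,
                             chunk_nr * sizeof(*status)))
                    return -EFAULT;

            pages += chunk_nr;
            status += chunk_nr;
            nr_pages -= chunk_nr;
    }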

    Signed-off-by: Brice Goglin
    Cc: Christoph Lameter
    Cc: KAMEZAWA Hiroyuki
    Cc: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brice Goglin
     

20 Nov, 2008

1 commit

  • Page migration's writeout() has got understandably confused by the nasty
    AOP_WRITEPAGE_ACTIVATE case: as in normal success, a writepage() error has
    unlocked the page, so writeout() then needs to relock it.
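
    The fix boils down to treating any return other than
    AOP_WRITEPAGE_ACTIVATE as "the page came back unlocked"; a sketch of
    writeout():

    rc = mapping->a_ops->writepage(page, &wbc);

    if (rc != AOP_WRITEPAGE_ACTIVATE)
            /* unlocked, as on normal success: relock before going on */
            lock_page(page);

    return (rc < 0) ? -EIO : -EAGAIN;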

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

14 Nov, 2008

4 commits

  • Conflicts:
    security/keys/internal.h
    security/keys/process_keys.c
    security/keys/request_key.c

    Fixed the conflicts above by using the non-'tsk' versions.

    Signed-off-by: James Morris

    James Morris
     
  • Use RCU to access another task's creds and to release a task's own creds.
    This means that it will be possible for the credentials of a task to be
    replaced without another task (a) requiring a full lock to read them, and (b)
    seeing deallocated memory.
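
    Reading another task's creds then follows the usual RCU pattern; a
    minimal sketch, assuming the __task_cred() accessor this series adds:

    const struct cred *cred;
    uid_t uid;

    rcu_read_lock();
    cred = __task_cred(task);       /* only valid inside the read lock */
    uid = cred->uid;
    rcu_read_unlock();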

    Signed-off-by: David Howells
    Acked-by: James Morris
    Acked-by: Serge Hallyn
    Signed-off-by: James Morris

    David Howells
     
  • Separate the task security context from task_struct. At this point, the
    security data is temporarily embedded in the task_struct with two pointers
    pointing to it.

    Note that the Alpha arch is altered as it refers to (E)UID and (E)GID in
    entry.S via asm-offsets.

    With comment fixes by Marc Dionne.

    Signed-off-by: David Howells
    Acked-by: James Morris
    Acked-by: Serge Hallyn
    Signed-off-by: James Morris

    David Howells
     
  • Wrap access to task credentials so that they can be separated more easily from
    the task_struct during the introduction of COW creds.

    Change most current->(|e|s|fs)[ug]id to current_(|e|s|fs)[ug]id().

    Change some task->e?[ug]id to task_e?[ug]id(). In some places it makes more
    sense to use RCU directly rather than a convenient wrapper; these will be
    addressed by later patches.
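
    A typical instance of the conversion looks like this (illustrative):

    /* before */
    if (current->euid != inode->i_uid)
            return -EPERM;

    /* after */
    if (current_euid() != inode->i_uid)
            return -EPERM;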

    Signed-off-by: David Howells
    Reviewed-by: James Morris
    Acked-by: Serge Hallyn
    Cc: Al Viro
    Cc: linux-audit@redhat.com
    Cc: containers@lists.linux-foundation.org
    Cc: linux-mm@kvack.org
    Signed-off-by: James Morris

    David Howells
     

07 Nov, 2008

1 commit

  • Move the migrate_prep outside the mmap_sem for the following system calls

    1. sys_move_pages
    2. sys_migrate_pages
    3. sys_mbind()

    It really does not matter when we flush the lru. The system is free to
    add pages onto the lru even during migration which will make the page
    migration either skip the page (mbind, migrate_pages) or return a busy
    state (move_pages).

    Fixes this lockdep warning (and potential deadlock):

    Some VM place has
    mmap_sem -> kevent_wq via lru_add_drain_all()

    net/core/dev.c::dev_ioctl() has
    rtnl_lock -> mmap_sem (*) the ioctl has copy_from_user() and it can do page fault.

    linkwatch_event has
    kevent_wq -> rtnl_lock
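
    The call order in sys_move_pages() therefore becomes (sketch):

    migrate_prep();                 /* lru_add_drain_all(): may take kevent_wq */

    down_read(&mm->mmap_sem);       /* only now take mmap_sem */
    err = do_move_page_to_node_array(mm, pm, flags & MPOL_MF_MOVE_ALL);
    up_read(&mm->mmap_sem);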

    Signed-off-by: Christoph Lameter
    Cc: KOSAKI Motohiro
    Reported-by: Heiko Carstens
    Cc: Nick Piggin
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

20 Oct, 2008

9 commits

  • This patch tries to make page->mapping NULL before
    mem_cgroup_uncharge_cache_page() is called.

    "page->mapping == NULL" is a good check for "whether the page is still
    in the radix-tree or not". This patch also adds a BUG_ON() to
    mem_cgroup_uncharge_cache_page().
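
    A sketch of the resulting helper, assuming the existing
    __mem_cgroup_uncharge_common() internals:

    void mem_cgroup_uncharge_cache_page(struct page *page)
    {
            /* the page must be unmapped and already off the radix-tree */
            VM_BUG_ON(page_mapped(page));
            VM_BUG_ON(page->mapping);
            __mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
    }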

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • To prepare the chunking, move the sys_move_pages() code that is used when
    nodes!=NULL into do_pages_move(). And rename do_move_pages() into
    do_move_page_to_node_array().

    Signed-off-by: Brice Goglin
    Acked-by: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brice Goglin
     
  • do_pages_stat() does not need any page_to_node entry for real. Just pass
    the pointers to the user-space page address array and to the user-space
    status array, and have do_pages_stat() traverse the former and fill the
    latter directly.

    Signed-off-by: Brice Goglin
    Acked-by: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brice Goglin
     
  • A patchset reworking sys_move_pages(). It removes the possibly large
    vmalloc by using multiple chunks when migrating large buffers. It also
    dramatically increases the throughput for large buffers since the lookup
    in new_page_node() is now limited to a single chunk, causing the quadratic
    complexity to have a much slower impact. There is no need to use any
    radix-tree-like structure to improve this lookup.

    sys_move_pages() duration on a 4-quadcore-Opteron 2347HE (1.9GHz),
    migrating between nodes #2 and #3:

        length    move_pages (us)    move_pages+patch (us)
          4kB                126                       98
         40kB                198                      168
        400kB                963                      937
          4MB              12503                    11930
         40MB             246867                    11848

    Patches #1 and #4 are the important ones:
    1) stop returning -ENOENT from sys_move_pages() if nothing got migrated
    2) don't vmalloc a huge page_to_node array for do_pages_stat()
    3) extract do_pages_move() out of sys_move_pages()
    4) rework do_pages_move() to work on page-sized chunks
    5) move_pages: no need to set pp->page to ZERO_PAGE(0) by default

    This patch:

    There is no point in returning -ENOENT from sys_move_pages() if all pages
    were already on the right node, while we return 0 if only one page was not.
    Most applications don't know where their pages are allocated, so it's not
    an error to try to migrate them anyway.

    Just return 0 and let the status array in user-space be checked if the
    application needs details.

    It will make the upcoming chunked-move_pages() support much easier.
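
    From user space (via libnuma's move_pages(2) wrapper), callers are
    expected to inspect the status array for per-page results; a sketch,
    with pages, nodes, status and count assumed set up:

    #include <numaif.h>
    #include <stdio.h>

    long rc = move_pages(0 /* self */, count, pages, nodes, status,
                         MPOL_MF_MOVE);
    if (rc < 0)
            perror("move_pages");
    else
            for (unsigned long i = 0; i < count; i++)
                    if (status[i] < 0)  /* per-page negative errno, e.g. -EBUSY */
                            fprintf(stderr, "page %lu: %d\n", i, status[i]);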

    Signed-off-by: Brice Goglin
    Acked-by: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brice Goglin
     
  • Make sure that mlocked pages also live on the unevictable LRU, so kswapd
    will not scan them over and over again.

    This is achieved through various strategies:

    1) add yet another page flag--PG_mlocked--to indicate that
    the page is locked for efficient testing in vmscan and,
    optionally, fault path. This allows early culling of
    unevictable pages, preventing them from getting to
    page_referenced()/try_to_unmap(). Also allows separate
    accounting of mlock'd pages, as Nick's original patch
    did.

    Note: Nick's original mlock patch used a PG_mlocked
    flag. I had removed this in favor of the PG_unevictable
    flag + an mlock_count [new page struct member]. I
    restored the PG_mlocked flag to eliminate the new
    count field.

    2) add the mlock/unevictable infrastructure to mm/mlock.c,
    with internal APIs in mm/internal.h (see the sketch
    after this list). This is a rework
    of Nick's original patch to these files, taking into
    account that mlocked pages are now kept on unevictable
    LRU list.

    3) update vmscan.c:page_evictable() to check PageMlocked()
    and, if vma passed in, the vm_flags. Note that the vma
    will only be passed in for new pages in the fault path;
    and then only if the "cull unevictable pages in fault
    path" patch is included.

    4) add try_to_unlock() to rmap.c to walk a page's rmap and
    ClearPageMlocked() if no other vmas have it mlocked.
    Reuses as much of try_to_unmap() as possible. This
    effectively replaces the use of one of the lru list links
    as an mlock count. If this mechanism lets pages in mlocked
    vmas leak through w/o PG_mlocked set [I don't know that it
    does], we should catch them later in try_to_unmap(). One
    hopes this will be rare, as it will be relatively expensive.
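
    A condensed sketch of the mm/mlock.c core from (2), under the
    assumptions above:

    void mlock_vma_page(struct page *page)
    {
            BUG_ON(!PageLocked(page));

            if (!TestSetPageMlocked(page)) {
                    inc_zone_page_state(page, NR_MLOCK);
                    count_vm_event(UNEVICTABLE_PGMLOCKED);
                    /* pull it off its LRU and let putback re-file it
                     * on the unevictable list */
                    if (!isolate_lru_page(page))
                            putback_lru_page(page);
            }
    }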

    Original mm/internal.h, mm/rmap.c and mm/mlock.c changes:
    Signed-off-by: Nick Piggin

    splitlru: introduce __get_user_pages():

    The new munlock processing needs GUP_FLAGS_IGNORE_VMA_PERMISSIONS,
    because the current get_user_pages() can't grab PROT_NONE pages and
    therefore PROT_NONE pages can't be munlocked.

    [akpm@linux-foundation.org: fix this for pagemap-pass-mm-into-pagewalkers.patch]
    [akpm@linux-foundation.org: untangle patch interdependencies]
    [akpm@linux-foundation.org: fix things after out-of-order merging]
    [hugh@veritas.com: fix page-flags mess]
    [lee.schermerhorn@hp.com: fix munlock page table walk - now requires 'mm']
    [kosaki.motohiro@jp.fujitsu.com: build fix]
    [kosaki.motohiro@jp.fujitsu.com: fix truncate race and several comments]
    [kosaki.motohiro@jp.fujitsu.com: splitlru: introduce __get_user_pages()]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Rik van Riel
    Signed-off-by: Lee Schermerhorn
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Matt Mackall
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • When the system contains lots of mlocked or otherwise unevictable pages,
    the pageout code (kswapd) can spend lots of time scanning over these
    pages. Worse still, the presence of lots of unevictable pages can confuse
    kswapd into thinking that more aggressive pageout modes are required,
    resulting in all kinds of bad behaviour.

    Infrastructure to manage pages excluded from reclaim--i.e., hidden from
    vmscan. Based on a patch by Larry Woodman of Red Hat. Reworked to
    maintain "unevictable" pages on a separate per-zone LRU list, to "hide"
    them from vmscan.

    Kosaki Motohiro added the support for the memory controller unevictable
    lru list.

    Pages on the unevictable list have both PG_unevictable and PG_lru set.
    Thus, PG_unevictable is analogous to and mutually exclusive with
    PG_active--it specifies which LRU list the page is on.

    The unevictable infrastructure is enabled by a new mm Kconfig option
    [CONFIG_]UNEVICTABLE_LRU.

    A new function 'page_evictable(page, vma)' in vmscan.c tests whether or
    not a page may be evictable. Subsequent patches will add the various
    !evictable tests. We'll want to keep these tests light-weight for use in
    shrink_active_list() and, possibly, the fault path.

    To avoid races between tasks putting pages [back] onto an LRU list and
    tasks that might be moving the page from non-evictable to evictable state,
    the new function 'putback_lru_page()' -- inverse to 'isolate_lru_page()'
    -- tests the "evictability" of a page after placing it on the LRU, before
    dropping the reference. If the page has become unevictable,
    putback_lru_page() will redo the 'putback', thus moving the page to the
    unevictable list. This way, we avoid "stranding" evictable pages on the
    unevictable list.
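
    A condensed sketch of that recheck, assuming the helpers introduced in
    this series:

    void putback_lru_page(struct page *page)
    {
    redo:
            ClearPageUnevictable(page);
            if (page_evictable(page, NULL))
                    lru_cache_add_lru(page, page_lru(page));
            else
                    add_page_to_unevictable_list(page);

            /* It may have become evictable while sitting on the
             * unevictable list: pull it back and redo the putback. */
            if (PageUnevictable(page) && page_evictable(page, NULL) &&
                !isolate_lru_page(page))
                    goto redo;

            put_page(page);         /* drop the isolate_lru_page() ref */
    }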

    [akpm@linux-foundation.org: fix fallout from out-of-order merge]
    [riel@redhat.com: fix UNEVICTABLE_LRU and !PROC_PAGE_MONITOR build]
    [nishimura@mxp.nes.nec.co.jp: remove redundant mapping check]
    [kosaki.motohiro@jp.fujitsu.com: unevictable-lru-infrastructure: putback_lru_page()/unevictable page handling rework]
    [kosaki.motohiro@jp.fujitsu.com: kill unnecessary lock_page() in vmscan.c]
    [kosaki.motohiro@jp.fujitsu.com: revert migration change of unevictable lru infrastructure]
    [kosaki.motohiro@jp.fujitsu.com: revert to unevictable-lru-infrastructure-kconfig-fix.patch]
    [kosaki.motohiro@jp.fujitsu.com: restore patch failure of vmstat-unevictable-and-mlocked-pages-vm-events.patch]
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Debugged-by: Benjamin Kidwell
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Define page_file_cache() function to answer the question:
    is page backed by a file?

    Originally part of Rik van Riel's split-lru patch. Extracted to make
    available for other, independent reclaim patches.

    Moved inline function to linux/mm_inline.h where it will be needed by
    subsequent "split LRU" and "noreclaim" patches.

    Unfortunately this needs to use a page flag, since the PG_swapbacked state
    needs to be preserved all the way to the point where the page is last
    removed from the LRU. Trying to derive the status from other info in the
    page resulted in wrong VM statistics in earlier split VM patchsets.

    The total number of page flags in use on a 32 bit machine after this patch
    is 19.
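
    The test itself is tiny; a sketch of the mm_inline.h helper described
    here:

    static inline int page_file_cache(struct page *page)
    {
            /* file-backed iff not swap-backed; PG_swapbacked stays set
             * until the page finally leaves the LRU */
            return !PageSwapBacked(page);
    }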

    [akpm@linux-foundation.org: fix up out-of-order merge fallout]
    [hugh@veritas.com: splitlru: shmem_getpage SetPageSwapBacked sooner]
    Signed-off-by: Rik van Riel
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: MinChan Kim
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Turn the pagevecs into an array just like the LRUs. This significantly
    cleans up the source code and reduces the size of the kernel by about 13kB
    after all the LRU lists have been created further down in the split VM
    patch series.

    Signed-off-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • On large memory systems, the VM can spend way too much time scanning
    through pages that it cannot (or should not) evict from memory. Not only
    does it use up CPU time, but it also provokes lock contention and can
    leave large systems under memory pressure in a catatonic state.

    This patch series improves VM scalability by:

    1) putting filesystem backed, swap backed and unevictable pages
    onto their own LRUs, so the system only scans the pages that it
    can/should evict from memory

    2) switching to two handed clock replacement for the anonymous LRUs,
    so the number of pages that need to be scanned when the system
    starts swapping is bound to a reasonable number

    3) keeping unevictable pages off the LRU completely, so the
    VM does not waste CPU time scanning them. ramfs, ramdisk,
    SHM_LOCKED shared memory segments and mlock()ed VMA pages
    are kept on the unevictable list.

    This patch:

    isolate_lru_page logically belongs in vmscan.c rather than in migrate.c.

    It is tough, because we don't need that function without memory migration
    so there is a valid argument to have it in migrate.c. However a
    subsequent patch needs to make use of it in the core mm, so we can happily
    move it to vmscan.c.

    Also, make the function a little more generic by not requiring that it
    adds an isolated page to a given list. Callers can do that.
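
    Callers then handle the list themselves; e.g. the migration-side usage
    becomes (sketch):

    err = isolate_lru_page(page);   /* 0 on success; takes a page ref */
    if (!err)
            list_add_tail(&page->lru, &pagelist);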

    Note that we now have '__isolate_lru_page()', that does
    something quite different, visible outside of vmscan.c
    for use with memory controller. Methinks we need to
    rationalize these names/purposes. --lts

    [akpm@linux-foundation.org: fix mm/memory_hotplug.c build]
    Signed-off-by: Nick Piggin
    Signed-off-by: Rik van Riel
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

05 Aug, 2008

1 commit

  • Converting page lock to new locking bitops requires a change of page flag
    operation naming, so we might as well convert it to something nicer
    (!TestSetPageLocked_Lock => trylock_page, SetPageLocked => set_page_locked).

    This also facilitates lockdep checking of the page lock.
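
    A sketch of the renamed operation on top of the new locking bitops:

    static inline int trylock_page(struct page *page)
    {
            return !test_and_set_bit_lock(PG_locked, &page->flags);
    }

    /* callers change from "if (TestSetPageLocked(page))" to: */
    if (!trylock_page(page))
            goto out_busy;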

    Signed-off-by: Nick Piggin
    Acked-by: KOSAKI Motohiro
    Acked-by: Peter Zijlstra
    Acked-by: Andrew Morton
    Acked-by: Benjamin Herrenschmidt
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

27 Jul, 2008

2 commits

    mapping->tree_lock has no read lockers. Convert the lock from an rwlock
    to a spinlock.
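
    Update-side code changes mechanically (sketch):

    spin_lock_irq(&mapping->tree_lock);     /* was write_lock_irq() */
    radix_tree_delete(&mapping->page_tree, page->index);
    spin_unlock_irq(&mapping->tree_lock);   /* was write_unlock_irq() */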

    Signed-off-by: Nick Piggin
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Hugh Dickins
    Cc: "Paul E. McKenney"
    Reviewed-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • If we can be sure that elevating the page_count on a pagecache page will
    pin it, we can speculatively run this operation, and subsequently check to
    see if we hit the right page rather than relying on holding a lock or
    otherwise pinning a reference to the page.

    This can be done if get_page/put_page behaves consistently throughout the
    whole tree (ie. if we "get" the page after it has been used for something
    else, we must be able to free it with a put_page).

    Actually, there is a period where the count behaves differently: when the
    page is free or if it is a constituent page of a compound page. We need
    an atomic_inc_not_zero operation to ensure we don't try to grab the page
    in either case.

    This patch introduces the core locking protocol to the pagecache (ie.
    adds page_cache_get_speculative, and tweaks some update-side code to make
    it work).

    Thanks to Hugh for pointing out an improvement to the algorithm setting
    page_count to zero when we have control of all references, in order to
    hold off speculative getters.
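
    The lockless lookup then takes this shape (condensed sketch; the real
    find_get_page() goes through radix_tree_lookup_slot()):

    rcu_read_lock();
    repeat:
    page = radix_tree_lookup(&mapping->page_tree, offset);
    if (page) {
            if (!page_cache_get_speculative(page))
                    goto repeat;    /* lost the race with a free */

            /* re-check that we pinned the page we looked up */
            if (unlikely(page != radix_tree_lookup(&mapping->page_tree,
                                                   offset))) {
                    page_cache_release(page);
                    goto repeat;
            }
    }
    rcu_read_unlock();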

    [kamezawa.hiroyu@jp.fujitsu.com: fix migration_entry_wait()]
    [hugh@veritas.com: fix add_to_page_cache]
    [akpm@linux-foundation.org: repair a comment]
    Signed-off-by: Nick Piggin
    Cc: Jeff Garzik
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Hugh Dickins
    Cc: "Paul E. McKenney"
    Reviewed-by: Peter Zijlstra
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Hugh Dickins
    Acked-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

26 Jul, 2008

2 commits

  • memcg: performance improvements

    Patch Description
    1/5 ... remove refcnt from page_cgroup patch (shmem handling is fixed)
    2/5 ... swapcache handling patch
    3/5 ... add helper function for shmem's memory reclaim patch
    4/5 ... optimize by likely/unlikely patch
    5/5 ... remove redundant check patch (shmem handling is fixed.)

    Unix bench results:

    == 2.6.26-rc2-mm1 + memory resource controller
    Execl Throughput 2915.4 lps (29.6 secs, 3 samples)
    C Compiler Throughput 1019.3 lpm (60.0 secs, 3 samples)
    Shell Scripts (1 concurrent) 5796.0 lpm (60.0 secs, 3 samples)
    Shell Scripts (8 concurrent) 1097.7 lpm (60.0 secs, 3 samples)
    Shell Scripts (16 concurrent) 565.3 lpm (60.0 secs, 3 samples)
    File Read 1024 bufsize 2000 maxblocks 1022128.0 KBps (30.0 secs, 3 samples)
    File Write 1024 bufsize 2000 maxblocks 544057.0 KBps (30.0 secs, 3 samples)
    File Copy 1024 bufsize 2000 maxblocks 346481.0 KBps (30.0 secs, 3 samples)
    File Read 256 bufsize 500 maxblocks 319325.0 KBps (30.0 secs, 3 samples)
    File Write 256 bufsize 500 maxblocks 148788.0 KBps (30.0 secs, 3 samples)
    File Copy 256 bufsize 500 maxblocks 99051.0 KBps (30.0 secs, 3 samples)
    File Read 4096 bufsize 8000 maxblocks 2058917.0 KBps (30.0 secs, 3 samples)
    File Write 4096 bufsize 8000 maxblocks 1606109.0 KBps (30.0 secs, 3 samples)
    File Copy 4096 bufsize 8000 maxblocks 854789.0 KBps (30.0 secs, 3 samples)
    Dc: sqrt(2) to 99 decimal places 126145.2 lpm (30.0 secs, 3 samples)

    INDEX VALUES
    TEST                                     BASELINE     RESULT    INDEX

    Execl Throughput                             43.0     2915.4    678.0
    File Copy 1024 bufsize 2000 maxblocks      3960.0   346481.0    875.0
    File Copy 256 bufsize 500 maxblocks        1655.0    99051.0    598.5
    File Copy 4096 bufsize 8000 maxblocks      5800.0   854789.0   1473.8
    Shell Scripts (8 concurrent)                  6.0     1097.7   1829.5
                                                                =========
    FINAL SCORE                                                     991.3

    == 2.6.26-rc2-mm1 + this set ==
    Execl Throughput 3012.9 lps (29.9 secs, 3 samples)
    C Compiler Throughput 981.0 lpm (60.0 secs, 3 samples)
    Shell Scripts (1 concurrent) 5872.0 lpm (60.0 secs, 3 samples)
    Shell Scripts (8 concurrent) 1120.3 lpm (60.0 secs, 3 samples)
    Shell Scripts (16 concurrent) 578.0 lpm (60.0 secs, 3 samples)
    File Read 1024 bufsize 2000 maxblocks 1003993.0 KBps (30.0 secs, 3 samples)
    File Write 1024 bufsize 2000 maxblocks 550452.0 KBps (30.0 secs, 3 samples)
    File Copy 1024 bufsize 2000 maxblocks 347159.0 KBps (30.0 secs, 3 samples)
    File Read 256 bufsize 500 maxblocks 314644.0 KBps (30.0 secs, 3 samples)
    File Write 256 bufsize 500 maxblocks 151852.0 KBps (30.0 secs, 3 samples)
    File Copy 256 bufsize 500 maxblocks 101000.0 KBps (30.0 secs, 3 samples)
    File Read 4096 bufsize 8000 maxblocks 2033256.0 KBps (30.0 secs, 3 samples)
    File Write 4096 bufsize 8000 maxblocks 1611814.0 KBps (30.0 secs, 3 samples)
    File Copy 4096 bufsize 8000 maxblocks 847979.0 KBps (30.0 secs, 3 samples)
    Dc: sqrt(2) to 99 decimal places 128148.7 lpm (30.0 secs, 3 samples)

    INDEX VALUES
    TEST                                     BASELINE     RESULT    INDEX

    Execl Throughput                             43.0     3012.9    700.7
    File Copy 1024 bufsize 2000 maxblocks      3960.0   347159.0    876.7
    File Copy 256 bufsize 500 maxblocks        1655.0   101000.0    610.3
    File Copy 4096 bufsize 8000 maxblocks      5800.0   847979.0   1462.0
    Shell Scripts (8 concurrent)                  6.0     1120.3   1867.2
                                                                =========
    FINAL SCORE                                                    1004.6

    This patch:

    Remove refcnt from page_cgroup().

    After this,

    * A page is charged only when !page_mapped() && no page_cgroup is
      assigned:
        * an anon page is newly mapped, or
        * a file page is added to mapping->tree.

    * A page is uncharged only when:
        * an anon page is fully unmapped, or
        * a file page is removed from the LRU.

    There is no change in behavior from the user's point of view.

    This patch also removes unnecessary calls in rmap.c which were used only
    for refcnt management.

    [akpm@linux-foundation.org: fix warning]
    [hugh@veritas.com: fix shmem_unuse_inode charging]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Cc: Li Zefan
    Cc: Hugh Dickins
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Cc: David Rientjes
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This patch changes page migration under the memory controller to use a
    different algorithm (thanks to Christoph for the new idea).

    Before:
    - page_cgroup is migrated from an old page to a new page.
    After:
    - a new page is accounted; page_cgroup is not reused.

    Pros:

    - We can avoid complicated lock dependencies and races in migration.

    Cons:

    - new param to mem_cgroup_charge_common().

    - mem_cgroup_getref() is added for handling ref_cnt ping-pong.

    This version simplifies the complicated lock dependencies of page
    migration under the memory resource controller.

    The new refcnt sequence is as follows:

    a mapped page:
    prepare_migration()  ..... +1 to NEW page
    try_to_unmap()       ..... all refs to OLD page are gone.
    move_pages()         ..... +1 to NEW page if page cache.
    remap...             ..... all refs from *map* are added to NEW one.
    end_migration()      ..... -1 to NEW page.

    page's mapcount + (page_is_cache) refs are added to NEW one.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Cc: Li Zefan
    Cc: YAMAMOTO Takashi
    Cc: Hugh Dickins
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

25 Jul, 2008

2 commits

  • We'd like to support CONFIG_MEMORY_HOTREMOVE on s390, which depends on
    CONFIG_MIGRATION. So far, CONFIG_MIGRATION is only available with NUMA
    support.

    This patch makes CONFIG_MIGRATION selectable for architectures that define
    ARCH_ENABLE_MEMORY_HOTREMOVE. When MIGRATION is enabled w/o NUMA, the
    kernel won't compile because migrate_vmas() does not know about
    vm_ops->migrate() and vma_migratable() does not know about policy_zone.
    To fix this, those two functions can be restricted to '#ifdef CONFIG_NUMA'
    because they are not being used w/o NUMA. vma_migratable() is moved over
    from migrate.h to mempolicy.h.
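
    With the move, the helper is visible only on NUMA builds; a sketch of
    the mempolicy.h version:

    #ifdef CONFIG_NUMA
    static inline int vma_migratable(struct vm_area_struct *vma)
    {
            if (vma->vm_flags & (VM_IO | VM_HUGETLB | VM_PFNMAP | VM_RESERVED))
                    return 0;
            /* migration allocates from the highest zone; if the mapping
             * cannot use it, node-to-node migration is not possible */
            if (vma->vm_file &&
                gfp_zone(mapping_gfp_mask(vma->vm_file->f_mapping))
                                                        < policy_zone)
                    return 0;
            return 1;
    }
    #endif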

    [kosaki.motohiro@jp.fujitsu.com: build fix]
    Acked-by: Christoph Lameter
    Signed-off-by: Gerald Schaefer
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     
  • Every file should include the headers containing the externs for its
    global functions (in this case for sys_move_pages()).

    Signed-off-by: Adrian Bunk
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     

05 Jul, 2008

1 commit

  • Remove all clameter@sgi.com addresses from the kernel tree since they will
    become invalid on June 27th. Change my maintainer email address for the
    slab allocators to cl@linux-foundation.org (which will be the new email
    address for the future).

    Signed-off-by: Christoph Lameter
    Signed-off-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Stephen Rothwell
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

21 Jun, 2008

1 commit

  • KAMEZAWA Hiroyuki and Oleg Nesterov point out that since the commit
    557ed1fa2620dc119adb86b34c614e152a629a80 ("remove ZERO_PAGE") removed
    the ZERO_PAGE from the VM mappings, any users of get_user_pages() will
    generally now populate the VM with real empty pages needlessly.

    We used to get the ZERO_PAGE when we did the "handle_mm_fault()", but
    since fault handling no longer uses ZERO_PAGE for new anonymous pages,
    we now need to handle that special case in follow_page() instead.

    In particular, the removal of ZERO_PAGE effectively removed the core
    file writing optimization where we would skip writing pages that had not
    been populated at all, and increased memory pressure a lot by allocating
    all those useless newly zeroed pages.

    This reinstates the optimization by making the unmapped PTE case the
    same as for a non-existent page table, which already did this correctly.

    While at it, this also fixes the XIP case for follow_page(), where the
    caller could not differentiate between the case of a page that simply
    could not be used (because it had no "struct page" associated with it)
    and a page that just wasn't mapped.

    We do that by simply returning an error pointer for pages that could not
    be turned into a "struct page *". The error is arbitrarily picked to be
    EFAULT, since that was what get_user_pages() already used for the
    equivalent IO-mapped page case.
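
    Callers such as get_user_pages() then distinguish the outcomes via
    IS_ERR() (sketch):

    page = follow_page(vma, start, foll_flags);
    if (IS_ERR(page))
            /* e.g. -EFAULT: a PFN/XIP page with no struct page */
            return i ? i : PTR_ERR(page);
    if (!page) {
            /* simply unmapped: fault it in (or reuse ZERO_PAGE) */
    }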

    [ Also removed an impossible test for pte_offset_map_lock() failing:
    that's not how that function works ]

    Acked-by: Oleg Nesterov
    Acked-by: Nick Piggin
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Andrew Morton
    Cc: Ingo Molnar
    Cc: Roland McGrath
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

30 Apr, 2008

1 commit

  • KAMEZAWA Hiroyuki found a warning message in the buffer dirtying code that
    is coming from page migration caller.

    WARNING: at fs/buffer.c:720 __set_page_dirty+0x330/0x360()
    Call Trace:
    [] show_stack+0x80/0xa0
    [] dump_stack+0x30/0x60
    [] warn_on_slowpath+0x90/0xe0
    [] __set_page_dirty+0x330/0x360
    [] __set_page_dirty_buffers+0xd0/0x280
    [] set_page_dirty+0xc0/0x260
    [] migrate_page_copy+0x5d0/0x5e0
    [] buffer_migrate_page+0x2e0/0x3c0
    [] migrate_pages+0x770/0xe00

    What was happening is that migrate_page_copy wants to transfer the PG_dirty
    bit from old page to new page, so what it would do is set_page_dirty(newpage).
    However set_page_dirty() is used to set the entire page dirty, whereas in
    this case only part of the page was dirty, and it also was not uptodate.

    Marking the whole page dirty with set_page_dirty would lead to corruption or
    unresolvable conditions -- a dirty && !uptodate page and dirty && !uptodate
    buffers.

    Possibly we could just ClearPageDirty(oldpage); SetPageDirty(newpage);
    however in the interests of keeping the change minimal...
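
    The minimal change keeps the dirty-bit transfer but bypasses the buffer
    dirtying; a sketch of the migrate_page_copy() hunk:

    if (PageDirty(page)) {
            clear_page_dirty_for_io(page);
            /* mark the new page dirty without touching its buffers,
             * which may be partially dirty and not uptodate */
            __set_page_dirty_nobuffers(newpage);
    }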

    Signed-off-by: Nick Piggin
    Tested-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

05 Mar, 2008

1 commit

  • Page migration gave me free_hot_cold_page's VM_BUG_ON page->page_cgroup.
    remove_migration_pte was calling mem_cgroup_charge on the new page whenever it
    found a swap pte, before it had determined it to be a migration entry. That
    left a surplus reference count on the page_cgroup, so it was still attached
    when the page was later freed.

    Move that mem_cgroup_charge down to where we're sure it's a migration entry.
    We were already under i_mmap_lock or anon_vma->lock, so its GFP_KERNEL was
    already inappropriate: change that to GFP_ATOMIC.

    It's essential that remove_migration_pte removes all the migration entries,
    other crashes follow if not. So proceed even when the charge fails: normally
    it cannot, but after a mem_cgroup_force_empty it might - comment in the code.
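
    A sketch of the reordering in remove_migration_pte():

    entry = pte_to_swp_entry(pte);
    if (!is_migration_entry(entry) || migration_entry_to_page(entry) != old)
            goto out;

    /* Safe to charge only now that we know it is our migration entry.
     * GFP_ATOMIC: we hold i_mmap_lock or anon_vma->lock here.
     * Ignore failure: every migration entry must still be removed. */
    mem_cgroup_charge(new, mm, GFP_ATOMIC);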

    Signed-off-by: Hugh Dickins
    Cc: David Rientjes
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hirokazu Takahashi
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

08 Feb, 2008

3 commits

  • While using the memory control cgroup, page migration under it works as
    follows.
    ==
    1. uncharge all refs at try_to_unmap().
    2. charge again at remove_migration_ptes().
    ==
    This is simple but has the following problems.
    ==
    The page is uncharged and charged back again if *mapped*.
    - This means that the cgroup before migration can differ from the one
    after migration.
    - If the page is not mapped but charged as page cache, the charge is
    just ignored (because it is not mapped, it will not be uncharged
    before migration). This is a memory leak.
    ==
    This patch keeps the memory cgroup across page migration by holding one
    extra refcnt during it. Three functions are added:

    mem_cgroup_prepare_migration() --- increase refcnt of page->page_cgroup
    mem_cgroup_end_migration() --- decrease refcnt of page->page_cgroup
    mem_cgroup_page_migration() --- copy page->page_cgroup from old page to
    new page.

    During migration
    - old page is under PG_locked.
    - new page is under PG_locked, too.
    - neither the old page nor the new page is on the LRU.

    These three facts guarantee that page_cgroup migration has no race.

    Tested and worked well in x86_64/fake-NUMA box.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Nick Piggin pointed out that swap cache and page cache addition routines
    could be called from non GFP_KERNEL contexts. This patch makes the
    charging routine aware of the gfp context. Charging might fail if the
    cgroup is over its limit, in which case a suitable error is returned.

    This patch was tested on a Powerpc box. I am still looking at being able
    to test the path, through which allocations happen in non GFP_KERNEL
    contexts.

    [kamezawa.hiroyu@jp.fujitsu.com: problem with ZONE_MOVABLE]
    Signed-off-by: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • Add the accounting hooks. The accounting is carried out for RSS and Page
    Cache (unmapped) pages. There is now a common limit and accounting for both.
    The RSS accounting is accounted at page_add_*_rmap() and page_remove_rmap()
    time. Page cache is accounted at add_to_page_cache(),
    __delete_from_page_cache(). Swap cache is also accounted for.

    Each page's page_cgroup is protected with the last bit of the
    page_cgroup pointer, this makes handling of race conditions involving
    simultaneous mappings of a page easier. A reference count is kept in the
    page_cgroup to deal with cases where a page might be unmapped from the RSS
    of all tasks, but still lives in the page cache.
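
    A sketch of the pointer-bit locking, assuming page->page_cgroup is kept
    as an unsigned long:

    /* the lowest bit of page->page_cgroup doubles as a lock bit */
    #define PAGE_CGROUP_LOCK_BIT    0x0

    static void lock_page_cgroup(struct page *page)
    {
            bit_spin_lock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
    }

    static void unlock_page_cgroup(struct page *page)
    {
            bit_spin_unlock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
    }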

    Credits go to Vaidyanathan Srinivasan for helping with reference counting work
    of the page cgroup. Almost all of the page cache accounting code has help
    from Vaidyanathan Srinivasan.

    [hugh@veritas.com: fix swapoff breakage]
    [akpm@linux-foundation.org: fix locking]
    Signed-off-by: Vaidyanathan Srinivasan
    Signed-off-by: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     

06 Feb, 2008

2 commits

  • An orphaned page might have fs-private metadata left over from
    truncation. As the page has no mapping, page migration refuses to
    migrate it. It appears such a page is only freed by page reclaim, and
    if the zone watermark is low the page is never freed, so migration
    always fails. I thought we could free the metadata so such a page can
    be freed during migration, making migration more reliable.
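
    A sketch of the resulting check in the migration path (label name
    illustrative):

    if (!page->mapping) {
            if (!PageAnon(page) && PagePrivate(page))
                    /* orphaned, truncated page: drop its fs-private
                     * buffers so the page itself can be freed */
                    try_to_free_buffers(page);
            goto rcu_unlock;
    }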

    [akpm@linux-foundation.org: go direct to try_to_free_buffers()]
    Signed-off-by: Shaohua Li
    Acked-by: Nick Piggin
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • Move is_swap_pte helper function to swapops.h for use by pagemap code
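
    The helper being moved is roughly:

    static inline int is_swap_pte(pte_t pte)
    {
            return !pte_none(pte) && !pte_present(pte) && !pte_file(pte);
    }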

    Signed-off-by: Matt Mackall
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Mackall
     

20 Oct, 2007

2 commits

  • Typo fixes retrun -> return

    Signed-off-by: Gabriel Craciunescu
    Signed-off-by: Adrian Bunk

    Gabriel Craciunescu
     
  • The find_task_by_something functions are a set of macros used to find a
    task by pid, depending on what kind of pid is proposed - global or
    virtual. All of them are wrappers around the most generic one -
    find_task_by_pid_type_ns() - and just substitute some args for it.

    It turned out that dereferencing the current->nsproxy->pid_ns
    construction and pushing one more argument on the stack inline causes
    the kernel text size to grow.

    This patch moves all this stuff out-of-line into kernel/pid.c. Together
    with the next patch it saves a bit less than 400 bytes from the .text
    section.
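
    A sketch of one of the now out-of-line wrappers in kernel/pid.c:

    struct task_struct *find_task_by_vpid(pid_t vnr)
    {
            return find_task_by_pid_type_ns(PIDTYPE_PID, vnr,
                            current->nsproxy->pid_ns);
    }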

    Signed-off-by: Pavel Emelyanov
    Cc: Sukadev Bhattiprolu
    Cc: Oleg Nesterov
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov