20 Oct, 2008

40 commits

  • This patch implements a new freezer subsystem in the control groups
    framework. It provides a way to stop and resume execution of all tasks in
    a cgroup by writing in the cgroup filesystem.

    The freezer subsystem in the container filesystem defines a file named
    freezer.state. Writing "FROZEN" to the state file will freeze all tasks
    in the cgroup. Subsequently writing "RUNNING" will unfreeze the tasks in
    the cgroup. Reading will return the current state.

    * Examples of usage :

    # mkdir /containers/freezer
    # mount -t cgroup -ofreezer freezer /containers
    # mkdir /containers/0
    # echo $some_pid > /containers/0/tasks

    to get status of the freezer subsystem :

    # cat /containers/0/freezer.state
    RUNNING

    to freeze all tasks in the container :

    # echo FROZEN > /containers/0/freezer.state
    # cat /containers/0/freezer.state
    FREEZING
    # cat /containers/0/freezer.state
    FROZEN

    to unfreeze all tasks in the container :

    # echo RUNNING > /containers/0/freezer.state
    # cat /containers/0/freezer.state
    RUNNING

    This is the basic mechanism, which should do the right thing for user
    space tasks in a simple scenario.

    It's important to note that freezing can be incomplete. In that case we
    return EBUSY. This means that some tasks in the cgroup are busy doing
    something that prevents us from completely freezing the cgroup at this
    time. After EBUSY, the cgroup will remain partially frozen -- reflected
    by freezer.state reporting "FREEZING" when read. The state will remain
    "FREEZING" until one of these things happens:

    1) Userspace cancels the freezing operation by writing "RUNNING" to
    the freezer.state file
    2) Userspace retries the freezing operation by writing "FROZEN" to
    the freezer.state file (writing "FREEZING" is not legal
    and returns EIO)
    3) The tasks that blocked the cgroup from entering the "FROZEN"
    state disappear from the cgroup's set of tasks.
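
The retry behaviour described above can be sketched from user space. This is a minimal sketch and not part of the patch: the helper names `freezer_write_state`, `freezer_read_state` and `freeze_cgroup`, the retry count and the back-off delay are all assumptions. Only the `freezer.state` file, its "FROZEN"/"FREEZING"/"RUNNING" values and the EBUSY result come from the description above.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Write a state string ("FROZEN" or "RUNNING") to a freezer.state file.
 * Returns 0 on success or a negative errno value (e.g. -EBUSY). */
int freezer_write_state(const char *path, const char *state)
{
    int fd = open(path, O_WRONLY | O_TRUNC);
    if (fd < 0)
        return -errno;

    ssize_t n = write(fd, state, strlen(state));
    int err = (n < 0) ? -errno : 0;
    close(fd);
    return err;
}

/* Read the current state into buf (len >= 1), stripping the newline. */
int freezer_read_state(const char *path, char *buf, size_t len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -errno;

    ssize_t n = read(fd, buf, len - 1);
    int err = (n < 0) ? -errno : 0;
    close(fd);
    if (err)
        return err;

    buf[n] = '\0';
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

/* Hypothetical policy: retry while the freeze is incomplete, up to
 * max_tries times, then cancel the partial freeze by writing "RUNNING". */
int freeze_cgroup(const char *path, int max_tries)
{
    char state[32];
    struct timespec backoff = { 0, 10 * 1000 * 1000 }; /* 10 ms, arbitrary */

    for (int i = 0; i < max_tries; i++) {
        int err = freezer_write_state(path, "FROZEN");
        if (err && err != -EBUSY)
            return err;
        if (freezer_read_state(path, state, sizeof(state)) == 0 &&
            strcmp(state, "FROZEN") == 0)
            return 0;
        nanosleep(&backoff, NULL);
    }
    freezer_write_state(path, "RUNNING");
    return -EBUSY;
}
```

A caller would pass something like `/containers/0/freezer.state` as `path`. Cancelling after the last retry is only one possible policy; a caller could equally leave the cgroup in "FREEZING" and wait for the blocking tasks to leave.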

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: export thaw_process]
    Signed-off-by: Cedric Le Goater
    Signed-off-by: Matt Helsley
    Acked-by: Serge E. Hallyn
    Tested-by: Matt Helsley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Helsley
     
  • Now that the TIF_FREEZE flag is available in all architectures, extract
    the refrigerator() and freeze_task() from kernel/power/process.c and make
    it available to all.

    The refrigerator() can now be used in a control group subsystem
    implementing a control group freezer.

    Signed-off-by: Cedric Le Goater
    Signed-off-by: Matt Helsley
    Acked-by: Serge E. Hallyn
    Tested-by: Matt Helsley
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Helsley
     
  • This patch series introduces a cgroup subsystem that utilizes the swsusp
    freezer to freeze a group of tasks. It's immediately useful for batch job
    management scripts. It should also be useful in the future for
    implementing container checkpoint/restart.

    The freezer subsystem in the container filesystem defines a cgroup file
    named freezer.state. Reading freezer.state will return the current state
    of the cgroup. Writing "FROZEN" to the state file will freeze all tasks
    in the cgroup. Subsequently writing "RUNNING" will unfreeze the tasks in
    the cgroup.

    * Examples of usage :

    # mkdir /containers/freezer
    # mount -t cgroup -ofreezer freezer /containers
    # mkdir /containers/0
    # echo $some_pid > /containers/0/tasks

    to get status of the freezer subsystem :

    # cat /containers/0/freezer.state
    RUNNING

    to freeze all tasks in the container :

    # echo FROZEN > /containers/0/freezer.state
    # cat /containers/0/freezer.state
    FREEZING
    # cat /containers/0/freezer.state
    FROZEN

    to unfreeze all tasks in the container :

    # echo RUNNING > /containers/0/freezer.state
    # cat /containers/0/freezer.state
    RUNNING

    This patch:

    The first step in making the refrigerator() available to all
    architectures, even for those without power management.

    The purpose of such a change is to be able to use the refrigerator() in a
    new control group subsystem which will implement a control group freezer.

    [akpm@linux-foundation.org: fix sparc]
    Signed-off-by: Cedric Le Goater
    Signed-off-by: Matt Helsley
    Acked-by: Pavel Machek
    Acked-by: Serge E. Hallyn
    Acked-by: Rafael J. Wysocki
    Acked-by: Nigel Cunningham
    Tested-by: Matt Helsley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Helsley
     
  • To prepare the chunking, move the sys_move_pages() code that is used when
    nodes!=NULL into do_pages_move(). And rename do_move_pages() into
    do_move_page_to_node_array().

    Signed-off-by: Brice Goglin
    Acked-by: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brice Goglin
     
  • do_pages_stat() does not need any page_to_node entry for real. Just pass
    the pointers to the user-space page address array and to the user-space
    status array, and have do_pages_stat() traverse the former and fill the
    latter directly.

    Signed-off-by: Brice Goglin
    Acked-by: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brice Goglin
     
  • A patchset reworking sys_move_pages(). It removes the possibly large
    vmalloc by using multiple chunks when migrating large buffers. It also
    dramatically increases the throughput for large buffers since the lookup
    in new_page_node() is now limited to a single chunk, causing the quadratic
    complexity to have a much smaller impact. There is no need to use any
    radix-tree-like structure to improve this lookup.

    sys_move_pages() duration on a 4-quadcore-opteron 2347HE (1.9GHz),
    migrating between nodes #2 and #3:

    length    move_pages (us)    move_pages+patch (us)
    4kB                   126                       98
    40kB                  198                      168
    400kB                 963                      937
    4MB                 12503                    11930
    40MB               246867                    11848
    Patches #1 and #4 are the important ones:
    1) stop returning -ENOENT from sys_move_pages() if nothing got migrated
    2) don't vmalloc a huge page_to_node array for do_pages_stat()
    3) extract do_pages_move() out of sys_move_pages()
    4) rework do_pages_move() to work on page_sized chunks
    5) move_pages: no need to set pp->page to ZERO_PAGE(0) by default
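
Why chunking tames the quadratic lookup can be shown with a little arithmetic. The sketch below is a user-space model, not kernel code: it assumes each page lookup scans linearly through the entries of its own chunk, and the function name `lookup_cost` is hypothetical.

```c
#include <stddef.h>

/* Worst-case number of array entries visited when every one of nr_pages
 * lookups scans linearly through the chunk (of at most chunk_len
 * entries) that contains it. */
unsigned long long lookup_cost(size_t nr_pages, size_t chunk_len)
{
    unsigned long long cost = 0;

    for (size_t start = 0; start < nr_pages; start += chunk_len) {
        size_t len = nr_pages - start;

        if (len > chunk_len)
            len = chunk_len;
        /* len lookups, each visiting up to len entries */
        cost += (unsigned long long)len * len;
    }
    return cost;
}
```

With a single chunk the cost grows as the square of the buffer length; with a fixed chunk length it grows linearly in the buffer length, which matches the shape of the 40MB row in the table.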

    This patch:

    There is no point in returning -ENOENT from sys_move_pages() if all pages
    were already on the right node, while we return 0 if only 1 page was not.
    Most applications don't know where their pages are allocated, so it's not
    an error to try to migrate them anyway.

    Just return 0 and let the status array in user-space be checked if the
    application needs details.
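
An application that does need details can walk the status array itself once the call has returned 0. A minimal sketch, assuming the usual move_pages() convention that a status entry is either a node number (zero or positive) or a negative errno value; the helper name `count_unmoved` is hypothetical.

```c
#include <stddef.h>

/* Count the pages whose status entry is a negative errno value rather
 * than the number of the node the page now resides on. */
size_t count_unmoved(const int *status, size_t nr_pages)
{
    size_t unmoved = 0;

    for (size_t i = 0; i < nr_pages; i++) {
        if (status[i] < 0)
            unmoved++;
    }
    return unmoved;
}
```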

    It will make the upcoming chunked-move_pages() support much easier.

    Signed-off-by: Brice Goglin
    Acked-by: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brice Goglin
     
  • During hotplug memory remove, memory regions should be released in
    PAGES_PER_SECTION-sized chunks. This mirrors the code in add_memory, where
    resources are requested in PAGES_PER_SECTION-sized chunks.

    Attempting to release the entire memory region fails because there is not
    a single resource for the total number of pages being removed. Instead
    the resources for the pages are split in PAGES_PER_SECTION size chunks as
    requested during memory add.

    Signed-off-by: Nathan Fontenot
    Signed-off-by: Badari Pulavarty
    Acked-by: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nathan Fontenot
     
  • The current documentation of dirty_ratio and dirty_background_ratio is a
    bit misleading.

    In the documentation we say that they are "a percentage of total system
    memory", but the current page writeback policy instead applies the
    percentages to the dirtyable memory, that is, free pages + reclaimable
    pages.

    Better to be more explicit to clarify this concept.
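
A worked example makes the difference concrete. The function below is an illustrative sketch of the arithmetic only, not the kernel's writeback code; its name and parameters are assumptions.

```c
/* Dirty limit in pages: the ratio is applied to dirtyable memory
 * (free pages + reclaimable pages), not to total system memory. */
unsigned long dirty_limit_pages(unsigned long free_pages,
                                unsigned long reclaimable_pages,
                                unsigned int ratio_percent)
{
    unsigned long dirtyable = free_pages + reclaimable_pages;

    return dirtyable * ratio_percent / 100;
}
```

On a hypothetical system with 8000 pages in total, of which 1000 are free and 3000 reclaimable, a ratio of 10 gives a limit of 400 pages, not the 800 pages a reader of the old wording might expect.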

    Signed-off-by: Andrea Righi
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Righi
     
  • This attribute just has a write operation.

    [akpm@linux-foundation.org: use S_IWUSR as suggested by Randy]
    Signed-off-by: Shaohua Li
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • This replaces zone->lru_lock in setup_per_zone_pages_min() with zone->lock.
    There seems to be no need for the lru_lock anymore, but there is a need for
    zone->lock instead, because that function may call move_freepages() via
    setup_zone_migrate_reserve().

    Signed-off-by: Gerald Schaefer
    Acked-by: KAMEZAWA Hiroyuki
    Tested-by: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     
  • Presently hugepage doesn't use zero page at all because zero page is only
    used for coredumping and hugepage can't core dump.

    However we have now implemented hugepage coredumping. Therefore we should
    implement the zero page of hugepage.

    Implementation note:

    o Why do we only check VM_SHARED for zero page?
    Normal pages are checked as follows:

    static inline int use_zero_page(struct vm_area_struct *vma)
    {
            if (vma->vm_flags & (VM_LOCKED | VM_SHARED))
                    return 0;

            return !vma->vm_ops || !vma->vm_ops->fault;
    }

    First, hugepages are never mlock()ed. We aren't concerned with VM_LOCKED.

    Second, hugetlbfs is a pseudo filesystem, not a real filesystem and it
    doesn't have any file backing. Thus ops->fault checking is meaningless.

    o Why don't we use zero page if !pte?

    !pte indicates that {pud, pmd} doesn't exist or that some error happened.
    So we shouldn't return the zero page if any error occurred.

    Signed-off-by: KOSAKI Motohiro
    Cc: Adam Litke
    Cc: Hugh Dickins
    Cc: Kawai Hidehiro
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Presently hugepage's vma has a VM_RESERVED flag in order not to be
    swapped. But a VM_RESERVED vma isn't core dumped because this flag is
    often used for some kernel vmas (e.g. vmalloc, sound related).

    Thus hugepages are never dumped and can't be debugged easily. Many
    developers want hugepages to be included in core dumps.

    However, we can't read a generic VM_RESERVED area because such an area is
    often an I/O mapping. Reading it may change device state, which is
    definitely an undesirable side effect.

    So adding a hugepage-specific bit to the coredump filter is better. It
    enables hugepage core dumping and doesn't cause any side effects on
    I/O devices.

    In addition, libhugetlb uses hugetlb private mapping pages as anonymous
    pages. Hence hugepage private mapping pages should be core dumped by
    default.

    So /proc/[pid]/coredump_filter gains two new bits:

    - bit 5: whether hugetlb private mapping pages are dumped (default: yes)
    - bit 6: whether hugetlb shared mapping pages are dumped (default: no)
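
The two bits can be decoded from a filter value as follows. This is a sketch: the macro and function names are hypothetical, and only the bit positions 5 and 6 come from the description above.

```c
#include <stdbool.h>

/* Hypothetical names for the two new coredump filter bits. */
#define FILTER_HUGETLB_PRIVATE (1UL << 5)
#define FILTER_HUGETLB_SHARED  (1UL << 6)

/* True if hugetlb private mapping pages are dumped under this filter. */
bool dumps_hugetlb_private(unsigned long filter)
{
    return (filter & FILTER_HUGETLB_PRIVATE) != 0;
}

/* True if hugetlb shared mapping pages are dumped under this filter. */
bool dumps_hugetlb_shared(unsigned long filter)
{
    return (filter & FILTER_HUGETLB_SHARED) != 0;
}
```

For the value 0x43 used in the test run, bit 6 is set and bit 5 is clear, so shared hugetlb mappings are dumped and private ones are not.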

    I tested with the following method.

    % ulimit -c unlimited
    % ./crash_hugepage 50
    % ./crash_hugepage 50 -p
    % ls -lh
    % gdb ./crash_hugepage core
    %
    % echo 0x43 > /proc/self/coredump_filter
    % ./crash_hugepage 50
    % ./crash_hugepage 50 -p
    % ls -lh
    % gdb ./crash_hugepage core

    #include <stdio.h>      /* printf */
    #include <stdlib.h>     /* atoi, exit */
    #include <unistd.h>     /* getopt, sleep */
    #include <sys/mman.h>   /* mmap, MAP_SHARED, MAP_PRIVATE */

    #include "hugetlbfs.h"

    int main(int argc, char** argv){
            char* p;
            int ch;
            int mmap_flags = MAP_SHARED;
            int fd;
            int nr_pages;

            /* -p selects a private mapping instead of a shared one */
            while((ch = getopt(argc, argv, "p")) != -1) {
                    switch (ch) {
                    case 'p':
                            mmap_flags &= ~MAP_SHARED;
                            mmap_flags |= MAP_PRIVATE;
                            break;
                    default:
                            /* nothing*/
                            break;
                    }
            }
            argc -= optind;
            argv += optind;

            if (argc == 0){
                    printf("need # of pages\n");
                    exit(1);
            }

            nr_pages = atoi(argv[0]);
            if (nr_pages < 2) {
                    printf("nr_pages must be >= 2\n");
                    exit(1);
            }

            fd = hugetlbfs_unlinked_fd();
            p = mmap(NULL, nr_pages * gethugepagesize(),
                     PROT_READ|PROT_WRITE, mmap_flags, fd, 0);

            sleep(2);

            /* touch the second hugepage */
            *(p + gethugepagesize()) = 1; /* COW */
            sleep(2);

            /* crash! */
            *(int*)0 = 1;

            return 0;
    }

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Kawai Hidehiro
    Cc: Hugh Dickins
    Cc: William Irwin
    Cc: Adam Litke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Improve debuggability of memory setup problems.

    Signed-off-by: Yinghai Lu
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
     
  • mm/hugetlb.c:265:17: warning: symbol 'resv_map_alloc' was not declared. Should it be static?
    mm/hugetlb.c:277:6: warning: symbol 'resv_map_release' was not declared. Should it be static?
    mm/hugetlb.c:292:9: warning: Using plain integer as NULL pointer
    mm/hugetlb.c:1750:5: warning: symbol 'unmap_ref_private' was not declared. Should it be static?

    Signed-off-by: Harvey Harrison
    Acked-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Harvey Harrison
     
  • Rewrite the vmap allocator to use rbtrees and lazy tlb flushing, and
    provide a fast, scalable percpu frontend for small vmaps (requires a
    slightly different API, though).

    The biggest problem with vmap is actually vunmap. Presently this requires
    a global kernel TLB flush, which on most architectures is a broadcast IPI
    to all CPUs to flush the cache. This is all done under a global lock. As
    the number of CPUs increases, so will the number of vunmaps a scaled
    workload will want to perform, and so will the cost of a global TLB flush.
    This gives terrible quadratic scalability characteristics.

    Another problem is that the entire vmap subsystem works under a single
    lock. It is a rwlock, but it is actually taken for write in all the fast
    paths, and the read locking would likely never be run concurrently anyway,
    so it's just pointless.

    This is a rewrite of vmap subsystem to solve those problems. The existing
    vmalloc API is implemented on top of the rewritten subsystem.

    The TLB flushing problem is solved by using lazy TLB unmapping. vmap
    addresses do not have to be flushed immediately when they are vunmapped,
    because the kernel will not reuse them again (would be a use-after-free)
    until they are reallocated. So the addresses aren't allocated again until
    a subsequent TLB flush. A single TLB flush then can flush multiple
    vunmaps from each CPU.
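
The effect of batching can be illustrated with a toy user-space model. It is not the kernel implementation: the structure, the function name and the idea of a fixed pending threshold are assumptions made only to show how one flush can cover many vunmaps.

```c
#include <stddef.h>

/* Toy model of lazy TLB flushing (all names hypothetical). */
struct lazy_vunmap {
    size_t pending;   /* unmapped areas whose TLB entries are not yet flushed */
    size_t threshold; /* flush once this many areas are pending */
    size_t flushes;   /* number of expensive global flushes performed */
};

/* Record one vunmap; flush only when enough areas have accumulated. */
void lazy_vunmap_one(struct lazy_vunmap *lv)
{
    lv->pending++;
    if (lv->pending >= lv->threshold) {
        lv->flushes++;   /* a single flush covers every pending area */
        lv->pending = 0; /* those addresses may now be reallocated */
    }
}
```

With a threshold of 32, one hundred vunmaps cost three flushes in this model instead of one hundred, and the addresses of the four areas still pending are simply not handed out again until the next flush.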

    XEN and PAT and such do not like deferred TLB flushing because they can't
    always handle multiple aliasing virtual addresses to a physical address.
    They now call vm_unmap_aliases() in order to flush any deferred mappings.
    That call is very expensive (well, actually not a lot more expensive than
    a single vunmap under the old scheme), however it should be OK if not
    called too often.

    The virtual memory extent information is stored in an rbtree rather than a
    linked list to improve the algorithmic scalability.

    There is a per-CPU allocator for small vmaps, which amortizes or avoids
    global locking.

    To use the per-CPU interface, the vm_map_ram / vm_unmap_ram interfaces
    must be used in place of vmap and vunmap. Vmalloc does not use these
    interfaces at the moment, so it will not be quite so scalable (although it
    will use lazy TLB flushing).

    As a quick test of performance, I ran a test that loops in the kernel,
    linearly mapping then touching then unmapping 4 pages. Different numbers
    of tests were run in parallel on a 4 core, 2 socket Opteron. Results are
    in nanoseconds per map+touch+unmap.

    threads    vanilla    vmap rewrite
    1            14700            2900
    2            33600            3000
    4            49500            2800
    8            70631            2900

    So with 8 cores, the rewritten version is already 25x faster.

    In a slightly more realistic test (although with an older and less
    scalable version of the patch), I ripped the not-very-good vunmap batching
    code out of XFS, and implemented the large buffer mapping with vm_map_ram
    and vm_unmap_ram... along with a couple of other tricks, I was able to
    speed up a large directory workload by 20x on a 64 CPU system. I believe
    vmap/vunmap is actually sped up a lot more than 20x on such a system, but
    I'm running into other locks now. vmap is pretty well blown off the
    profiles.

    Before:
    1352059 total 0.1401
    798784 _write_lock 8320.6667
    Cc: Jeremy Fitzhardinge
    Cc: Krzysztof Helt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • __vma_link_file and expand_downwards functions are not small, yet they
    are marked inline. They probably had one callsite sometime in the past,
    but now they have more. In order to prevent a similar thing, I also
    deinlined expand_upwards, despite it having only one callsite. Nowadays
    gcc auto-inlines such static functions anyway. In find_extend_vma, I
    removed one extra level of indirection.

    Patch is deliberately generated with -U $BIGNUM to make
    it easier to see that functions are big.

    Result:

    # size */*/mmap.o */vmlinux
       text    data     bss     dec     hex filename
       9514     188      16    9718    25f6 0.org/mm/mmap.o
       9237     188      16    9441    24e1 deinline/mm/mmap.o
    6124402  858996  389480 7372878  70804e 0.org/vmlinux
    6124113  858996  389480 7372589  707f2d deinline/vmlinux

    Signed-off-by: Denys Vlasenko
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Denys Vlasenko
     
  • trylock_buffer and unlock_buffer open and close a critical section.
    Hence, we can use the lock bitops to get the desired memory ordering.
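
The ordering that lock bitops provide can be sketched with C11 atomics: the trylock needs only acquire semantics and the unlock only release semantics. This is a user-space analogy, not the kernel's bitops; the names `bit_trylock`, `bit_unlock` and `LOCK_BIT` are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define LOCK_BIT 0x1UL

/* Try to take the lock bit. Acquire ordering keeps accesses inside the
 * critical section from being reordered before the lock is taken. */
bool bit_trylock(atomic_ulong *word)
{
    unsigned long old = atomic_fetch_or_explicit(word, LOCK_BIT,
                                                 memory_order_acquire);
    return !(old & LOCK_BIT);
}

/* Drop the lock bit. Release ordering keeps accesses inside the
 * critical section from being reordered after the unlock. */
void bit_unlock(atomic_ulong *word)
{
    atomic_fetch_and_explicit(word, ~LOCK_BIT, memory_order_release);
}
```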

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • trylock_page, unlock_page open and close a critical section. Hence,
    we can use the lock bitops to get the desired memory ordering.

    Also, mark trylock as likely to succeed (and remove the annotation from
    callers).

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • unlock_page is fairly expensive. It can be avoided in page reclaim
    success path. By definition if we have any other references to the page
    it would be a bug anyway.

    Signed-off-by: Nick Piggin
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Setting and clearing the page locked when inserting it into swapcache /
    pagecache when it has no other references can use non-atomic page flags
    operations because no other CPU may be operating on it at this time.

    This saves one atomic operation when inserting a page into pagecache.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Rework Posix error return for mlock().

    For some conditions, Posix requires the mlock*() system calls to return
    error codes that differ from what kernel low level functions, such as
    get_user_pages(), return for those conditions. For more info, see:

    http://marc.info/?l=linux-kernel&m=121750892930775&w=2

    This patch provides the same translation of get_user_pages()
    error codes to posix specified error codes in the context
    of the mlock rework for unevictable lru.
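
The kind of translation involved can be sketched as below. This is an illustration under assumptions, not a copy of the patch: the function name is hypothetical, and the mapping shown (EFAULT to ENOMEM for a range that is not fully mapped, ENOMEM to EAGAIN for memory that could not be locked) reflects the error descriptions in the Posix mlock() specification.

```c
#include <errno.h>

/* Translate a get_user_pages()-style negative errno into the value
 * Posix specifies for mlock(). Other values pass through unchanged. */
long mlock_posix_error(long retval)
{
    if (retval == -EFAULT)
        return -ENOMEM; /* part of the range is not a valid mapping */
    if (retval == -ENOMEM)
        return -EAGAIN; /* memory could not be locked at this time */
    return retval;
}
```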

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • This change is intended to make mlock() error returns correct.
    make_page_present() is a lower level function used by more than mlock().
    Subsequent patch[es] will add this error return fixup in an mlock specific
    path.

    Cc: KOSAKI Motohiro
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • During each reclaim scan we accumulate scan pressure on unrelated lists
    which will result in bogus scans and unwanted reclaims eventually.

    Scanning lists with few reclaim candidates results in a lot of rotation
    and therefore also disturbs the list balancing, putting even more
    pressure on the wrong lists.

    In a test-case with much streaming IO, and therefore a crowded inactive
    file page list, swapping started because

    a) anon pages were reclaimed after swap_cluster_max reclaim
    invocations -- nr_scan of this list has just accumulated

    b) active file pages were scanned because *their* nr_scan has also
    accumulated through the same logic. And this in return created a
    lot of rotation for file pages and resulted in a decrease of file
    list priority, again increasing the pressure on anon pages.

    The result was an evicted working set of anon pages while there were
    tons of inactive file pages that should have been taken instead.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Several LRU manipulation functions are no longer used, so they can be
    removed.

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Allow free of mlock()ed pages. This shouldn't happen, but during
    development, it occasionally did.

    This patch allows us to survive that condition, while keeping the
    statistics and events correct for debug.

    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • This patch adds a function to scan individual or all zones' unevictable
    lists and move any pages that have become evictable onto the respective
    zone's inactive list, where shrink_inactive_list() will deal with them.

    Adds sysctl to scan all nodes, and per node attributes to individual
    nodes' zones.

    Kosaki: If an evictable page is found on the unevictable lru when writing
    to /proc/sys/vm/scan_unevictable_pages, print the filename and file
    offset of that page.

    [akpm@linux-foundation.org: fix one CONFIG_MMU=n build error]
    [kosaki.motohiro@jp.fujitsu.com: adapt vmscan-unevictable-lru-scan-sysctl.patch to new sysfs API]
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • In the fault paths that install new anonymous pages, check whether the
    page is evictable or not using lru_cache_add_active_or_unevictable(). If
    the page is evictable, just add it to the active lru list [via the pagevec
    cache], else add it to the unevictable list.

    This "proactive" culling in the fault path mimics the handling of mlocked
    pages in Nick Piggin's series to keep mlocked pages off the lru lists.

    Notes:

    1) This patch is optional--e.g., if one is concerned about the
    additional test in the fault path. We can defer the moving of
    nonreclaimable pages until when vmscan [shrink_*_list()]
    encounters them. Vmscan will only need to handle such pages
    once, but if there are a lot of them it could impact system
    performance.

    2) The 'vma' argument to page_evictable() is required to notice that
    we're faulting a page into an mlock()ed vma w/o having to scan the
    page's rmap in the fault path. Culling mlock()ed anon pages is
    currently the only reason for this patch.

    3) We can't cull swap pages in read_swap_cache_async() because the
    vma argument doesn't necessarily correspond to the swap cache
    offset passed in by swapin_readahead(). This could [did!] result
    in mlocking pages in non-VM_LOCKED vmas if [when] we tried to
    cull in this path.

    4) Move set_pte_at() to after where we add page to lru to keep it
    hidden from other tasks that might walk the page table.
    We already do it in this order in do_anonymous_page(). And,
    these are COW'd anon pages. Is this safe?

    [riel@redhat.com: undo an overzealous code cleanup]
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Add NR_MLOCK zone page state, which provides a (conservative) count of
    mlocked pages (actually, the number of mlocked pages moved off the LRU).

    Reworked by lts to fit in with the modified mlock page support in the
    Reclaim Scalability series.

    [kosaki.motohiro@jp.fujitsu.com: fix incorrect Mlocked field of /proc/meminfo]
    [lee.schermerhorn@hp.com: mlocked-pages: add event counting with statistics]
    Signed-off-by: Nick Piggin
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Originally by Nick Piggin

    Remove mlocked pages from the LRU using "unevictable infrastructure"
    during mmap(), munmap(), mremap() and truncate(). Try to move back to
    normal LRU lists on munmap() when last mlocked mapping removed. Remove
    PageMlocked() status when page truncated from file.

    [akpm@linux-foundation.org: cleanup]
    [kamezawa.hiroyu@jp.fujitsu.com: fix double unlock_page()]
    [kosaki.motohiro@jp.fujitsu.com: split LRU: munlock rework]
    [lee.schermerhorn@hp.com: mlock: fix __mlock_vma_pages_range comment block]
    [akpm@linux-foundation.org: remove bogus kerneldoc token]
    Signed-off-by: Nick Piggin
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • We need to hold the mmap_sem for write to initiate mlock()/munlock()
    because we may need to merge/split vmas. However, this can lead to very
    long lock hold times attempting to fault in a large memory region to mlock
    it into memory. This can hold off other faults against the mm
    [multithreaded tasks] and other scans of the mm, such as via /proc. To
    alleviate this, downgrade the mmap_sem to read mode during the population
    of the region for locking. This is especially the case if we need to
    reclaim memory to lock down the region. We [probably?] don't need to do
    this for unlocking as all of the pages should be resident--they're already
    mlocked.

    Now, the callers of the mlock functions [mlock_fixup() and
    mlock_vma_pages_range()] expect the mmap_sem to be returned in write mode.
    Changing all callers appears to be way too much effort at this point.
    So, restore write mode before returning. Note that this opens a window
    where the mmap list could change in a multithreaded process. So, at least
    for mlock_fixup(), where we could be called in a loop over multiple vmas,
    we check that a vma still exists at the start address and that vma still
    covers the page range [start,end). If not, we return an error, -EAGAIN,
    and let the caller deal with it.

    Return -EAGAIN from mlock_vma_pages_range() function and mlock_fixup() if
    the vma at 'start' disappears or changes so that the page range
    [start,end) is no longer contained in the vma. Again, let the caller deal
    with it. Looks like only sys_remap_file_pages() [via mmap_region()]
    should actually care.
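
The recheck after reacquiring the lock in write mode amounts to a simple containment test. The sketch below models it in user space; `struct toy_vma` and `recheck_range` are hypothetical stand-ins, with a NULL pointer representing a vma that no longer exists at the start address.

```c
#include <errno.h>
#include <stddef.h>

/* Toy stand-in for a vma covering the half-open range [start, end). */
struct toy_vma {
    unsigned long start;
    unsigned long end;
};

/* Return 0 if the vma found at 'start' still covers [start, end),
 * or -EAGAIN if it disappeared or no longer contains the range. */
int recheck_range(const struct toy_vma *vma,
                  unsigned long start, unsigned long end)
{
    if (vma == NULL || vma->start > start || vma->end < end)
        return -EAGAIN;
    return 0;
}
```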

    With this patch, I no longer see processes like ps(1) blocked for seconds
    or minutes at a time waiting for a large [multiple gigabyte] region to be
    locked down. However, I occasionally see delays while unlocking or
    unmapping a large mlocked region. Should we also downgrade the mmap_sem
    for the unlock path?

    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Documentation for unevictable lru list and its usage.

    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Make sure that mlocked pages also live on the unevictable LRU, so kswapd
    will not scan them over and over again.

    This is achieved through various strategies:

    1) add yet another page flag--PG_mlocked--to indicate that
    the page is locked for efficient testing in vmscan and,
    optionally, fault path. This allows early culling of
    unevictable pages, preventing them from getting to
    page_referenced()/try_to_unmap(). Also allows separate
    accounting of mlock'd pages, as Nick's original patch
    did.

    Note: Nick's original mlock patch used a PG_mlocked
    flag. I had removed this in favor of the PG_unevictable
    flag + an mlock_count [new page struct member]. I
    restored the PG_mlocked flag to eliminate the new
    count field.

    2) add the mlock/unevictable infrastructure to mm/mlock.c,
    with internal APIs in mm/internal.h. This is a rework
    of Nick's original patch to these files, taking into
    account that mlocked pages are now kept on unevictable
    LRU list.

    3) update vmscan.c:page_evictable() to check PageMlocked()
    and, if vma passed in, the vm_flags. Note that the vma
    will only be passed in for new pages in the fault path;
    and then only if the "cull unevictable pages in fault
    path" patch is included.

    4) add try_to_unlock() to rmap.c to walk a page's rmap and
    ClearPageMlocked() if no other vmas have it mlocked.
    Reuses as much of try_to_unmap() as possible. This
    effectively replaces the use of one of the lru list links
    as an mlock count. If this mechanism lets pages in mlocked
    vmas leak through w/o PG_mlocked set [I don't know that it
    does], we should catch them later in try_to_unmap(). One
    hopes this will be rare, as it will be relatively expensive.
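
The check described in strategy 3 can be modelled in a few lines. This is a toy user-space model with hypothetical types and names, not the vmscan.c code: it only shows the two conditions named above, the page's mlocked flag and, when a vma is passed in, its flags.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_VM_LOCKED 0x1UL /* stand-in for the vma's locked flag */

struct toy_page {
    bool mlocked;           /* stand-in for PageMlocked() */
};

struct toy_vma {
    unsigned long vm_flags;
};

/* A page is unevictable if it is already marked mlocked, or if it is
 * being faulted into a locked vma. vma may be NULL, as it is for every
 * caller outside the fault path. */
bool toy_page_evictable(const struct toy_page *page,
                        const struct toy_vma *vma)
{
    if (page->mlocked)
        return false;
    if (vma != NULL && (vma->vm_flags & TOY_VM_LOCKED))
        return false;
    return true;
}
```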

    Original mm/internal.h, mm/rmap.c and mm/mlock.c changes:
    Signed-off-by: Nick Piggin

    splitlru: introduce __get_user_pages():

    New munlock processing needs GUP_FLAGS_IGNORE_VMA_PERMISSIONS because
    the current get_user_pages() can't grab PROT_NONE pages, and therefore
    PROT_NONE pages can't be munlocked.

    [akpm@linux-foundation.org: fix this for pagemap-pass-mm-into-pagewalkers.patch]
    [akpm@linux-foundation.org: untangle patch interdependencies]
    [akpm@linux-foundation.org: fix things after out-of-order merging]
    [hugh@veritas.com: fix page-flags mess]
    [lee.schermerhorn@hp.com: fix munlock page table walk - now requires 'mm']
    [kosaki.motohiro@jp.fujitsu.com: build fix]
    [kosaki.motohiro@jp.fujitsu.com: fix truncate race and several comments]
    [kosaki.motohiro@jp.fujitsu.com: splitlru: introduce __get_user_pages()]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Rik van Riel
    Signed-off-by: Lee Schermerhorn
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Matt Mackall
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Shmem segments locked into memory via shmctl(SHM_LOCK) should not be
    kept on the normal LRU, since scanning them is a waste of time and might
    throw off kswapd's balancing algorithms. Place them on the unevictable
    LRU list instead.

    Use the AS_UNEVICTABLE flag to mark address_space of SHM_LOCKed shared
    memory regions as unevictable. Then these pages will be culled off the
    normal LRU lists during vmscan.

    Add new wrapper function to clear the mapping's unevictable state when/if
    shared memory segment is munlocked.

    Add 'scan_mapping_unevictable_page()' to mm/vmscan.c to scan all pages in
    the shmem segment's mapping [struct address_space] for evictability now
    that they're no longer locked. Pages found to be evictable are moved to
    the appropriate zone lru list.

    Changes depend on [CONFIG_]UNEVICTABLE_LRU.

    [kosaki.motohiro@jp.fujitsu.com: revert shm change]
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Signed-off-by: Kosaki Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Christoph Lameter pointed out that ram disk pages also clutter the LRU
    lists. When vmscan finds them dirty and tries to clean them, the ram disk
    writeback function just redirties the page so that it goes back onto the
    active list. Round and round she goes...

    With the ram disk driver [rd.c] replaced by the newer 'brd.c', this is no
    longer the case, as ram disk pages are no longer maintained on the lru.
    [This makes them unmigratable for defrag or memory hot remove, but that
    can be addressed by a separate patch series.] However, the ramfs pages
    behave like ram disk pages used to, so:

    Define new address_space flag [shares address_space flags member with
    mapping's gfp mask] to indicate that the address space contains all
    unevictable pages. This will provide for efficient testing of ramfs pages
    in page_evictable().

    Also provide wrapper functions to set/test the unevictable state to
    minimize #ifdefs in ramfs driver and any other users of this facility.

    Set the unevictable state on address_space structures for new ramfs
    inodes. Test the unevictable state in page_evictable() to cull
    unevictable pages.
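
    A minimal userspace sketch of the flag and its wrappers, assuming a
    flags word whose low bits hold the gfp mask. The bit position, the
    GFP_BITS_MODEL width, and the *_model structs are assumptions made
    for illustration, not the kernel's actual layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed layout: low bits carry the gfp mask, AS_* flags sit above them. */
#define GFP_BITS_MODEL        20u
#define AS_UNEVICTABLE_MODEL  (1ul << (GFP_BITS_MODEL + 2))

struct address_space_model {
    unsigned long flags;  /* gfp mask and AS_* flags share this member */
};

struct page_model {
    struct address_space_model *mapping;  /* NULL for unmapped pages */
};

static inline void mapping_set_unevictable(struct address_space_model *m)
{
    m->flags |= AS_UNEVICTABLE_MODEL;
}

static inline void mapping_clear_unevictable(struct address_space_model *m)
{
    m->flags &= ~AS_UNEVICTABLE_MODEL;
}

static inline bool mapping_unevictable(const struct address_space_model *m)
{
    return m != NULL && (m->flags & AS_UNEVICTABLE_MODEL) != 0;
}

/* One cheap test per page: is the whole mapping marked unevictable? */
static inline bool page_evictable_model(const struct page_model *page)
{
    return !mapping_unevictable(page->mapping);
}
```

    Keeping the check to a single bit test on the mapping is what makes
    it cheap enough to run for every page that vmscan looks at.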

    These changes depend on [CONFIG_]UNEVICTABLE_LRU.

    [riel@redhat.com: undo the brd.c part]
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Debugged-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Report unevictable pages per zone and system wide.

    Kosaki Motohiro added support for memory controller unevictable
    statistics.

    [riel@redhat.com: fix printk in show_free_areas()]
    [akpm@linux-foundation.org: fix units in /proc/vmstats]
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Debugged-by: Hiroshi Shimamoto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Fix to unevictable-lru-page-statistics.patch

    Add unevictable lru infrastructure vm events to the statistics patch.
    Rename the "NORECL_" and "noreclaim_" symbols and text strings to
    "UNEVICTABLE_" and "unevictable_", respectively.

    Currently, both the infrastructure and the mlocked pages event are
    added by a single patch later in the series. This makes it difficult
    to add or rework the incremental patches. The events actually "belong"
    with the stats, so pull them up to here.

    Also, restore the event counting to putback_lru_page(). This was removed
    from previous patch in series where it was "misplaced". The actual events
    weren't defined that early.

    Signed-off-by: Lee Schermerhorn
    Cc: Rik van Riel
    Reviewed-by: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • When the system contains lots of mlocked or otherwise unevictable pages,
    the pageout code (kswapd) can spend lots of time scanning over these
    pages. Worse still, the presence of lots of unevictable pages can confuse
    kswapd into thinking that more aggressive pageout modes are required,
    resulting in all kinds of bad behaviour.

    Infrastructure to manage pages excluded from reclaim--i.e., hidden from
    vmscan. Based on a patch by Larry Woodman of Red Hat. Reworked to
    maintain "unevictable" pages on a separate per-zone LRU list, to "hide"
    them from vmscan.

    Kosaki Motohiro added the support for the memory controller unevictable
    lru list.

    Pages on the unevictable list have both PG_unevictable and PG_lru set.
    Thus, PG_unevictable is analogous to and mutually exclusive with
    PG_active--it specifies which LRU list the page is on.

    The unevictable infrastructure is enabled by a new mm Kconfig option
    [CONFIG_]UNEVICTABLE_LRU.

    A new function 'page_evictable(page, vma)' in vmscan.c tests whether or
    not a page may be evictable. Subsequent patches will add the various
    !evictable tests. We'll want to keep these tests light-weight for use in
    shrink_active_list() and, possibly, the fault path.

    To avoid races between tasks putting pages [back] onto an LRU list and
    tasks that might be moving the page from non-evictable to evictable state,
    the new function 'putback_lru_page()' -- inverse to 'isolate_lru_page()'
    -- tests the "evictability" of a page after placing it on the LRU, before
    dropping the reference. If the page has become unevictable,
    putback_lru_page() will redo the 'putback', thus moving the page to the
    unevictable list. This way, we avoid "stranding" evictable pages on the
    unevictable list.
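
    The recheck-after-putback idea can be sketched in a single-threaded
    model. Everything here is hypothetical: the *_model types, the race
    hook that simulates a concurrent state change, and the attempt
    counter exist only to make the retry visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum lru_model { LRU_NONE, LRU_EVICTABLE, LRU_UNEVICTABLE };

struct page_model {
    bool evictable;       /* what a page_evictable() test would report */
    enum lru_model lru;   /* which list the page currently sits on */
    int refcount;
};

/* Simulates another task changing the page's state during the putback. */
typedef void (*race_hook)(struct page_model *);

static void flip_evictable(struct page_model *page)
{
    page->evictable = !page->evictable;
}

/* Returns the number of putback attempts that were needed. */
static int putback_lru_page_model(struct page_model *page, race_hook hook)
{
    int attempts = 0;

    for (;;) {
        bool sampled = page->evictable;

        attempts++;
        page->lru = sampled ? LRU_EVICTABLE : LRU_UNEVICTABLE;

        if (hook) {           /* fire the simulated race exactly once */
            hook(page);
            hook = NULL;
        }

        /* Recheck after the page is on a list, before dropping the ref. */
        if (page->evictable == sampled)
            break;

        page->lru = LRU_NONE; /* wrong list: isolate again and redo */
    }

    page->refcount--;
    return attempts;
}
```

    Calling putback_lru_page_model(&page, flip_evictable) on a page that
    starts out unevictable takes two attempts and leaves the page on the
    evictable list, rather than stranded on the unevictable one.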

    [akpm@linux-foundation.org: fix fallout from out-of-order merge]
    [riel@redhat.com: fix UNEVICTABLE_LRU and !PROC_PAGE_MONITOR build]
    [nishimura@mxp.nes.nec.co.jp: remove redundant mapping check]
    [kosaki.motohiro@jp.fujitsu.com: unevictable-lru-infrastructure: putback_lru_page()/unevictable page handling rework]
    [kosaki.motohiro@jp.fujitsu.com: kill unnecessary lock_page() in vmscan.c]
    [kosaki.motohiro@jp.fujitsu.com: revert migration change of unevictable lru infrastructure]
    [kosaki.motohiro@jp.fujitsu.com: revert to unevictable-lru-infrastructure-kconfig-fix.patch]
    [kosaki.motohiro@jp.fujitsu.com: restore patch failure of vmstat-unevictable-and-mlocked-pages-vm-events.patch]
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Debugged-by: Benjamin Kidwell
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Define proper false/noop inline functions for noreclaim page flags when
    !defined(CONFIG_UNEVICTABLE_LRU)

    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • During an AIM7 run on a 16GB system, fork started failing around 32000
    threads, despite the system having plenty of free swap and 15GB of
    pageable memory. This was on x86-64, so 8k stacks.

    If a higher order allocation fails, we can either:
    - keep evicting pages off the end of the LRUs and hope that
    we eventually create a contiguous region; this is somewhat
    unlikely if the system is under enough stress by new
    allocations
    - after trying normal eviction for a bit, use lumpy reclaim

    This patch switches the system to lumpy reclaim if the VM is having
    trouble freeing enough pages, using the same threshold for detection as
    used by pageout congestion wait.
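
    A rough sketch of such a decision, with assumptions called out: the
    message does not state the threshold, so the "priority below
    DEF_PRIORITY - 2" test, the constant values, and the helper name
    use_lumpy_reclaim() are illustrative guesses rather than the
    patch's code.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values, chosen for illustration only. */
#define DEF_PRIORITY_MODEL             12  /* starting scan priority */
#define PAGE_ALLOC_COSTLY_ORDER_MODEL   3  /* orders above this are "costly" */

/*
 * order:    allocation order being reclaimed for (0 = single page)
 * priority: current scan priority; it drops as reclaim struggles
 */
static bool use_lumpy_reclaim(int order, int priority)
{
    /* Costly high-order allocations go lumpy right away. */
    if (order > PAGE_ALLOC_COSTLY_ORDER_MODEL)
        return true;

    /* Smaller orders go lumpy only after normal eviction has struggled. */
    if (order > 0 && priority < DEF_PRIORITY_MODEL - 2)
        return true;

    return false;
}
```

    Under these assumptions an order-1 request, such as an 8k stack
    built from 4k pages, starts with normal eviction and switches to
    lumpy reclaim once the scan priority has dropped below 10.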

    Signed-off-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Swapin_readahead can read in a lot of data that the processes in memory
    never need. Adding swap cache pages to the inactive list prevents them
    from putting too much pressure on the working set.

    This has the potential to help the programs that are already in memory,
    but it could also be a disadvantage to processes that are trying to get
    swapped in.

    Signed-off-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel