07 Nov, 2011

1 commit

  • * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
    Revert "tracing: Include module.h in define_trace.h"
    irq: don't put module.h into irq.h for tracking irqgen modules.
    bluetooth: macroize two small inlines to avoid module.h
    ip_vs.h: fix implicit use of module_get/module_put from module.h
    nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
    include: replace linux/module.h with "struct module" wherever possible
    include: convert various register fcns to macros to avoid include chaining
    crypto.h: remove unused crypto_tfm_alg_modname() inline
    uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
    pm_runtime.h: explicitly requires notifier.h
    linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
    miscdevice.h: fix up implicit use of lists and types
    stop_machine.h: fix implicit use of smp.h for smp_processor_id
    of: fix implicit use of errno.h in include/linux/of.h
    of_platform.h: delete needless include
    acpi: remove module.h include from platform/aclinux.h
    miscdevice.h: delete unnecessary inclusion of module.h
    device_cgroup.h: delete needless include
    net: sch_generic remove redundant use of
    net: inet_timewait_sock doesnt need
    ...

    Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
    - drivers/media/dvb/frontends/dibx000_common.c
    - drivers/media/video/{mt9m111.c,ov6650.c}
    - drivers/mfd/ab3550-core.c
    - include/linux/dmaengine.h

    Linus Torvalds
     

03 Nov, 2011

1 commit

  • Michel while working on the working set estimation code, noticed that
    calling get_page_unless_zero() on a random pfn_to_page(random_pfn)
    wasn't safe, if the pfn ended up being a tail page of a transparent
    hugepage under splitting by __split_huge_page_refcount().

    He then found the problem could also theoretically materialize with
    page_cache_get_speculative() during the speculative radix tree lookups
    that uses get_page_unless_zero() in SMP if the radix tree page is freed
    and reallocated and get_user_pages is called on it before
    page_cache_get_speculative has a chance to call get_page_unless_zero().

    So the best way to fix the problem is to keep page_tail->_count zero at
    all times. This will guarantee that get_page_unless_zero() can never
    succeed on any tail page. page_tail->_mapcount is guaranteed zero and
    is unused for all tail pages of a compound page, so we can simply
    account the tail page references there and transfer them to
    tail_page->_count in __split_huge_page_refcount() (in addition to the
    head_page->_mapcount).

    While debugging this s/_count/_mapcount/ change I also noticed get_page is
    called by direct-io.c on pages returned by get_user_pages. That wasn't
    entirely safe because the two atomic_inc in get_page weren't atomic. As
    opposed to other get_user_page users like secondary-MMU page fault to
    establish the shadow pagetables would never call any superflous get_page
    after get_user_page returns. It's safer to make get_page universally safe
    for tail pages and to use get_page_foll() within follow_page (inside
    get_user_pages()). get_page_foll() is safe to do the refcounting for tail
    pages without taking any locks because it is run within PT lock protected
    critical sections (PT lock for pte and page_table_lock for
    pmd_trans_huge).

    The standard get_page() as invoked by direct-io instead will now take
    the compound_lock but still only for tail pages. The direct-io paths
    are usually I/O bound and the compound_lock is per THP so very
    finegrined, so there's no risk of scalability issues with it. A simple
    direct-io benchmarks with all lockdep prove locking and spinlock
    debugging infrastructure enabled shows identical performance and no
    overhead. So it's worth it. Ideally direct-io should stop calling
    get_page() on pages returned by get_user_pages(). The spinlock in
    get_page() is already optimized away for no-THP builds but doing
    get_page() on tail pages returned by GUP is generally a rare operation
    and usually only run in I/O paths.

    This new refcounting on page_tail->_mapcount in addition to avoiding new
    RCU critical sections will also allow the working set estimation code to
    work without any further complexity associated to the tail page
    refcounting with THP.

    Signed-off-by: Andrea Arcangeli
    Reported-by: Michel Lespinasse
    Reviewed-by: Michel Lespinasse
    Reviewed-by: Minchan Kim
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Benjamin Herrenschmidt
    Cc: David Gibson
    Cc:
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

31 Oct, 2011

1 commit


25 May, 2011

2 commits

  • The zone->lru_lock is heavily contented in workload where activate_page()
    is frequently used. We could do batch activate_page() to reduce the lock
    contention. The batched pages will be added into zone list when the pool
    is full or page reclaim is trying to drain them.

    For example, in a 4 socket 64 CPU system, create a sparse file and 64
    processes, processes shared map to the file. Each process read access the
    whole file and then exit. The process exit will do unmap_vmas() and cause
    a lot of activate_page() call. In such workload, we saw about 58% total
    time reduction with below patch. Other workloads with a lot of
    activate_page also benefits a lot too.

    Andrew Morton suggested activate_page() and putback_lru_pages() should
    follow the same path to active pages, but this is hard to implement (see
    commit 7a608572a282a ("Revert "mm: batch activate_page() to reduce lock
    contention")). On the other hand, do we really need putback_lru_pages()
    to follow the same path? I tested several FIO/FFSB benchmark (about 20
    scripts for each benchmark) in 3 machines here from 2 sockets to 4
    sockets. My test doesn't show anything significant with/without below
    patch (there is slight difference but mostly some noise which we found
    even without below patch before). Below patch basically returns to the
    same as my first post.

    I tested some microbenchmarks:
    case-anon-cow-rand-mt 0.58%
    case-anon-cow-rand -3.30%
    case-anon-cow-seq-mt -0.51%
    case-anon-cow-seq -5.68%
    case-anon-r-rand-mt 0.23%
    case-anon-r-rand 0.81%
    case-anon-r-seq-mt -0.71%
    case-anon-r-seq -1.99%
    case-anon-rx-rand-mt 2.11%
    case-anon-rx-seq-mt 3.46%
    case-anon-w-rand-mt -0.03%
    case-anon-w-rand -0.50%
    case-anon-w-seq-mt -1.08%
    case-anon-w-seq -0.12%
    case-anon-wx-rand-mt -5.02%
    case-anon-wx-seq-mt -1.43%
    case-fork 1.65%
    case-fork-sleep -0.07%
    case-fork-withmem 1.39%
    case-hugetlb -0.59%
    case-lru-file-mmap-read-mt -0.54%
    case-lru-file-mmap-read 0.61%
    case-lru-file-mmap-read-rand -2.24%
    case-lru-file-readonce -0.64%
    case-lru-file-readtwice -11.69%
    case-lru-memcg -1.35%
    case-mmap-pread-rand-mt 1.88%
    case-mmap-pread-rand -15.26%
    case-mmap-pread-seq-mt 0.89%
    case-mmap-pread-seq -69.72%
    case-mmap-xread-rand-mt 0.71%
    case-mmap-xread-seq-mt 0.38%

    The most significent are:
    case-lru-file-readtwice -11.69%
    case-mmap-pread-rand -15.26%
    case-mmap-pread-seq -69.72%

    which use activate_page a lot. others are basically variations because
    each run has slightly difference.

    In UP case, 'size mm/swap.o'
    before the two patches:
    text data bss dec hex filename
    6466 896 4 7366 1cc6 mm/swap.o
    after the two patches:
    text data bss dec hex filename
    6343 896 4 7243 1c4b mm/swap.o

    Signed-off-by: Shaohua Li
    Cc: KOSAKI Motohiro
    Cc: Hiroyuki Kamezawa
    Cc: Andi Kleen
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • It's pointless that deactive_page's operates on unevictable pages. This
    patch removes unnecessary overhead which might be a bit problem in case
    that there are many unevictable page in system(ex, mprotect workload)

    [akpm@linux-foundation.org: tidy up comment]
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Minchan Kim
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

12 May, 2011

1 commit

  • The lru_deactivate_fn should not move page which in on unevictable lru
    into inactive list. Otherwise, we can meet BUG when we use
    isolate_lru_pages as __isolate_lru_page could return -EINVAL.

    Reported-by: Ying Han
    Tested-by: Ying Han
    Signed-off-by: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

23 Mar, 2011

4 commits

  • Clean up code and remove duplicate code. Next patch will use
    pagevec_lru_move_fn introduced here too.

    Signed-off-by: Shaohua Li
    Cc: KOSAKI Motohiro
    Cc: Hiroyuki Kamezawa
    Cc: Andi Kleen
    Reviewed-by: Minchan Kim
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • invalidate_mapping_pages is very big hint to reclaimer. It means user
    doesn't want to use the page any more. So in order to prevent working set
    page eviction, this patch move the page into tail of inactive list by
    PG_reclaim.

    Please, remember that pages in inactive list are working set as well as
    active list. If we don't move pages into inactive list's tail, pages near
    by tail of inactive list can be evicted although we have a big clue about
    useless pages. It's totally bad.

    Now PG_readahead/PG_reclaim is shared. fe3cba17 added ClearPageReclaim
    into clear_page_dirty_for_io for preventing fast reclaiming readahead
    marker page.

    In this series, PG_reclaim is used by invalidated page, too. If VM find
    the page is invalidated and it's dirty, it sets PG_reclaim to reclaim
    asap. Then, when the dirty page will be writeback,
    clear_page_dirty_for_io will clear PG_reclaim unconditionally. It
    disturbs this serie's goal.

    I think it's okay to clear PG_readahead when the page is dirty, not
    writeback time. So this patch moves ClearPageReadahead. In v4,
    ClearPageReadahead in set_page_dirty has a problem which is reported by
    Steven Barrett. It's due to compound page. Some driver(ex, audio) calls
    set_page_dirty with compound page which isn't on LRU. but my patch does
    ClearPageRelcaim on compound page. In non-CONFIG_PAGEFLAGS_EXTENDED, it
    breaks PageTail flag.

    I think it doesn't affect THP and pass my test with THP enabling but Cced
    Andrea for double check.

    Signed-off-by: Minchan Kim
    Reported-by: Steven Barrett
    Reviewed-by: Johannes Weiner
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Cc: Wu Fengguang
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • The rotate_reclaimable_page function moves just written out pages, which
    the VM wanted to reclaim, to the end of the inactive list. That way the
    VM will find those pages first next time it needs to free memory.

    This patch applies the rule in memcg. It can help to prevent unnecessary
    working page eviction of memcg.

    Signed-off-by: Minchan Kim
    Acked-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Reviewed-by: Rik van Riel
    Cc: KOSAKI Motohiro
    Acked-by: Johannes Weiner
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Recently, there are reported problem about thrashing.
    (http://marc.info/?l=rsync&m=128885034930933&w=2) It happens by backup
    workloads(ex, nightly rsync). That's because the workload makes just
    use-once pages and touches pages twice. It promotes the page into active
    list so that it results in working set page eviction.

    Some app developer want to support POSIX_FADV_NOREUSE. But other OSes
    don't support it, either.
    (http://marc.info/?l=linux-mm&m=128928979512086&w=2)

    By other approach, app developers use POSIX_FADV_DONTNEED. But it has a
    problem. If kernel meets page is writing during invalidate_mapping_pages,
    it can't work. It makes for application programmer to use it since they
    always have to sync data before calling fadivse(..POSIX_FADV_DONTNEED) to
    make sure the pages could be discardable. At last, they can't use
    deferred write of kernel so that they could see performance loss.
    (http://insights.oetiker.ch/linux/fadvise.html)

    In fact, invalidation is very big hint to reclaimer. It means we don't
    use the page any more. So let's move the writing page into inactive
    list's head if we can't truncate it right now.

    Why I move page to head of lru on this patch, Dirty/Writeback page would
    be flushed sooner or later. It can prevent writeout of pageout which is
    less effective than flusher's writeout.

    Originally, I reused lru_demote of Peter with some change so added his
    Signed-off-by.

    Signed-off-by: Minchan Kim
    Reported-by: Ben Gamari
    Signed-off-by: Peter Zijlstra
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Reviewed-by: KOSAKI Motohiro
    Cc: Wu Fengguang
    Acked-by: Johannes Weiner
    Cc: Nick Piggin
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

18 Jan, 2011

2 commits

  • This reverts commit d8505dee1a87b8d41b9c4ee1325cd72258226fbc.

    Chris Mason ended up chasing down some page allocation errors and pages
    stuck waiting on the IO scheduler, and was able to narrow it down to two
    commits: commit 744ed1442757 ("mm: batch activate_page() to reduce lock
    contention") and d8505dee1a87 ("mm: simplify code of swap.c").

    This reverts the second one.

    Reported-and-debugged-by: Chris Mason
    Cc: Mel Gorman
    Cc: Andrew Morton
    Cc: Jens Axboe
    Cc: linux-mm
    Cc: KAMEZAWA Hiroyuki
    Cc: Andrea Arcangeli
    Cc: Shaohua Li
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • This reverts commit 744ed1442757767ffede5008bb13e0805085902e.

    Chris Mason ended up chasing down some page allocation errors and pages
    stuck waiting on the IO scheduler, and was able to narrow it down to two
    commits: commit 744ed1442757 ("mm: batch activate_page() to reduce lock
    contention") and d8505dee1a87 ("mm: simplify code of swap.c").

    This reverts the first of them.

    Reported-and-debugged-by: Chris Mason
    Cc: Mel Gorman
    Cc: Andrew Morton
    Cc: Jens Axboe
    Cc: linux-mm
    Cc: KAMEZAWA Hiroyuki
    Cc: Andrea Arcangeli
    Cc: Shaohua Li
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

14 Jan, 2011

5 commits

  • The zone->lru_lock is heavily contented in workload where activate_page()
    is frequently used. We could do batch activate_page() to reduce the lock
    contention. The batched pages will be added into zone list when the pool
    is full or page reclaim is trying to drain them.

    For example, in a 4 socket 64 CPU system, create a sparse file and 64
    processes, processes shared map to the file. Each process read access the
    whole file and then exit. The process exit will do unmap_vmas() and cause
    a lot of activate_page() call. In such workload, we saw about 58% total
    time reduction with below patch. Other workloads with a lot of
    activate_page also benefits a lot too.

    I tested some microbenchmarks:
    case-anon-cow-rand-mt 0.58%
    case-anon-cow-rand -3.30%
    case-anon-cow-seq-mt -0.51%
    case-anon-cow-seq -5.68%
    case-anon-r-rand-mt 0.23%
    case-anon-r-rand 0.81%
    case-anon-r-seq-mt -0.71%
    case-anon-r-seq -1.99%
    case-anon-rx-rand-mt 2.11%
    case-anon-rx-seq-mt 3.46%
    case-anon-w-rand-mt -0.03%
    case-anon-w-rand -0.50%
    case-anon-w-seq-mt -1.08%
    case-anon-w-seq -0.12%
    case-anon-wx-rand-mt -5.02%
    case-anon-wx-seq-mt -1.43%
    case-fork 1.65%
    case-fork-sleep -0.07%
    case-fork-withmem 1.39%
    case-hugetlb -0.59%
    case-lru-file-mmap-read-mt -0.54%
    case-lru-file-mmap-read 0.61%
    case-lru-file-mmap-read-rand -2.24%
    case-lru-file-readonce -0.64%
    case-lru-file-readtwice -11.69%
    case-lru-memcg -1.35%
    case-mmap-pread-rand-mt 1.88%
    case-mmap-pread-rand -15.26%
    case-mmap-pread-seq-mt 0.89%
    case-mmap-pread-seq -69.72%
    case-mmap-xread-rand-mt 0.71%
    case-mmap-xread-seq-mt 0.38%

    The most significent are:
    case-lru-file-readtwice -11.69%
    case-mmap-pread-rand -15.26%
    case-mmap-pread-seq -69.72%

    which use activate_page a lot. others are basically variations because
    each run has slightly difference.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Shaohua Li
    Cc: Andi Kleen
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • Clean up code and remove duplicate code. Next patch will use
    pagevec_lru_move_fn introduced here too.

    Signed-off-by: Shaohua Li
    Cc: Andi Kleen
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • Lately I've been working to make KVM use hugepages transparently without
    the usual restrictions of hugetlbfs. Some of the restrictions I'd like to
    see removed:

    1) hugepages have to be swappable or the guest physical memory remains
    locked in RAM and can't be paged out to swap

    2) if a hugepage allocation fails, regular pages should be allocated
    instead and mixed in the same vma without any failure and without
    userland noticing

    3) if some task quits and more hugepages become available in the
    buddy, guest physical memory backed by regular pages should be
    relocated on hugepages automatically in regions under
    madvise(MADV_HUGEPAGE) (ideally event driven by waking up the
    kernel deamon if the order=HPAGE_PMD_SHIFT-PAGE_SHIFT list becomes
    not null)

    4) avoidance of reservation and maximization of use of hugepages whenever
    possible. Reservation (needed to avoid runtime fatal faliures) may be ok for
    1 machine with 1 database with 1 database cache with 1 database cache size
    known at boot time. It's definitely not feasible with a virtualization
    hypervisor usage like RHEV-H that runs an unknown number of virtual machines
    with an unknown size of each virtual machine with an unknown amount of
    pagecache that could be potentially useful in the host for guest not using
    O_DIRECT (aka cache=off).

    hugepages in the virtualization hypervisor (and also in the guest!) are
    much more important than in a regular host not using virtualization,
    becasue with NPT/EPT they decrease the tlb-miss cacheline accesses from 24
    to 19 in case only the hypervisor uses transparent hugepages, and they
    decrease the tlb-miss cacheline accesses from 19 to 15 in case both the
    linux hypervisor and the linux guest both uses this patch (though the
    guest will limit the addition speedup to anonymous regions only for
    now...). Even more important is that the tlb miss handler is much slower
    on a NPT/EPT guest than for a regular shadow paging or no-virtualization
    scenario. So maximizing the amount of virtual memory cached by the TLB
    pays off significantly more with NPT/EPT than without (even if there would
    be no significant speedup in the tlb-miss runtime).

    The first (and more tedious) part of this work requires allowing the VM to
    handle anonymous hugepages mixed with regular pages transparently on
    regular anonymous vmas. This is what this patch tries to achieve in the
    least intrusive possible way. We want hugepages and hugetlb to be used in
    a way so that all applications can benefit without changes (as usual we
    leverage the KVM virtualization design: by improving the Linux VM at
    large, KVM gets the performance boost too).

    The most important design choice is: always fallback to 4k allocation if
    the hugepage allocation fails! This is the _very_ opposite of some large
    pagecache patches that failed with -EIO back then if a 64k (or similar)
    allocation failed...

    Second important decision (to reduce the impact of the feature on the
    existing pagetable handling code) is that at any time we can split an
    hugepage into 512 regular pages and it has to be done with an operation
    that can't fail. This way the reliability of the swapping isn't decreased
    (no need to allocate memory when we are short on memory to swap) and it's
    trivial to plug a split_huge_page* one-liner where needed without
    polluting the VM. Over time we can teach mprotect, mremap and friends to
    handle pmd_trans_huge natively without calling split_huge_page*. The fact
    it can't fail isn't just for swap: if split_huge_page would return -ENOMEM
    (instead of the current void) we'd need to rollback the mprotect from the
    middle of it (ideally including undoing the split_vma) which would be a
    big change and in the very wrong direction (it'd likely be simpler not to
    call split_huge_page at all and to teach mprotect and friends to handle
    hugepages instead of rolling them back from the middle). In short the
    very value of split_huge_page is that it can't fail.

    The collapsing and madvise(MADV_HUGEPAGE) part will remain separated and
    incremental and it'll just be an "harmless" addition later if this initial
    part is agreed upon. It also should be noted that locking-wise replacing
    regular pages with hugepages is going to be very easy if compared to what
    I'm doing below in split_huge_page, as it will only happen when
    page_count(page) matches page_mapcount(page) if we can take the PG_lock
    and mmap_sem in write mode. collapse_huge_page will be a "best effort"
    that (unlike split_huge_page) can fail at the minimal sign of trouble and
    we can try again later. collapse_huge_page will be similar to how KSM
    works and the madvise(MADV_HUGEPAGE) will work similar to
    madvise(MADV_MERGEABLE).

    The default I like is that transparent hugepages are used at page fault
    time. This can be changed with
    /sys/kernel/mm/transparent_hugepage/enabled. The control knob can be set
    to three values "always", "madvise", "never" which mean respectively that
    hugepages are always used, or only inside madvise(MADV_HUGEPAGE) regions,
    or never used. /sys/kernel/mm/transparent_hugepage/defrag instead
    controls if the hugepage allocation should defrag memory aggressively
    "always", only inside "madvise" regions, or "never".

    The pmd_trans_splitting/pmd_trans_huge locking is very solid. The
    put_page (from get_user_page users that can't use mmu notifier like
    O_DIRECT) that runs against a __split_huge_page_refcount instead was a
    pain to serialize in a way that would result always in a coherent page
    count for both tail and head. I think my locking solution with a
    compound_lock taken only after the page_first is valid and is still a
    PageHead should be safe but it surely needs review from SMP race point of
    view. In short there is no current existing way to serialize the O_DIRECT
    final put_page against split_huge_page_refcount so I had to invent a new
    one (O_DIRECT loses knowledge on the mapping status by the time gup_fast
    returns so...). And I didn't want to impact all gup/gup_fast users for
    now, maybe if we change the gup interface substantially we can avoid this
    locking, I admit I didn't think too much about it because changing the gup
    unpinning interface would be invasive.

    If we ignored O_DIRECT we could stick to the existing compound refcounting
    code, by simply adding a get_user_pages_fast_flags(foll_flags) where KVM
    (and any other mmu notifier user) would call it without FOLL_GET (and if
    FOLL_GET isn't set we'd just BUG_ON if nobody registered itself in the
    current task mmu notifier list yet). But O_DIRECT is fundamental for
    decent performance of virtualized I/O on fast storage so we can't avoid it
    to solve the race of put_page against split_huge_page_refcount to achieve
    a complete hugepage feature for KVM.

    Swap and oom works fine (well just like with regular pages ;). MMU
    notifier is handled transparently too, with the exception of the young bit
    on the pmd, that didn't have a range check but I think KVM will be fine
    because the whole point of hugepages is that EPT/NPT will also use a huge
    pmd when they notice gup returns pages with PageCompound set, so they
    won't care of a range and there's just the pmd young bit to check in that
    case.

    NOTE: in some cases if the L2 cache is small, this may slowdown and waste
    memory during COWs because 4M of memory are accessed in a single fault
    instead of 8k (the payoff is that after COW the program can run faster).
    So we might want to switch the copy_huge_page (and clear_huge_page too) to
    not temporal stores. I also extensively researched ways to avoid this
    cache trashing with a full prefault logic that would cow in 8k/16k/32k/64k
    up to 1M (I can send those patches that fully implemented prefault) but I
    concluded they're not worth it and they add an huge additional complexity
    and they remove all tlb benefits until the full hugepage has been faulted
    in, to save a little bit of memory and some cache during app startup, but
    they still don't improve substantially the cache-trashing during startup
    if the prefault happens in >4k chunks. One reason is that those 4k pte
    entries copied are still mapped on a perfectly cache-colored hugepage, so
    the trashing is the worst one can generate in those copies (cow of 4k page
    copies aren't so well colored so they trashes less, but again this results
    in software running faster after the page fault). Those prefault patches
    allowed things like a pte where post-cow pages were local 4k regular anon
    pages and the not-yet-cowed pte entries were pointing in the middle of
    some hugepage mapped read-only. If it doesn't payoff substantially with
    todays hardware it will payoff even less in the future with larger l2
    caches, and the prefault logic would blot the VM a lot. If one is
    emebdded transparent_hugepage can be disabled during boot with sysfs or
    with the boot commandline parameter transparent_hugepage=0 (or
    transparent_hugepage=2 to restrict hugepages inside madvise regions) that
    will ensure not a single hugepage is allocated at boot time. It is simple
    enough to just disable transparent hugepage globally and let transparent
    hugepages be allocated selectively by applications in the MADV_HUGEPAGE
    region (both at page fault time, and if enabled with the
    collapse_huge_page too through the kernel daemon).

    This patch supports only hugepages mapped in the pmd, archs that have
    smaller hugepages will not fit in this patch alone. Also some archs like
    power have certain tlb limits that prevents mixing different page size in
    the same regions so they will not fit in this framework that requires
    "graceful fallback" to basic PAGE_SIZE in case of physical memory
    fragmentation. hugetlbfs remains a perfect fit for those because its
    software limits happen to match the hardware limits. hugetlbfs also
    remains a perfect fit for hugepage sizes like 1GByte that cannot be hoped
    to be found not fragmented after a certain system uptime and that would be
    very expensive to defragment with relocation, so requiring reservation.
    hugetlbfs is the "reservation way", the point of transparent hugepages is
    not to have any reservation at all and maximizing the use of cache and
    hugepages at all times automatically.

    Some performance result:

    vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largep
    ages3
    memset page fault 1566023
    memset tlb miss 453854
    memset second tlb miss 453321
    random access tlb miss 41635
    random access second tlb miss 41658
    vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
    memset page fault 1566471
    memset tlb miss 453375
    memset second tlb miss 453320
    random access tlb miss 41636
    random access second tlb miss 41637
    vmx andrea # ./largepages3
    memset page fault 1566642
    memset tlb miss 453417
    memset second tlb miss 453313
    random access tlb miss 41630
    random access second tlb miss 41647
    vmx andrea # ./largepages3
    memset page fault 1566872
    memset tlb miss 453418
    memset second tlb miss 453315
    random access tlb miss 41618
    random access second tlb miss 41659
    vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage
    vmx andrea # ./largepages3
    memset page fault 2182476
    memset tlb miss 460305
    memset second tlb miss 460179
    random access tlb miss 44483
    random access second tlb miss 44186
    vmx andrea # ./largepages3
    memset page fault 2182791
    memset tlb miss 460742
    memset second tlb miss 459962
    random access tlb miss 43981
    random access second tlb miss 43988

    ============
    #include
    #include
    #include
    #include

    #define SIZE (3UL*1024*1024*1024)

    int main()
    {
    char *p = malloc(SIZE), *p2;
    struct timeval before, after;

    gettimeofday(&before, NULL);
    memset(p, 0, SIZE);
    gettimeofday(&after, NULL);
    printf("memset page fault %Lu\n",
    (after.tv_sec-before.tv_sec)*1000000UL +
    after.tv_usec-before.tv_usec);

    gettimeofday(&before, NULL);
    memset(p, 0, SIZE);
    gettimeofday(&after, NULL);
    printf("memset tlb miss %Lu\n",
    (after.tv_sec-before.tv_sec)*1000000UL +
    after.tv_usec-before.tv_usec);

    gettimeofday(&before, NULL);
    memset(p, 0, SIZE);
    gettimeofday(&after, NULL);
    printf("memset second tlb miss %Lu\n",
    (after.tv_sec-before.tv_sec)*1000000UL +
    after.tv_usec-before.tv_usec);

    gettimeofday(&before, NULL);
    for (p2 = p; p2 < p+SIZE; p2 += 4096)
    *p2 = 0;
    gettimeofday(&after, NULL);
    printf("random access tlb miss %Lu\n",
    (after.tv_sec-before.tv_sec)*1000000UL +
    after.tv_usec-before.tv_usec);

    gettimeofday(&before, NULL);
    for (p2 = p; p2 < p+SIZE; p2 += 4096)
    *p2 = 0;
    gettimeofday(&after, NULL);
    printf("random access second tlb miss %Lu\n",
    (after.tv_sec-before.tv_sec)*1000000UL +
    after.tv_usec-before.tv_usec);

    return 0;
    }
    ============

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • After releasing the compound_lock split_huge_page can still run and release the
    page before put_page_testzero runs.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Alter compound get_page/put_page to keep references on subpages too, in
    order to allow __split_huge_page_refcount to split an hugepage even while
    subpages have been pinned by one of the get_user_pages() variants.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

28 Oct, 2010

1 commit

  • Replace iterated page_cache_release() with release_pages(), which is
    faster and shorter.

    Needs release_pages() to be exported to modules.

    Suggested-by: Andrew Morton
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     

25 May, 2010

1 commit


30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities include those
    headers directly instead of assuming availability. As this conversion
    needs to touch large number of source files, the following script is
    used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the followings.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and try to put the new include such that its order conforms
    to its surrounding. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    wildly available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build test were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable easily on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

07 Mar, 2010

1 commit

  • free_hot_page() is just a wrapper around free_hot_cold_page() with
    parameter 'cold = 0'. After adding a clear comment for
    free_hot_cold_page(), it is reasonable to remove a level of call.

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Li Hong
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Cc: Larry Woodman
    Cc: Peter Zijlstra
    Cc: Li Ming Chun
    Cc: KOSAKI Motohiro
    Cc: Americo Wang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Hong
     

22 Sep, 2009

3 commits

  • Sizing of memory allocations shouldn't depend on the number of physical
    pages found in a system, as that generally includes (perhaps a huge amount
    of) non-RAM pages. The amount of what actually is usable as storage
    should instead be used as a basis here.

    Some of the calculations (i.e. those not intending to use high memory)
    should likely even use (totalram_pages - totalhigh_pages).

    Signed-off-by: Jan Beulich
    Acked-by: Rusty Russell
    Acked-by: Ingo Molnar
    Cc: Dave Airlie
    Cc: Kyle McMartin
    Cc: Jeremy Fitzhardinge
    Cc: Pekka Enberg
    Cc: Hugh Dickins
    Cc: "David S. Miller"
    Cc: Patrick McHardy
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Beulich
     
  • page_is_file_cache() has been used for both boolean checks and LRU
    arithmetic, which was always a bit weird.

    Now that page_lru_base_type() exists for LRU arithmetic, make
    page_is_file_cache() a real predicate function and adjust the
    boolean-using callsites to drop those pesky double negations.

    Signed-off-by: Johannes Weiner
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Instead of abusing page_is_file_cache() for LRU list index arithmetic, add
    another helper with a more appropriate name and convert the non-boolean
    users of page_is_file_cache() accordingly.

    This new helper gives the LRU base type a page is supposed to live on,
    inactive anon or inactive file.

    [hugh.dickins@tiscali.co.uk: convert del_page_from_lru() also]
    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Cc: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

03 May, 2009

1 commit

  • The Committed_AS field can underflow in certain situations:

    > # while true; do cat /proc/meminfo | grep _AS; sleep 1; done | uniq -c
    > 1 Committed_AS: 18446744073709323392 kB
    > 11 Committed_AS: 18446744073709455488 kB
    > 6 Committed_AS: 35136 kB
    > 5 Committed_AS: 18446744073709454400 kB
    > 7 Committed_AS: 35904 kB
    > 3 Committed_AS: 18446744073709453248 kB
    > 2 Committed_AS: 34752 kB
    > 9 Committed_AS: 18446744073709453248 kB
    > 8 Committed_AS: 34752 kB
    > 3 Committed_AS: 18446744073709320960 kB
    > 7 Committed_AS: 18446744073709454080 kB
    > 3 Committed_AS: 18446744073709320960 kB
    > 5 Committed_AS: 18446744073709454080 kB
    > 6 Committed_AS: 18446744073709320960 kB

    Because NR_CPUS can be greater than 1000 and meminfo_proc_show() does
    not check for underflow.

    But NR_CPUS proportional isn't good calculation. In general,
    possibility of lock contention is proportional to the number of online
    cpus, not theorical maximum cpus (NR_CPUS).

    The current kernel has generic percpu-counter stuff. using it is right
    way. it makes code simplify and percpu_counter_read_positive() don't
    make underflow issue.

    Reported-by: Dave Hansen
    Signed-off-by: KOSAKI Motohiro
    Cc: Eric B Munson
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: [All kernel versions]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

03 Apr, 2009

1 commit

  • Recruit a page flag to aid in cache management. The following extra flag is
    defined:

    (1) PG_fscache (PG_private_2)

    The marked page is backed by a local cache and is pinning resources in the
    cache driver.

    If PG_fscache is set, then things that checked for PG_private will now also
    check for that. This includes things like truncation and page invalidation.
    The function page_has_private() had been added to make the checks for both
    PG_private and PG_private_2 at the same time.

    Signed-off-by: David Howells
    Acked-by: Steve Dickson
    Acked-by: Trond Myklebust
    Acked-by: Rik van Riel
    Acked-by: Al Viro
    Tested-by: Daire Byrne

    David Howells
     

01 Apr, 2009

1 commit


09 Jan, 2009

3 commits

  • Introduce mem_cgroup_per_zone::reclaim_stat member and its statics
    collecting function.

    Now, get_scan_ratio() can calculate correct value on memcg reclaim.

    [hugh@veritas.com: avoid reclaim_stat oops when disabled]
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Add zone_reclam_stat struct for later enhancement.

    A later patch uses this. This patch doesn't any behavior change (yet).

    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • A big patch for changing memcg's LRU semantics.

    Now,
    - page_cgroup is linked to mem_cgroup's its own LRU (per zone).

    - LRU of page_cgroup is not synchronous with global LRU.

    - page and page_cgroup is one-to-one and statically allocated.

    - To find page_cgroup is on what LRU, you have to check pc->mem_cgroup as
    - lru = page_cgroup_zoneinfo(pc, nid_of_pc, zid_of_pc);

    - SwapCache is handled.

    And, when we handle LRU list of page_cgroup, we do following.

    pc = lookup_page_cgroup(page);
    lock_page_cgroup(pc); .....................(1)
    mz = page_cgroup_zoneinfo(pc);
    spin_lock(&mz->lru_lock);
    .....add to LRU
    spin_unlock(&mz->lru_lock);
    unlock_page_cgroup(pc);

    But (1) is spin_lock and we have to be afraid of dead-lock with zone->lru_lock.
    So, trylock() is used at (1), now. Without (1), we can't trust "mz" is correct.

    This is a trial to remove this dirty nesting of locks.
    This patch changes mz->lru_lock to be zone->lru_lock.
    Then, above sequence will be written as

    spin_lock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU
    mem_cgroup_add/remove/etc_lru() {
    pc = lookup_page_cgroup(page);
    mz = page_cgroup_zoneinfo(pc);
    if (PageCgroupUsed(pc)) {
    ....add to LRU
    }
    spin_lock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU

    This is much simpler.
    (*) We're safe even if we don't take lock_page_cgroup(pc). Because..
    1. When pc->mem_cgroup can be modified.
    - at charge.
    - at account_move().
    2. at charge
    the PCG_USED bit is not set before pc->mem_cgroup is fixed.
    3. at account_move()
    the page is isolated and not on LRU.

    Pros.
    - easy for maintenance.
    - memcg can make use of laziness of pagevec.
    - we don't have to duplicated LRU/Active/Unevictable bit in page_cgroup.
    - LRU status of memcg will be synchronized with global LRU's one.
    - # of locks are reduced.
    - account_move() is simplified very much.
    Cons.
    - may increase cost of LRU rotation.
    (no impact if memcg is not configured.)

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

07 Jan, 2009

3 commits

  • remove_exclusive_swap_page(): its problem is in living up to its name.

    It doesn't matter if someone else has a reference to the page (raised
    page_count); it doesn't matter if the page is mapped into userspace
    (raised page_mapcount - though that hints it may be worth keeping the
    swap): all that matters is that there be no more references to the swap
    (and no writeback in progress).

    swapoff (try_to_unuse) has been removing pages from swapcache for years,
    with no concern for page count or page mapcount, and we used to have a
    comment in lookup_swap_cache() recognizing that: if you go for a page of
    swapcache, you'll get the right page, but it could have been removed from
    swapcache by the time you get page lock.

    So, give up asking for exclusivity: get rid of
    remove_exclusive_swap_page(), and remove_exclusive_swap_page_ref() and
    remove_exclusive_swap_page_count() which were spawned for the recent LRU
    work: replace them by the simpler try_to_free_swap() which just checks
    page_swapcount().

    Similarly, remove the page_count limitation from free_swap_and_count(),
    but assume that it's worth holding on to the swap if page is mapped and
    swap nowhere near full. Add a vm_swap_full() test in free_swap_cache()?
    It would be consistent, but I think we probably have enough for now.

    Signed-off-by: Hugh Dickins
    Cc: Lee Schermerhorn
    Cc: Rik van Riel
    Cc: Nick Piggin
    Cc: KAMEZAWA Hiroyuki
    Cc: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • lru_cache_add_active_or_unevictable() and page_add_new_anon_rmap() always
    appear together. Save some symbol table space and some jumping around by
    removing lru_cache_add_active_or_unevictable(), folding its code into
    page_add_new_anon_rmap(): like how we add file pages to lru just after
    adding them to page cache.

    Remove the nearby "TODO: is this safe?" comments (yes, it is safe), and
    change page_add_new_anon_rmap()'s address BUG_ON to VM_BUG_ON as
    originally intended.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Nick Piggin
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • speculative page references patch (commit:
    e286781d5f2e9c846e012a39653a166e9d31777d) removed last
    pagevec_release_nonlru() caller.

    So this function can be removed now.

    This patch doesn't have any functional change.

    Signed-off-by: KOSAKI Motohiro
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

11 Dec, 2008

1 commit

  • Currently, lru_add_drain_all() has two version.
    (1) use schedule_on_each_cpu()
    (2) don't use schedule_on_each_cpu()

    Gerald Schaefer reported it doesn't work well on SMP (not NUMA) S390
    machine.

    offline_pages() calls lru_add_drain_all() followed by drain_all_pages().
    While drain_all_pages() works on each cpu, lru_add_drain_all() only runs
    on the current cpu for architectures w/o CONFIG_NUMA. This let us run
    into the BUG_ON(!PageBuddy(page)) in __offline_isolated_pages() during
    memory hotplug stress test on s390. The page in question was still on the
    pcp list, because of a race with lru_add_drain_all() and drain_all_pages()
    on different cpus.

    Actually, Almost machine has CONFIG_UNEVICTABLE_LRU=y. Then almost machine use
    (1) version lru_add_drain_all although the machine is UP.

    Then this ifdef is not valueable.
    simple removing is better.

    Signed-off-by: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Acked-by: Gerald Schaefer
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

03 Dec, 2008

1 commit

  • Count the insertion of new pages in the statistics used to drive the
    pageout scanning code. This should help the kernel quickly evict
    streaming file IO.

    We count on the fact that new file pages start on the inactive file LRU
    and new anonymous pages start on the active anon list. This means
    streaming file IO will increment the recent scanned file statistic, while
    leaving the recent rotated file statistic alone, driving pageout scanning
    to the file LRUs.

    Pageout activity does its own list manipulation.

    Signed-off-by: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Tested-by: Gene Heskett
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

20 Oct, 2008

5 commits

  • In the fault paths that install new anonymous pages, check whether the
    page is evictable or not using lru_cache_add_active_or_unevictable(). If
    the page is evictable, just add it to the active lru list [via the pagevec
    cache], else add it to the unevictable list.

    This "proactive" culling in the fault path mimics the handling of mlocked
    pages in Nick Piggin's series to keep mlocked pages off the lru lists.

    Notes:

    1) This patch is optional--e.g., if one is concerned about the
    additional test in the fault path. We can defer the moving of
    nonreclaimable pages until when vmscan [shrink_*_list()]
    encounters them. Vmscan will only need to handle such pages
    once, but if there are a lot of them it could impact system
    performance.

    2) The 'vma' argument to page_evictable() is require to notice that
    we're faulting a page into an mlock()ed vma w/o having to scan the
    page's rmap in the fault path. Culling mlock()ed anon pages is
    currently the only reason for this patch.

    3) We can't cull swap pages in read_swap_cache_async() because the
    vma argument doesn't necessarily correspond to the swap cache
    offset passed in by swapin_readahead(). This could [did!] result
    in mlocking pages in non-VM_LOCKED vmas if [when] we tried to
    cull in this path.

    4) Move set_pte_at() to after where we add page to lru to keep it
    hidden from other tasks that might walk the page table.
    We already do it in this order in do_anonymous() page. And,
    these are COW'd anon pages. Is this safe?

    [riel@redhat.com: undo an overzealous code cleanup]
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Make sure that mlocked pages also live on the unevictable LRU, so kswapd
    will not scan them over and over again.

    This is achieved through various strategies:

    1) add yet another page flag--PG_mlocked--to indicate that
    the page is locked for efficient testing in vmscan and,
    optionally, fault path. This allows early culling of
    unevictable pages, preventing them from getting to
    page_referenced()/try_to_unmap(). Also allows separate
    accounting of mlock'd pages, as Nick's original patch
    did.

    Note: Nick's original mlock patch used a PG_mlocked
    flag. I had removed this in favor of the PG_unevictable
    flag + an mlock_count [new page struct member]. I
    restored the PG_mlocked flag to eliminate the new
    count field.

    2) add the mlock/unevictable infrastructure to mm/mlock.c,
    with internal APIs in mm/internal.h. This is a rework
    of Nick's original patch to these files, taking into
    account that mlocked pages are now kept on unevictable
    LRU list.

    3) update vmscan.c:page_evictable() to check PageMlocked()
    and, if vma passed in, the vm_flags. Note that the vma
    will only be passed in for new pages in the fault path;
    and then only if the "cull unevictable pages in fault
    path" patch is included.

    4) add try_to_unlock() to rmap.c to walk a page's rmap and
    ClearPageMlocked() if no other vmas have it mlocked.
    Reuses as much of try_to_unmap() as possible. This
    effectively replaces the use of one of the lru list links
    as an mlock count. If this mechanism let's pages in mlocked
    vmas leak through w/o PG_mlocked set [I don't know that it
    does], we should catch them later in try_to_unmap(). One
    hopes this will be rare, as it will be relatively expensive.

    Original mm/internal.h, mm/rmap.c and mm/mlock.c changes:
    Signed-off-by: Nick Piggin

    splitlru: introduce __get_user_pages():

    New munlock processing need to GUP_FLAGS_IGNORE_VMA_PERMISSIONS.
    because current get_user_pages() can't grab PROT_NONE pages theresore it
    cause PROT_NONE pages can't munlock.

    [akpm@linux-foundation.org: fix this for pagemap-pass-mm-into-pagewalkers.patch]
    [akpm@linux-foundation.org: untangle patch interdependencies]
    [akpm@linux-foundation.org: fix things after out-of-order merging]
    [hugh@veritas.com: fix page-flags mess]
    [lee.schermerhorn@hp.com: fix munlock page table walk - now requires 'mm']
    [kosaki.motohiro@jp.fujitsu.com: build fix]
    [kosaki.motohiro@jp.fujitsu.com: fix truncate race and sevaral comments]
    [kosaki.motohiro@jp.fujitsu.com: splitlru: introduce __get_user_pages()]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Rik van Riel
    Signed-off-by: Lee Schermerhorn
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Matt Mackall
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • When the system contains lots of mlocked or otherwise unevictable pages,
    the pageout code (kswapd) can spend lots of time scanning over these
    pages. Worse still, the presence of lots of unevictable pages can confuse
    kswapd into thinking that more aggressive pageout modes are required,
    resulting in all kinds of bad behaviour.

    Infrastructure to manage pages excluded from reclaim--i.e., hidden from
    vmscan. Based on a patch by Larry Woodman of Red Hat. Reworked to
    maintain "unevictable" pages on a separate per-zone LRU list, to "hide"
    them from vmscan.

    Kosaki Motohiro added the support for the memory controller unevictable
    lru list.

    Pages on the unevictable list have both PG_unevictable and PG_lru set.
    Thus, PG_unevictable is analogous to and mutually exclusive with
    PG_active--it specifies which LRU list the page is on.

    The unevictable infrastructure is enabled by a new mm Kconfig option
    [CONFIG_]UNEVICTABLE_LRU.

    A new function 'page_evictable(page, vma)' in vmscan.c tests whether or
    not a page may be evictable. Subsequent patches will add the various
    !evictable tests. We'll want to keep these tests light-weight for use in
    shrink_active_list() and, possibly, the fault path.

    To avoid races between tasks putting pages [back] onto an LRU list and
    tasks that might be moving the page from non-evictable to evictable state,
    the new function 'putback_lru_page()' -- inverse to 'isolate_lru_page()'
    -- tests the "evictability" of a page after placing it on the LRU, before
    dropping the reference. If the page has become unevictable,
    putback_lru_page() will redo the 'putback', thus moving the page to the
    unevictable list. This way, we avoid "stranding" evictable pages on the
    unevictable list.

    [akpm@linux-foundation.org: fix fallout from out-of-order merge]
    [riel@redhat.com: fix UNEVICTABLE_LRU and !PROC_PAGE_MONITOR build]
    [nishimura@mxp.nes.nec.co.jp: remove redundant mapping check]
    [kosaki.motohiro@jp.fujitsu.com: unevictable-lru-infrastructure: putback_lru_page()/unevictable page handling rework]
    [kosaki.motohiro@jp.fujitsu.com: kill unnecessary lock_page() in vmscan.c]
    [kosaki.motohiro@jp.fujitsu.com: revert migration change of unevictable lru infrastructure]
    [kosaki.motohiro@jp.fujitsu.com: revert to unevictable-lru-infrastructure-kconfig-fix.patch]
    [kosaki.motohiro@jp.fujitsu.com: restore patch failure of vmstat-unevictable-and-mlocked-pages-vm-events.patch]
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Debugged-by: Benjamin Kidwell
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Split the LRU lists in two, one set for pages that are backed by real file
    systems ("file") and one for pages that are backed by memory and swap
    ("anon"). The latter includes tmpfs.

    The advantage of doing this is that the VM will not have to scan over lots
    of anonymous pages (which we generally do not want to swap out), just to
    find the page cache pages that it should evict.

    This patch has the infrastructure and a basic policy to balance how much
    we scan the anon lists and how much we scan the file lists. The big
    policy changes are in separate patches.

    [lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset]
    [kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru]
    [kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page]
    [hugh@veritas.com: memcg swapbacked pages active]
    [hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED]
    [akpm@linux-foundation.org: fix /proc/vmstat units]
    [nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration]
    [kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo]
    [kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()]
    Signed-off-by: Rik van Riel
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Hugh Dickins
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • If vm_swap_full() (swap space more than 50% full), the system will free
    swap space at swapin time. With this patch, the system will also free the
    swap space in the pageout code, when we decide that the page is not a
    candidate for swapout (and just wasting swap space).

    Signed-off-by: Rik van Riel
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: MinChan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel