28 May, 2010

1 commit

  • FILE_MAPPED per memcg of migrated file cache is not properly updated,
    because our hook in page_add_file_rmap() can't know to which memcg
    FILE_MAPPED should be counted.

    Basically, this patch is for fixing the bug but includes some big changes
    to fix up other messes.

    Now, at migrating mapped file, events happen in following sequence.

    1. allocate a new page.
    2. get memcg of an old page.
    3. charge ageinst a new page before migration. But at this point,
    no changes to new page's page_cgroup, no commit for the charge.
    (IOW, PCG_USED bit is not set.)
    4. page migration replaces radix-tree, old-page and new-page.
    5. page migration remaps the new page if the old page was mapped.
    6. Here, the new page is unlocked.
    7. memcg commits the charge for newpage, Mark the new page's page_cgroup
    as PCG_USED.

    Because "commit" happens after page-remap, we can count FILE_MAPPED
    at "5", because we should avoid to trust page_cgroup->mem_cgroup.
    if PCG_USED bit is unset.
    (Note: memcg's LRU removal code does that but LRU-isolation logic is used
    for helping it. When we overwrite page_cgroup->mem_cgroup, page_cgroup is
    not on LRU or page_cgroup->mem_cgroup is NULL.)

    We can lose file_mapped accounting information at 5 because FILE_MAPPED
    is updated only when mapcount changes 0->1. So we should catch it.

    BTW, historically, above implemntation comes from migration-failure
    of anonymous page. Because we charge both of old page and new page
    with mapcount=0, we can't catch
    - the page is really freed before remap.
    - migration fails but it's freed before remap
    or .....corner cases.

    New migration sequence with memcg is:

    1. allocate a new page.
    2. mark PageCgroupMigration to the old page.
    3. charge against a new page onto the old page's memcg. (here, new page's pc
    is marked as PageCgroupUsed.)
    4. page migration replaces radix-tree, page table, etc...
    5. At remapping, new page's page_cgroup is now makrked as "USED"
    We can catch 0->1 event and FILE_MAPPED will be properly updated.

    And we can catch SWAPOUT event after unlock this and freeing this
    page by unmap() can be caught.

    7. Clear PageCgroupMigration of the old page.

    So, FILE_MAPPED will be correctly updated.

    Then, for what MIGRATION flag is ?
    Without it, at migration failure, we may have to charge old page again
    because it may be fully unmapped. "charge" means that we have to dive into
    memory reclaim or something complated. So, it's better to avoid
    charge it again. Before this patch, __commit_charge() was working for
    both of the old/new page and fixed up all. But this technique has some
    racy condtion around FILE_MAPPED and SWAPOUT etc...
    Now, the kernel use MIGRATION flag and don't uncharge old page until
    the end of migration.

    I hope this change will make memcg's page migration much simpler. This
    page migration has caused several troubles. Worth to add a flag for
    simplification.

    Reviewed-by: Daisuke Nishimura
    Tested-by: Daisuke Nishimura
    Reported-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Christoph Lameter
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    akpm@linux-foundation.org
     

25 May, 2010

6 commits

  • This patch is the core of a mechanism which compacts memory in a zone by
    relocating movable pages towards the end of the zone.

    A single compaction run involves a migration scanner and a free scanner.
    Both scanners operate on pageblock-sized areas in the zone. The migration
    scanner starts at the bottom of the zone and searches for all movable
    pages within each area, isolating them onto a private list called
    migratelist. The free scanner starts at the top of the zone and searches
    for suitable areas and consumes the free pages within making them
    available for the migration scanner. The pages isolated for migration are
    then migrated to the newly isolated free pages.

    [aarcange@redhat.com: Fix unsafe optimisation]
    [mel@csn.ul.ie: do not schedule work on other CPUs for compaction]
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • PageAnon pages that are unmapped may or may not have an anon_vma so are
    not currently migrated. However, a swap cache page can be migrated and
    fits this description. This patch identifies page swap caches and allows
    them to be migrated but ensures that no attempt to made to remap the pages
    would would potentially try to access an already freed anon_vma.

    Signed-off-by: Mel Gorman
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Cc: Rik van Riel
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • rmap_walk_anon() was triggering errors in memory compaction that look like
    use-after-free errors. The problem is that between the page being
    isolated from the LRU and rcu_read_lock() being taken, the mapcount of the
    page dropped to 0 and the anon_vma gets freed. This can happen during
    memory compaction if pages being migrated belong to a process that exits
    before migration completes. Hence, the use-after-free race looks like

    1. Page isolated for migration
    2. Process exits
    3. page_mapcount(page) drops to zero so anon_vma was no longer reliable
    4. unmap_and_move() takes the rcu_lock but the anon_vma is already garbage
    4. call try_to_unmap, looks up tha anon_vma and "locks" it but the lock
    is garbage.

    This patch checks the mapcount after the rcu lock is taken. If the
    mapcount is zero, the anon_vma is assumed to be freed and no further
    action is taken.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • For clarity of review, KSM and page migration have separate refcounts on
    the anon_vma. While clear, this is a waste of memory. This patch gets
    KSM and page migration to share their toys in a spirit of harmony.

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Christoph Lameter
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This patchset is a memory compaction mechanism that reduces external
    fragmentation memory by moving GFP_MOVABLE pages to a fewer number of
    pageblocks. The term "compaction" was chosen as there are is a number of
    mechanisms that are not mutually exclusive that can be used to defragment
    memory. For example, lumpy reclaim is a form of defragmentation as was
    slub "defragmentation" (really a form of targeted reclaim). Hence, this
    is called "compaction" to distinguish it from other forms of
    defragmentation.

    In this implementation, a full compaction run involves two scanners
    operating within a zone - a migration and a free scanner. The migration
    scanner starts at the beginning of a zone and finds all movable pages
    within one pageblock_nr_pages-sized area and isolates them on a
    migratepages list. The free scanner begins at the end of the zone and
    searches on a per-area basis for enough free pages to migrate all the
    pages on the migratepages list. As each area is respectively migrated or
    exhausted of free pages, the scanners are advanced one area. A compaction
    run completes within a zone when the two scanners meet.

    This method is a bit primitive but is easy to understand and greater
    sophistication would require maintenance of counters on a per-pageblock
    basis. This would have a big impact on allocator fast-paths to improve
    compaction which is a poor trade-off.

    It also does not try relocate virtually contiguous pages to be physically
    contiguous. However, assuming transparent hugepages were in use, a
    hypothetical khugepaged might reuse compaction code to isolate free pages,
    split them and relocate userspace pages for promotion.

    Memory compaction can be triggered in one of three ways. It may be
    triggered explicitly by writing any value to /proc/sys/vm/compact_memory
    and compacting all of memory. It can be triggered on a per-node basis by
    writing any value to /sys/devices/system/node/nodeN/compact where N is the
    node ID to be compacted. When a process fails to allocate a high-order
    page, it may compact memory in an attempt to satisfy the allocation
    instead of entering direct reclaim. Explicit compaction does not finish
    until the two scanners meet and direct compaction ends if a suitable page
    becomes available that would meet watermarks.

    The series is in 14 patches. The first three are not "core" to the series
    but are important pre-requisites.

    Patch 1 reference counts anon_vma for rmap_walk_anon(). Without this
    patch, it's possible to use anon_vma after free if the caller is
    not holding a VMA or mmap_sem for the pages in question. While
    there should be no existing user that causes this problem,
    it's a requirement for memory compaction to be stable. The patch
    is at the start of the series for bisection reasons.
    Patch 2 merges the KSM and migrate counts. It could be merged with patch 1
    but would be slightly harder to review.
    Patch 3 skips over unmapped anon pages during migration as there are no
    guarantees about the anon_vma existing. There is a window between
    when a page was isolated and migration started during which anon_vma
    could disappear.
    Patch 4 notes that PageSwapCache pages can still be migrated even if they
    are unmapped.
    Patch 5 allows CONFIG_MIGRATION to be set without CONFIG_NUMA
    Patch 6 exports a "unusable free space index" via debugfs. It's
    a measure of external fragmentation that takes the size of the
    allocation request into account. It can also be calculated from
    userspace so can be dropped if requested
    Patch 7 exports a "fragmentation index" which only has meaning when an
    allocation request fails. It determines if an allocation failure
    would be due to a lack of memory or external fragmentation.
    Patch 8 moves the definition for LRU isolation modes for use by compaction
    Patch 9 is the compaction mechanism although it's unreachable at this point
    Patch 10 adds a means of compacting all of memory with a proc trgger
    Patch 11 adds a means of compacting a specific node with a sysfs trigger
    Patch 12 adds "direct compaction" before "direct reclaim" if it is
    determined there is a good chance of success.
    Patch 13 adds a sysctl that allows tuning of the threshold at which the
    kernel will compact or direct reclaim
    Patch 14 temporarily disables compaction if an allocation failure occurs
    after compaction.

    Testing of compaction was in three stages. For the test, debugging,
    preempt, the sleep watchdog and lockdep were all enabled but nothing nasty
    popped out. min_free_kbytes was tuned as recommended by hugeadm to help
    fragmentation avoidance and high-order allocations. It was tested on X86,
    X86-64 and PPC64.

    Ths first test represents one of the easiest cases that can be faced for
    lumpy reclaim or memory compaction.

    1. Machine freshly booted and configured for hugepage usage with
    a) hugeadm --create-global-mounts
    b) hugeadm --pool-pages-max DEFAULT:8G
    c) hugeadm --set-recommended-min_free_kbytes
    d) hugeadm --set-recommended-shmmax

    The min_free_kbytes here is important. Anti-fragmentation works best
    when pageblocks don't mix. hugeadm knows how to calculate a value that
    will significantly reduce the worst of external-fragmentation-related
    events as reported by the mm_page_alloc_extfrag tracepoint.

    2. Load up memory
    a) Start updatedb
    b) Create in parallel a X files of pagesize*128 in size. Wait
    until files are created. By parallel, I mean that 4096 instances
    of dd were launched, one after the other using &. The crude
    objective being to mix filesystem metadata allocations with
    the buffer cache.
    c) Delete every second file so that pageblocks are likely to
    have holes
    d) kill updatedb if it's still running

    At this point, the system is quiet, memory is full but it's full with
    clean filesystem metadata and clean buffer cache that is unmapped.
    This is readily migrated or discarded so you'd expect lumpy reclaim
    to have no significant advantage over compaction but this is at
    the POC stage.

    3. In increments, attempt to allocate 5% of memory as hugepages.
    Measure how long it took, how successful it was, how many
    direct reclaims took place and how how many compactions. Note
    the compaction figures might not fully add up as compactions
    can take place for orders other than the hugepage size

    X86 vanilla compaction
    Final page count 913 916 (attempted 1002)
    pages reclaimed 68296 9791

    X86-64 vanilla compaction
    Final page count: 901 902 (attempted 1002)
    Total pages reclaimed: 112599 53234

    PPC64 vanilla compaction
    Final page count: 93 94 (attempted 110)
    Total pages reclaimed: 103216 61838

    There was not a dramatic improvement in success rates but it wouldn't be
    expected in this case either. What was important is that fewer pages were
    reclaimed in all cases reducing the amount of IO required to satisfy a
    huge page allocation.

    The second tests were all performance related - kernbench, netperf, iozone
    and sysbench. None showed anything too remarkable.

    The last test was a high-order allocation stress test. Many kernel
    compiles are started to fill memory with a pressured mix of unmovable and
    movable allocations. During this, an attempt is made to allocate 90% of
    memory as huge pages - one at a time with small delays between attempts to
    avoid flooding the IO queue.

    vanilla compaction
    Percentage of request allocated X86 98 99
    Percentage of request allocated X86-64 95 98
    Percentage of request allocated PPC64 55 70

    This patch:

    rmap_walk_anon() does not use page_lock_anon_vma() for looking up and
    locking an anon_vma and it does not appear to have sufficient locking to
    ensure the anon_vma does not disappear from under it.

    This patch copies an approach used by KSM to take a reference on the
    anon_vma while pages are being migrated. This should prevent rmap_walk()
    running into nasty surprises later because anon_vma has been freed.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • putback_lru_page() never can fail. So it doesn't matter count of "the
    number of pages put back".

    In addition, users of this functions don't use return value.

    Let's remove unnecessary code.

    Signed-off-by: Minchan Kim
    Reviewed-by: Rik van Riel
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities include those
    headers directly instead of assuming availability. As this conversion
    needs to touch large number of source files, the following script is
    used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the followings.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and try to put the new include such that its order conforms
    to its surrounding. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    wildly available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build test were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable easily on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

07 Mar, 2010

1 commit


02 Mar, 2010

1 commit

  • * 'for-linus' of master.kernel.org:/home/rmk/linux-2.6-arm: (100 commits)
    ARM: Eliminate decompressor -Dstatic= PIC hack
    ARM: 5958/1: ARM: U300: fix inverted clk round rate
    ARM: 5956/1: misplaced parentheses
    ARM: 5955/1: ep93xx: move timer defines into core.c and document
    ARM: 5954/1: ep93xx: move gpio interrupt support to gpio.c
    ARM: 5953/1: ep93xx: fix broken build of clock.c
    ARM: 5952/1: ARM: MM: Add ARM_L1_CACHE_SHIFT_6 for handle inside each ARCH Kconfig
    ARM: 5949/1: NUC900 add gpio virtual memory map
    ARM: 5948/1: Enable timer0 to time4 clock support for nuc910
    ARM: 5940/2: ARM: MMCI: remove custom DBG macro and printk
    ARM: make_coherent(): fix problems with highpte, part 2
    MM: Pass a PTE pointer to update_mmu_cache() rather than the PTE itself
    ARM: 5945/1: ep93xx: include correct irq.h in core.c
    ARM: 5933/1: amba-pl011: support hardware flow control
    ARM: 5930/1: Add PKMAP area description to memory.txt.
    ARM: 5929/1: Add checks to detect overlap of memory regions.
    ARM: 5928/1: Change type of VMALLOC_END to unsigned long.
    ARM: 5927/1: Make delimiters of DMA area globally visibly.
    ARM: 5926/1: Add "Virtual kernel memory..." printout.
    ARM: 5920/1: OMAP4: Enable L2 Cache
    ...

    Fix up trivial conflict in arch/arm/mach-mx25/clock.c

    Linus Torvalds
     

22 Feb, 2010

1 commit

  • x86-32 has had a static test for copy_on_user() overflow for a while.
    This test currently fails in mm/migrate.c resulting in an
    allyesconfig/allmodconfig build failure on x86-32:

    In function ‘copy_from_user’,
    inlined from ‘do_pages_stat’ at
    /home/hpa/kernel/git/mm/migrate.c:1012:
    /home/hpa/kernel/git/arch/x86/include/asm/uaccess_32.h:212: error:
    call to ‘copy_from_user_overflow’ declared

    Make the logic more explicit and therefore easier for gcc to
    understand.

    v2: rewrite the loop entirely using a more normal structure for a
    chunked-data loop (Linus Torvalds)

    Reported-by: Len Brown
    Signed-off-by: H. Peter Anvin
    Reviewed-and-Tested-by: KOSAKI Motohiro
    Cc: Arjan van de Ven
    Cc: Andrew Morton
    Cc: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Signed-off-by: Linus Torvalds

    H. Peter Anvin
     

21 Feb, 2010

1 commit

  • On VIVT ARM, when we have multiple shared mappings of the same file
    in the same MM, we need to ensure that we have coherency across all
    copies. We do this via make_coherent() by making the pages
    uncacheable.

    This used to work fine, until we allowed highmem with highpte - we
    now have a page table which is mapped as required, and is not available
    for modification via update_mmu_cache().

    Ralf Beache suggested getting rid of the PTE value passed to
    update_mmu_cache():

    On MIPS update_mmu_cache() calls __update_tlb() which walks pagetables
    to construct a pointer to the pte again. Passing a pte_t * is much
    more elegant. Maybe we might even replace the pte argument with the
    pte_t?

    Ben Herrenschmidt would also like the pte pointer for PowerPC:

    Passing the ptep in there is exactly what I want. I want that
    -instead- of the PTE value, because I have issue on some ppc cases,
    for I$/D$ coherency, where set_pte_at() may decide to mask out the
    _PAGE_EXEC.

    So, pass in the mapped page table pointer into update_mmu_cache(), and
    remove the PTE value, updating all implementations and call sites to
    suit.

    Includes a fix from Stephen Rothwell:

    sparc: fix fallout from update_mmu_cache API change

    Signed-off-by: Stephen Rothwell

    Acked-by: Benjamin Herrenschmidt
    Signed-off-by: Russell King

    Russell King
     

07 Feb, 2010

1 commit

  • We incorrectly depended on the 'node_state/node_isset()' functions
    testing the node range, rather than checking it explicitly. That's not
    reliable, even if it might often happen to work. So do the proper
    explicit test.

    Reported-by: Marcus Meissner
    Acked-and-tested-by: Brice Goglin
    Acked-by: Hugh Dickins
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

16 Dec, 2009

5 commits

  • unevictable_migrate_page() in mm/internal.h is a relic of the since
    removed UNEVICTABLE_LRU Kconfig option. This patch removes the function
    and open codes the test in migrate_page_copy().

    Signed-off-by: Lee Schermerhorn
    Reviewed-by: Christoph Lameter
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • The previous patch enables page migration of ksm pages, but that soon gets
    into trouble: not surprising, since we're using the ksm page lock to lock
    operations on its stable_node, but page migration switches the page whose
    lock is to be used for that. Another layer of locking would fix it, but
    do we need that yet?

    Do we actually need page migration of ksm pages? Yes, memory hotremove
    needs to offline sections of memory: and since we stopped allocating ksm
    pages with GFP_HIGHUSER, they will tend to be GFP_HIGHUSER_MOVABLE
    candidates for migration.

    But KSM is currently unconscious of NUMA issues, happily merging pages
    from different NUMA nodes: at present the rule must be, not to use
    MADV_MERGEABLE where you care about NUMA. So no, NUMA page migration of
    ksm pages does not make sense yet.

    So, to complete support for ksm swapping we need to make hotremove safe.
    ksm_memory_callback() take ksm_thread_mutex when MEM_GOING_OFFLINE and
    release it when MEM_OFFLINE or MEM_CANCEL_OFFLINE. But if mapped pages
    are freed before migration reaches them, stable_nodes may be left still
    pointing to struct pages which have been removed from the system: the
    stable_node needs to identify a page by pfn rather than page pointer, then
    it can safely prune them when MEM_OFFLINE.

    And make NUMA migration skip PageKsm pages where it skips PageReserved.
    But it's only when we reach unmap_and_move() that the page lock is taken
    and we can be sure that raised pagecount has prevented a PageAnon from
    being upgraded: so add offlining arg to migrate_pages(), to migrate ksm
    page when offlining (has sufficient locking) but reject it otherwise.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • A side-effect of making ksm pages swappable is that they have to be placed
    on the LRUs: which then exposes them to isolate_lru_page() and hence to
    page migration.

    Add rmap_walk() for remove_migration_ptes() to use: rmap_walk_anon() and
    rmap_walk_file() in rmap.c, but rmap_walk_ksm() in ksm.c. Perhaps some
    consolidation with existing code is possible, but don't attempt that yet
    (try_to_unmap needs to handle nonlinears, but migration pte removal does
    not).

    rmap_walk() is sadly less general than it appears: rmap_walk_anon(), like
    remove_anon_migration_ptes() which it replaces, avoids calling
    page_lock_anon_vma(), because that includes a page_mapped() test which
    fails when all migration ptes are in place. That was valid when NUMA page
    migration was introduced (holding mmap_sem provided the missing guarantee
    that anon_vma's slab had not already been destroyed), but I believe not
    valid in the memory hotremove case added since.

    For now do the same as before, and consider the best way to fix that
    unlikely race later on. When fixed, we can probably use rmap_walk() on
    hwpoisoned ksm pages too: for now, they remain among hwpoison's various
    exceptions (its PageKsm test comes before the page is locked, but its
    page_lock_anon_vma fails safely if an anon gets upgraded).

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • At present we define PageAnon(page) by the low PAGE_MAPPING_ANON bit set
    in page->mapping, with the higher bits a pointer to the anon_vma; and have
    defined PageKsm(page) as that with NULL anon_vma.

    But KSM swapping will need to store a pointer there: so in preparation for
    that, now define PAGE_MAPPING_FLAGS as the low two bits, including
    PAGE_MAPPING_KSM (always set along with PAGE_MAPPING_ANON, until some
    other use for the bit emerges).

    Declare page_rmapping(page) to return the pointer part of page->mapping,
    and page_anon_vma(page) to return the anon_vma pointer when that's what it
    is. Use these in a few appropriate places: notably, unuse_vma() has been
    testing page->mapping, but is better to be testing page_anon_vma() (cases
    may be added in which flag bits are set without any pointer).

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Nick Piggin
    Cc: KOSAKI Motohiro
    Reviewed-by: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Wu Fengguang
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Christoph pointed out inc_zone_page_state(NR_ISOLATED) should be placed
    in right after isolate_page().

    This patch does it.

    Reviewed-by: Christoph Lameter
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

12 Dec, 2009

1 commit

  • Slightly adjust the logic for determining the size of the
    copy_form_user() in do_pages_stat(); with this change, gcc can see
    that the copying is safe.

    Without this, we get a build error for i386 allyesconfig:

    /home/hpa/kernel/linux-2.6-tip.urgent/arch/x86/include/asm/uaccess_32.h:213:
    error: call to ‘copy_from_user_overflow’ declared with attribute
    error: copy_from_user() buffer size is not provably correct

    Unlike an earlier patch from Arjan, this doesn't introduce new
    variables; merely reshuffles the compare so that gcc can see that an
    overflow cannot happen.

    Signed-off-by: H. Peter Anvin
    Cc: Brice Goglin
    Cc: Arjan van de Ven
    Cc: Andrew Morton
    Cc: KOSAKI Motohiro
    LKML-Reference:

    H. Peter Anvin
     

12 Nov, 2009

1 commit

  • Lee Schermerhorn reported that he saw bad pointer dereference in
    mem_cgroup_end_migration() when he disabled memcg by boot option.

    memcg's page migration logic works as

    mem_cgroup_prepare_migration(page, &ptr);
    do page migration
    mem_cgroup_end_migration(page, ptr);

    Now, ptr is not initialized in prepare_migration when memcg is disabled
    by boot option. This causes panic in end_migration. This patch fixes it.

    Reported-by: Lee Schermerhorn
    Cc: Balbir Singh
    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Daisuke Nishimura
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

24 Sep, 2009

1 commit

  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
    HWPOISON: Enable error_remove_page on btrfs
    HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
    HWPOISON: Add madvise() based injector for hardware poisoned pages v4
    HWPOISON: Enable error_remove_page for NFS
    HWPOISON: Enable .remove_error_page for migration aware file systems
    HWPOISON: The high level memory error handler in the VM v7
    HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
    HWPOISON: shmem: call set_page_dirty() with locked page
    HWPOISON: Define a new error_remove_page address space op for async truncation
    HWPOISON: Add invalidate_inode_page
    HWPOISON: Refactor truncate to allow direct truncating of page v2
    HWPOISON: check and isolate corrupted free pages v2
    HWPOISON: Handle hardware poisoned pages in try_to_unmap
    HWPOISON: Use bitmask/action code for try_to_unmap behaviour
    HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
    HWPOISON: Add poison check to page fault handling
    HWPOISON: Add basic support for poisoned pages in fault handler v3
    HWPOISON: Add new SIGBUS error codes for hardware poison signals
    HWPOISON: Add support for poison swap entries v2
    HWPOISON: Export some rmap vma locking to outside world
    ...

    Linus Torvalds
     

22 Sep, 2009

5 commits

  • Make page_has_private() return a true boolean value and remove the double
    negations from the two callsites using it for arithmetic.

    Signed-off-by: Johannes Weiner
    Cc: Christoph Lameter
    Reviewed-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • page_is_file_cache() has been used for both boolean checks and LRU
    arithmetic, which was always a bit weird.

    Now that page_lru_base_type() exists for LRU arithmetic, make
    page_is_file_cache() a real predicate function and adjust the
    boolean-using callsites to drop those pesky double negations.

    Signed-off-by: Johannes Weiner
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • If the system is running a heavy load of processes then concurrent reclaim
    can isolate a large number of pages from the LRU. /proc/vmstat and the
    output generated for an OOM do not show how many pages were isolated.

    This has been observed during process fork bomb testing (mstctl11 in LTP).

    This patch shows the information about isolated pages.

    Reproduced via:

    -----------------------
    % ./hackbench 140 process 1000
    => OOM occur

    active_anon:146 inactive_anon:0 isolated_anon:49245
    active_file:79 inactive_file:18 isolated_file:113
    unevictable:0 dirty:0 writeback:0 unstable:0 buffer:39
    free:370 slab_reclaimable:309 slab_unreclaimable:5492
    mapped:53 shmem:15 pagetables:28140 bounce:0

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Acked-by: Wu Fengguang
    Reviewed-by: Minchan Kim
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Recently we encountered OOM problems due to memory use of the GEM cache.
    Generally a large amuont of Shmem/Tmpfs pages tend to create a memory
    shortage problem.

    We often use the following calculation to determine the amount of shmem
    pages:

    shmem = NR_ACTIVE_ANON + NR_INACTIVE_ANON - NR_ANON_PAGES

    however the expression does not consider isolated and mlocked pages.

    This patch adds explicit accounting for pages used by shmem and tmpfs.

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Reviewed-by: Christoph Lameter
    Acked-by: Wu Fengguang
    Cc: David Rientjes
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • In test, some pages in swap-cache can't be migrated, as they aren't rmap.

    unmap_and_move() ignores swap-cache page which is just read in and hasn't
    rmap (see the comments in the code), but swap_aops provides .migratepage.
    Better to migrate such pages instead of ignore them.

    Signed-off-by: Shaohua Li
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Yakui Zhao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     

16 Sep, 2009

1 commit

  • try_to_unmap currently has multiple modi (migration, munlock, normal unmap)
    which are selected by magic flag variables. The logic is not very straight
    forward, because each of these flag change multiple behaviours (e.g.
    migration turns off aging, not only sets up migration ptes etc.)
    Also the different flags interact in magic ways.

    A later patch in this series adds another mode to try_to_unmap, so
    this becomes quickly unmanageable.

    Replace the different flags with a action code (migration, munlock, munmap)
    and some additional flags as modifiers (ignore mlock, ignore aging).
    This makes the logic more straight forward and allows easier extension
    to new behaviours. Change all the caller to declare what they want to
    do.

    This patch is supposed to be a nop in behaviour. If anyone can prove
    it is not that would be a bug.

    Cc: Lee.Schermerhorn@hp.com
    Cc: npiggin@suse.de

    Signed-off-by: Andi Kleen

    Andi Kleen
     

17 Jun, 2009

2 commits

  • migrate_prep() is fairly expensive (72us on 16-core barcelona 1.9GHz).
    Commit 3140a2273009c01c27d316f35ab76a37e105fdd8 improved move_pages()
    throughput by breaking it into chunks, but it also made migrate_prep() be
    called once per chunk (every 128pages or so) instead of once per
    move_pages().

    This patch reverts to calling migrate_prep() only once per chunk as we did
    before 2.6.29. It is also a followup to commit
    0aedadf91a70a11c4a3e7c7d99b21e5528af8d5d ("mm: move migrate_prep out from
    under mmap_sem").

    This improves migration throughput on the above machine from 600MB/s to
    750MB/s.

    Signed-off-by: Brice Goglin
    Acked-by: Christoph Lameter
    Cc: KOSAKI Motohiro
    Cc: Heiko Carstens
    Cc: Nick Piggin
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brice Goglin
     
  • Callers of alloc_pages_node() can optionally specify -1 as a node to mean
    "allocate from the current node". However, a number of the callers in
    fast paths know for a fact their node is valid. To avoid a comparison and
    branch, this patch adds alloc_pages_exact_node() that only checks the nid
    with VM_BUG_ON(). Callers that know their node is valid are then
    converted.

    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Pekka Enberg
    Acked-by: Paul Mundt [for the SLOB NUMA bits]
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

03 Apr, 2009

1 commit

  • Recruit a page flag to aid in cache management. The following extra flag is
    defined:

    (1) PG_fscache (PG_private_2)

    The marked page is backed by a local cache and is pinning resources in the
    cache driver.

    If PG_fscache is set, then things that checked for PG_private will now also
    check for that. This includes things like truncation and page invalidation.
    The function page_has_private() had been added to make the checks for both
    PG_private and PG_private_2 at the same time.

    Signed-off-by: David Howells
    Acked-by: Steve Dickson
    Acked-by: Trond Myklebust
    Acked-by: Rik van Riel
    Acked-by: Al Viro
    Tested-by: Daire Byrne

    David Howells
     

12 Feb, 2009

1 commit


14 Jan, 2009

1 commit


09 Jan, 2009

2 commits

  • Now, management of "charge" under page migration is done under following
    manner. (Assume migrate page contents from oldpage to newpage)

    before
    - "newpage" is charged before migration.
    at success.
    - "oldpage" is uncharged at somewhere(unmap, radix-tree-replace)
    at failure
    - "newpage" is uncharged.
    - "oldpage" is charged if necessary (*1)

    But (*1) is not reliable....because of GFP_ATOMIC.

    This patch tries to change behavior as following by charge/commit/cancel ops.

    before
    - charge PAGE_SIZE (no target page)
    success
    - commit charge against "newpage".
    failure
    - commit charge against "oldpage".
    (PCG_USED bit works effectively to avoid double-counting)
    - if "oldpage" is obsolete, cancel charge of PAGE_SIZE.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • There is a small race in do_swap_page(). When the page swapped-in is
    charged, the mapcount can be greater than 0. But, at the same time some
    process (shares it ) call unmap and make mapcount 1->0 and the page is
    uncharged.

    CPUA CPUB
    mapcount == 1.
    (1) charge if mapcount==0 zap_pte_range()
    (2) mapcount 1 => 0.
    (3) uncharge(). (success)
    (4) set page's rmap()
    mapcount 0=>1

    Then, this swap page's account is leaked.

    For fixing this, I added a new interface.
    - charge
    account to res_counter by PAGE_SIZE and try to free pages if necessary.
    - commit
    register page_cgroup and add to LRU if necessary.
    - cancel
    uncharge PAGE_SIZE because of do_swap_page failure.

    CPUA
    (1) charge (always)
    (2) set page's rmap (mapcount > 0)
    (3) commit charge was necessary or not after set_pte().

    This protocol uses PCG_USED bit on page_cgroup for avoiding over accounting.
    Usual mem_cgroup_charge_common() does charge -> commit at a time.

    And this patch also adds following function to clarify all charges.

    - mem_cgroup_newpage_charge() ....replacement for mem_cgroup_charge()
    called against newly allocated anon pages.

    - mem_cgroup_charge_migrate_fixup()
    called only from remove_migration_ptes().
    we'll have to rewrite this later.(this patch just keeps old behavior)
    This function will be removed by additional patch to make migration
    clearer.

    Good for clarifying "what we do"

    Then, we have 4 following charge points.
    - newpage
    - swap-in
    - add-to-cache.
    - migration.

    [akpm@linux-foundation.org: add missing inline directives to stubs]
    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

07 Jan, 2009

3 commits

  • If we add NOOP stubs for SetPageSwapCache() and ClearPageSwapCache(), then
    we can remove the #ifdef CONFIG_SWAPs from mm/migrate.c.

    Signed-off-by: Hugh Dickins
    Acked-by: Christoph Lameter
    Cc: Nick Piggin
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • pp->page is never used when not set to the right page, so there is no need
    to set it to ZERO_PAGE(0) by default.

    Signed-off-by: Brice Goglin
    Acked-by: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brice Goglin
     
  • Rework do_pages_move() to work by page-sized chunks of struct page_to_node
    that are passed to do_move_page_to_node_array(). We now only have to
    allocate a single page instead a possibly very large vmalloc area to store
    all page_to_node entries.

    As a result, new_page_node() will now have a very small lookup, hidding
    much of the overall sys_move_pages() overhead.

    Signed-off-by: Brice Goglin
    Signed-off-by: Nathalie Furmento
    Acked-by: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brice Goglin
     

25 Dec, 2008

1 commit


17 Dec, 2008

1 commit

  • Commit 80bba1290ab5122c60cdb73332b26d288dc8aedd removed one necessary
    variable initialization. As a result following warning happened:

    CC mm/migrate.o
    mm/migrate.c: In function 'sys_move_pages':
    mm/migrate.c:1001: warning: 'err' may be used uninitialized in this function

    More unfortunately, if find_vma() failed, kernel read uninitialized
    memory.

    Signed-off-by: KOSAKI Motohiro
    CC: Brice Goglin
    Cc: Christoph Lameter
    Cc: KAMEZAWA Hiroyuki
    Cc: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

11 Dec, 2008

1 commit

  • Since commit 2f007e74bb85b9fc4eab28524052161703300f1a, do_pages_stat()
    gets the page address from user-space and puts the corresponding status
    back while holding the mmap_sem for read. There is no need to hold
    mmap_sem there while some page-faults may occur.

    This patch adds a temporary address and status buffer so as to only
    hold mmap_sem while working on these kernel buffers. This is
    implemented by extracting do_pages_stat_array() out of do_pages_stat().

    Signed-off-by: Brice Goglin
    Cc: Christoph Lameter
    Cc: KAMEZAWA Hiroyuki
    Cc: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brice Goglin