28 Feb, 2013

1 commit

  • munlock_vma_pages_range() was always incrementing addresses by PAGE_SIZE
    at a time. When munlocking THP pages (or the huge zero page), this
    resulted in taking the mm->page_table_lock 512 times in a row.

    We can do better by making use of the page_mask returned by
    follow_page_mask (for the huge zero page case), or the size of the page
    munlock_vma_page() operated on (for the true THP page case).
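
    A minimal, self-contained sketch of the stride arithmetic (all names and
    sizes here are illustrative, not the kernel code): stepping by the returned
    page mask turns 512 lock round-trips over one 2MB THP into a single one.

    #include <stdio.h>

    #define PAGE_SHIFT   12
    #define PAGE_SIZE    (1UL << PAGE_SHIFT)
    #define HPAGE_PMD_NR 512                /* 4K pages per 2MB THP */

    /* stand-in for follow_page_mask(): 0 for a normal page, 511 for a THP */
    static unsigned long fake_page_mask(unsigned long addr)
    {
        (void)addr;
        return HPAGE_PMD_NR - 1;
    }

    int main(void)
    {
        unsigned long start = 0, end = HPAGE_PMD_NR * PAGE_SIZE;
        unsigned long addr, locks = 0;

        for (addr = start; addr < end; ) {
            unsigned long page_mask = fake_page_mask(addr);

            locks++;                        /* one page_table_lock round-trip */
            addr += (page_mask + 1) * PAGE_SIZE;
        }
        printf("lock acquisitions over one THP: %lu (was %d with PAGE_SIZE steps)\n",
               locks, HPAGE_PMD_NR);
        return 0;
    }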

    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

24 Feb, 2013

1 commit

  • In find_extend_vma(), we don't need mlock_vma_pages_range() to verify
    the vma type - we know we're working with a stack. So, we can call
    directly into __mlock_vma_pages_range(), and remove the last
    make_pages_present() call site.

    Note that we don't use mm_populate() here, so we can't release the
    mmap_sem while allocating new stack pages. This is deemed acceptable,
    because the stack vmas grow by a bounded number of pages at a time, and
    these are anon pages so we don't have to read from disk to populate
    them.

    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Tested-by: Andy Lutomirski
    Cc: Greg Ungerer
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

12 Jan, 2013

1 commit

  • Eric Wong reported on 3.7 and 3.8-rc2 that ppoll() got stuck when
    waiting for POLLIN on a local TCP socket. It was easier to trigger if
    there was disk IO and dirty pages at the same time and he bisected it to
    commit 1fb3f8ca0e92 ("mm: compaction: capture a suitable high-order page
    immediately when it is made available").

    The intention of that patch was to improve high-order allocations under
    memory pressure after changes made to reclaim in 3.6 drastically hurt
    THP allocations but the approach was flawed. For Eric, the problem was
    that page->pfmemalloc was not being cleared for captured pages leading
    to a poor interaction with swap-over-NFS support causing the packets to
    be dropped. However, I identified a few more problems with the patch
    including the fact that it can increase contention on zone->lock in some
    cases which could result in async direct compaction being aborted early.

    In retrospect the capture patch took the wrong approach. What it should
    have done is mark the pageblock being migrated as MIGRATE_ISOLATE if it
    was allocating for THP and avoided races that way. While the patch was
    shown to improve allocation success rates at the time, the benefit is
    marginal given the relative complexity and it should be revisited from
    scratch in the context of the other reclaim-related changes that have
    taken place since the patch was first written and tested. This patch
    partially reverts commit 1fb3f8ca "mm: compaction: capture a suitable
    high-order page immediately when it is made available".

    Reported-and-tested-by: Eric Wong
    Tested-by: Eric Dumazet
    Cc: stable@vger.kernel.org
    Signed-off-by: Mel Gorman
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

17 Dec, 2012

1 commit

  • Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
    "There are three implementations for NUMA balancing, this tree
    (balancenuma), numacore which has been developed in tip/master and
    autonuma which is in aa.git.

    In almost all respects balancenuma is the dumbest of the three because
    its main impact is on the VM side with no attempt to be smart about
    scheduling. In the interest of getting the ball rolling, it would be
    desirable to see this much merged for 3.8 with the view to building
    scheduler smarts on top and adapting the VM where required for 3.9.

    The most recent set of comparisons available from different people are

    mel: https://lkml.org/lkml/2012/12/9/108
    mingo: https://lkml.org/lkml/2012/12/7/331
    tglx: https://lkml.org/lkml/2012/12/10/437
    srikar: https://lkml.org/lkml/2012/12/10/397

    The results are a mixed bag. In my own tests, balancenuma does
    reasonably well. It's dumb as rocks and does not regress against
    mainline. On the other hand, Ingo's tests show that balancenuma is
    incapable of converging for these workloads driven by perf, which is bad
    but is potentially explained by the lack of scheduler smarts. Thomas'
    results show balancenuma improves on mainline but falls far short of
    numacore or autonuma. Srikar's results indicate we all suffer on a
    large machine with imbalanced node sizes.

    My own testing showed that recent numacore results have improved
    dramatically, particularly in the last week but not universally.
    We've butted heads heavily on system CPU usage and high levels of
    migration even when it shows that overall performance is better.
    There are also cases where it regresses. Of interest is that for
    specjbb in some configurations it will regress for lower numbers of
    warehouses and show gains for higher numbers, which is not reported by
    the tool by default and is sometimes missed in reports. Recently I
    reported for numacore that the JVM was crashing with
    NullPointerExceptions but currently it's unclear what the source of
    this problem is. Initially I thought it was in how numacore batch-handles
    PTEs, but I no longer think this is the case. It's possible
    numacore is just able to trigger it due to higher rates of migration.

    These reports were quite late in the cycle so I/we would like to start
    with this tree as it contains much of the code we can agree on and has
    not changed significantly over the last 2-3 weeks."

    * tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
    mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
    mm/rmap: Convert the struct anon_vma::mutex to an rwsem
    mm: migrate: Account a transhuge page properly when rate limiting
    mm: numa: Account for failed allocations and isolations as migration failures
    mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
    mm: numa: Add THP migration for the NUMA working set scanning fault case.
    mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
    mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
    mm: sched: numa: Control enabling and disabling of NUMA balancing
    mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
    mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely tasknode relationships
    mm: numa: migrate: Set last_nid on newly allocated page
    mm: numa: split_huge_page: Transfer last_nid on tail page
    mm: numa: Introduce last_nid to the page frame
    sched: numa: Slowly increase the scanning period as NUMA faults are handled
    mm: numa: Rate limit setting of pte_numa if node is saturated
    mm: numa: Rate limit the amount of memory that is migrated between nodes
    mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
    mm: numa: Migrate pages handled during a pmd_numa hinting fault
    mm: numa: Migrate on reference policy
    ...

    Linus Torvalds
     

12 Dec, 2012

1 commit

  • Several places need to find the pmd by (mm_struct, address), so introduce
    a function to simplify it.
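
    A sketch of what such a helper looks like with the 3.x-era page-table
    accessors (the name find_pmd_sketch is illustrative; details of the merged
    function may differ):

    #include <linux/mm.h>

    /* Sketch only: walk pgd -> pud -> pmd for a given mm and address. */
    static pmd_t *find_pmd_sketch(struct mm_struct *mm, unsigned long address)
    {
        pgd_t *pgd;
        pud_t *pud;
        pmd_t *pmd = NULL;

        pgd = pgd_offset(mm, address);
        if (!pgd_present(*pgd))
            goto out;

        pud = pud_offset(pgd, address);
        if (!pud_present(*pud))
            goto out;

        pmd = pmd_offset(pud, address);
        if (!pmd_present(*pmd))
            pmd = NULL;
    out:
        return pmd;
    }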

    [akpm@linux-foundation.org: fix warning]
    Signed-off-by: Bob Liu
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Ni zhan Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bob Liu
     

11 Dec, 2012

1 commit

  • Note: This is very heavily based on a patch from Peter Zijlstra with
    fixes from Ingo Molnar, Hugh Dickins and Johannes Weiner. That patch
    put a lot of migration logic into mm/huge_memory.c where it does
    not belong. This version tries to share some of the migration
    logic with migrate_misplaced_page. However, it should be noted
    that now migrate.c is doing more with the pagetable manipulation
    than is preferred. The end result is barely recognisable so as
    before, the signed-offs had to be removed but will be re-added if
    the original authors are ok with it.

    Add THP migration for the NUMA working set scanning fault case.

    It uses the page lock to serialize. No migration pte dance is
    necessary because the pte is already unmapped when we decide
    to migrate.

    [dhillf@gmail.com: Fix memory leak on isolation failure]
    [dhillf@gmail.com: Fix transfer of last_nid information]
    Signed-off-by: Mel Gorman

    Mel Gorman
     

09 Oct, 2012

12 commits

  • NR_MLOCK is only accounted in single page units: there's no logic to
    handle transparent hugepages. This patch uses the appropriate number of
    pages when adjusting the statistics so that the correct amount of memory
    is reflected.

    Currently:

    $ grep Mlocked /proc/meminfo
    Mlocked: 19636 kB

    #define MAP_SIZE (4UL << 30) /* 4GB */

    void *ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
    mlock(ptr, MAP_SIZE);

    $ grep Mlocked /proc/meminfo
    Mlocked: 29844 kB

    munlock(ptr, MAP_SIZE);

    $ grep Mlocked /proc/meminfo
    Mlocked: 19636 kB

    And with this patch:

    $ grep Mlock /proc/meminfo
    Mlocked: 19636 kB

    mlock(ptr, MAP_SIZE);

    $ grep Mlock /proc/meminfo
    Mlocked: 4213664 kB

    munlock(ptr, MAP_SIZE);

    $ grep Mlock /proc/meminfo
    Mlocked: 19636 kB
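
    The accounting change itself boils down to adjusting the counter by the
    compound page size rather than by one page. Roughly, as a sketch built on
    the era's hpage_nr_pages() helper (not the exact hunks of the patch):

    #include <linux/mm.h>
    #include <linux/vmstat.h>

    /* Sketch: account a possibly-huge page as mlocked (or munlocked). */
    static void mlock_account_sketch(struct page *page, int lock)
    {
        int nr_pages = hpage_nr_pages(page);   /* 512 for a 2MB THP, 1 otherwise */

        mod_zone_page_state(page_zone(page), NR_MLOCK,
                            lock ? nr_pages : -nr_pages);
    }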

    Signed-off-by: David Rientjes
    Reported-by: Hugh Dickins
    Acked-by: Hugh Dickins
    Reviewed-by: Andrea Arcangeli
    Cc: Naoya Horiguchi
    Cc: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Reviewed-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Presently CMA cannot migrate mlocked pages so it ends up failing to allocate
    contiguous memory space.

    This patch makes mlocked pages be migrated out. Of course, it can affect
    realtime processes but in CMA usecase, contiguous memory allocation failing
    is far worse than access latency to an mlocked page being variable while
    CMA is running. If someone wants to make the system realtime, he shouldn't
    enable CMA because stalls can still happen at random times.

    [akpm@linux-foundation.org: tweak comment text, per Mel]
    Signed-off-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Marek Szyprowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • We had thought that pages could no longer get freed while still marked as
    mlocked; but Johannes Weiner posted this program to demonstrate that
    truncating an mlocked private file mapping containing COWed pages is still
    mishandled:

    #include <sys/types.h>
    #include <sys/stat.h>
    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdlib.h>
    #include <stdio.h>

    int main(void)
    {
        char *map;
        int fd;

        system("grep mlockfreed /proc/vmstat");
        fd = open("chigurh", O_CREAT|O_EXCL|O_RDWR, 0600);
        unlink("chigurh");
        ftruncate(fd, 4096);
        map = mmap(NULL, 4096, PROT_WRITE, MAP_PRIVATE, fd, 0);
        map[0] = 11;
        mlock(map, sizeof(fd));
        ftruncate(fd, 0);
        close(fd);
        munlock(map, sizeof(fd));
        munmap(map, 4096);
        system("grep mlockfreed /proc/vmstat");
        return 0;
    }

    The anon COWed pages are not caught by truncation's clear_page_mlock() of
    the pagecache pages; but unmap_mapping_range() unmaps them, so we ought to
    look out for them there in page_remove_rmap(). Indeed, why should
    truncation or invalidation be doing the clear_page_mlock() when removing
    from pagecache? mlock is a property of mapping in userspace, not a
    property of pagecache: an mlocked unmapped page is nonsensical.

    Reported-by: Johannes Weiner
    Signed-off-by: Hugh Dickins
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Michel Lespinasse
    Cc: Ying Han
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • page_evictable(page, vma) is an irritant: almost all its callers pass
    NULL for vma. Remove the vma arg and use mlocked_vma_newpage(vma, page)
    explicitly in the couple of places it's needed. But in those places we
    don't even need page_evictable() itself! They're dealing with a freshly
    allocated anonymous page, which has no "mapping" and cannot be mlocked yet.

    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: Rik van Riel
    Acked-by: Johannes Weiner
    Cc: Michel Lespinasse
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • This is almost entirely based on Rik's previous patches and discussions
    with him about how this might be implemented.

    Order > 0 compaction stops when enough free pages of the correct page
    order have been coalesced. When doing subsequent higher order
    allocations, it is possible for compaction to be invoked many times.

    However, the compaction code always starts out looking for things to
    compact at the start of the zone, and for free pages to compact things to
    at the end of the zone.

    This can cause quadratic behaviour, with isolate_freepages starting at the
    end of the zone each time, even though previous invocations of the
    compaction code already filled up all free memory on that end of the zone.
    This can cause isolate_freepages to take enormous amounts of CPU with
    certain workloads on larger memory systems.

    This patch caches where the migration and free scanner should start from
    on subsequent compaction invocations using the pageblock-skip information.
    When compaction starts it begins from the cached restart points and will
    update the cached restart points until a page is isolated or a pageblock
    is skipped that would have been scanned by synchronous compaction.
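
    In outline, the zone remembers how far each scanner got so that the next
    invocation can resume there instead of rescanning the whole zone. A sketch
    of that resume step, assuming the compact_cached_*_pfn fields this patch
    adds to struct zone under CONFIG_COMPACTION (the helper name is
    illustrative):

    #include <linux/mmzone.h>

    /* Sketch: start the migrate/free scanners from the cached positions,
     * falling back to the zone boundaries if the cache looks unset or stale. */
    static void compact_resume_sketch(struct zone *zone,
                                      unsigned long *migrate_pfn,
                                      unsigned long *free_pfn)
    {
        unsigned long start = zone->zone_start_pfn;
        unsigned long end = start + zone->spanned_pages;

        *migrate_pfn = zone->compact_cached_migrate_pfn;
        *free_pfn = zone->compact_cached_free_pfn;

        if (*migrate_pfn < start || *migrate_pfn > end)
            *migrate_pfn = start;
        if (*free_pfn <= *migrate_pfn || *free_pfn > end)
            *free_pfn = end;
    }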

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Richard Davies
    Cc: Shaohua Li
    Cc: Avi Kivity
    Acked-by: Rafael Aquini
    Cc: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When compaction was implemented it was known that scanning could
    potentially be excessive. The ideal was that a counter be maintained for
    each pageblock but maintaining this information would incur a severe
    penalty due to a shared writable cache line. It has reached the point
    where the scanning costs are a serious problem, particularly on
    long-lived systems where a large process starts and allocates a large
    number of THPs at the same time.

    Instead of using a shared counter, this patch adds another bit to the
    pageblock flags called PG_migrate_skip. If a pageblock is scanned by
    either migrate or free scanner and 0 pages were isolated, the pageblock is
    marked to be skipped in the future. When scanning, this bit is checked
    before any scanning takes place and the block skipped if set.

    The main difficulty with a patch like this is "when to ignore the cached
    information?" If it's ignored too often, the scanning rates will still be
    excessive. If the information is too stale then allocations will fail
    that might have otherwise succeeded. In this patch

    o CMA always ignores the information
    o If the migrate and free scanner meet then the cached information will
    be discarded if it's at least 5 seconds since the last time the cache
    was discarded
    o If there are a large number of allocation failures, discard the cache.

    The time-based heuristic is very clumsy but there are few choices for a
    better event. Depending solely on multiple allocation failures still
    allows excessive scanning when THP allocations are failing in quick
    succession due to memory pressure. Waiting until memory pressure is
    relieved would cause compaction to continually fail instead of using
    reclaim/compaction to try to allocate the page. The time-based mechanism is
    clumsy but a better option is not obvious.
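
    The check itself is cheap: before touching a pageblock, consult its skip
    bit unless the caller asked to ignore the hints (the CMA case above). A
    sketch, assuming the get/set_pageblock_skip() accessors for the new
    PG_migrate_skip bit:

    #include <linux/mmzone.h>
    #include <linux/pageblock-flags.h>

    /* Sketch: is this pageblock worth scanning at all? */
    static bool isolation_suitable_sketch(struct page *page, bool ignore_skip_hint)
    {
        if (ignore_skip_hint)           /* e.g. CMA always scans */
            return true;
        return !get_pageblock_skip(page);
    }

    /* Sketch: after a scan that isolated nothing, mark the block for skipping. */
    static void note_scan_result_sketch(struct page *page, unsigned long nr_isolated)
    {
        if (!nr_isolated)
            set_pageblock_skip(page);
    }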

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Richard Davies
    Cc: Shaohua Li
    Cc: Avi Kivity
    Acked-by: Rafael Aquini
    Cc: Fengguang Wu
    Cc: Michal Nazarewicz
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Kyungmin Park
    Cc: Mark Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This reverts commit 7db8889ab05b ("mm: have order > 0 compaction start
    off where it left") and commit de74f1cc ("mm: have order > 0 compaction
    start near a pageblock with free pages"). These patches were a good
    idea and tests confirmed that they massively reduced the amount of
    scanning but the implementation is complex and tricky to understand. A
    later patch will cache what pageblocks should be skipped and
    reimplements the concept of compact_cached_free_pfn on top for both
    migration and free scanners.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Richard Davies
    Cc: Shaohua Li
    Cc: Avi Kivity
    Acked-by: Rafael Aquini
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • isolate_migratepages_range() might isolate no pages, for example when
    zone->lru_lock is contended and compaction is running asynchronously. In
    this case we should abort compaction; otherwise compact_zone will run a
    useless loop and make zone->lru_lock even more contended.

    An additional check is added to ensure that cc.migratepages and
    cc.freepages get properly drained when compaction is aborted.

    [minchan@kernel.org: Putback pages isolated for migration if aborting]
    [akpm@linux-foundation.org: compact_zone_order requires non-NULL arg contended]
    [akpm@linux-foundation.org: make compact_zone_order() require non-NULL arg `contended']
    [minchan@kernel.org: Putback pages isolated for migration if aborting]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Shaohua Li
    Signed-off-by: Mel Gorman
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • * Add ALLOC_CMA alloc flag and pass it to [__]zone_watermark_ok()
    (from Minchan Kim).

    * During the watermark check, decrease the number of available free pages
    by the number of free CMA pages if necessary (unmovable allocations cannot
    use pages from CMA areas); see the sketch below.
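
    The watermark adjustment is essentially one subtraction in
    __zone_watermark_ok(). A sketch of the idea, assuming the NR_FREE_CMA_PAGES
    counter introduced alongside this change (the flag value below is an
    illustrative stand-in for ALLOC_CMA):

    #include <linux/mmzone.h>
    #include <linux/vmstat.h>

    #define ALLOC_CMA_SKETCH 0x80   /* stand-in for the real ALLOC_CMA flag */

    /* Sketch: unmovable allocations must not treat free CMA pages as usable. */
    static long usable_free_pages_sketch(struct zone *z, long free_pages,
                                         int alloc_flags)
    {
        if (!(alloc_flags & ALLOC_CMA_SKETCH))
            free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
        return free_pages;
    }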

    Signed-off-by: Bartlomiej Zolnierkiewicz
    Signed-off-by: Kyungmin Park
    Cc: Marek Szyprowski
    Cc: Michal Nazarewicz
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bartlomiej Zolnierkiewicz
     
  • Drop clean cache pages instead of migration during alloc_contig_range() to
    minimise allocation latency by reducing the amount of migration that is
    necessary. It's useful for CMA because latency of migration is more
    important than evicting the background process's working set. In
    addition, as pages are reclaimed, fewer free pages are needed as migration
    targets, so it avoids reclaiming memory just to get free pages, which is a
    contributory factor to increased latency.

    I measured the elapsed time of __alloc_contig_migrate_range(), which
    migrates 10MB within a 40MB movable zone on a QEMU machine.

    Before - 146ms, After - 7ms

    [akpm@linux-foundation.org: fix nommu build]
    Signed-off-by: Mel Gorman
    Signed-off-by: Minchan Kim
    Reviewed-by: Mel Gorman
    Cc: Marek Szyprowski
    Acked-by: Michal Nazarewicz
    Cc: Rik van Riel
    Tested-by: Kyungmin Park
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Make sure the #endif that terminates the standard #ifndef / #define /
    #endif construct gets labeled, and gets positioned at the end of the file
    as is normally the case.
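
    For reference, the convention being enforced looks like this (the guard
    name is illustrative):

    #ifndef __MM_SOMEHEADER_H
    #define __MM_SOMEHEADER_H

    /* ... declarations ... */

    #endif /* __MM_SOMEHEADER_H */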

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • While compaction is migrating pages to free up large contiguous blocks
    for allocation it races with other allocation requests that may steal
    these blocks or break them up. This patch alters direct compaction to
    capture a suitable free page as soon as it becomes available to reduce
    this race. It uses similar logic to split_free_page() to ensure that
    watermarks are still obeyed.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

22 Aug, 2012

1 commit

  • Jim Schutt reported a problem that pointed at compaction contending
    heavily on locks. The workload is straightforward and, in his own words:

    The systems in question have 24 SAS drives spread across 3 HBAs,
    running 24 Ceph OSD instances, one per drive. FWIW these servers
    are dual-socket Intel 5675 Xeons w/48 GB memory. I've got ~160
    Ceph Linux clients doing dd simultaneously to a Ceph file system
    backed by 12 of these servers.

    Early in the test everything looks fine

    procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
    r b swpd free buff cache si so bi bo in cs us sy id wa st
    31 15 0 287216 576 38606628 0 0 2 1158 2 14 1 3 95 0 0
    27 15 0 225288 576 38583384 0 0 18 2222016 203357 134876 11 56 17 15 0
    28 17 0 219256 576 38544736 0 0 11 2305932 203141 146296 11 49 23 17 0
    6 18 0 215596 576 38552872 0 0 7 2363207 215264 166502 12 45 22 20 0
    22 18 0 226984 576 38596404 0 0 3 2445741 223114 179527 12 43 23 22 0

    and then it goes to pot

    procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
    r b swpd free buff cache si so bi bo in cs us sy id wa st
    163 8 0 464308 576 36791368 0 0 11 22210 866 536 3 13 79 4 0
    207 14 0 917752 576 36181928 0 0 712 1345376 134598 47367 7 90 1 2 0
    123 12 0 685516 576 36296148 0 0 429 1386615 158494 60077 8 84 5 3 0
    123 12 0 598572 576 36333728 0 0 1107 1233281 147542 62351 7 84 5 4 0
    622 7 0 660768 576 36118264 0 0 557 1345548 151394 59353 7 85 4 3 0
    223 11 0 283960 576 36463868 0 0 46 1107160 121846 33006 6 93 1 1 0

    Note that system CPU usage is very high and the rate of blocks being
    written out has dropped by 42%. He analysed this with perf and found

    perf record -g -a sleep 10
    perf report --sort symbol --call-graph fractal,5
    34.63% [k] _raw_spin_lock_irqsave
    |
    |--97.30%-- isolate_freepages
    | compaction_alloc
    | unmap_and_move
    | migrate_pages
    | compact_zone
    | compact_zone_order
    | try_to_compact_pages
    | __alloc_pages_direct_compact
    | __alloc_pages_slowpath
    | __alloc_pages_nodemask
    | alloc_pages_vma
    | do_huge_pmd_anonymous_page
    | handle_mm_fault
    | do_page_fault
    | page_fault
    | |
    | |--87.39%-- skb_copy_datagram_iovec
    | | tcp_recvmsg
    | | inet_recvmsg
    | | sock_recvmsg
    | | sys_recvfrom
    | | system_call
    | | __recv
    | | |
    | | --100.00%-- (nil)
    | |
    | --12.61%-- memcpy
    --2.70%-- [...]

    There was other data but primarily it is all showing that compaction
    contends heavily on zone->lock and zone->lru_lock.

    commit [b2eef8c0: mm: compaction: minimise the time IRQs are disabled
    while isolating pages for migration] noted that it was possible for
    migration to hold the lru_lock for an excessive amount of time. Very
    broadly speaking this patch expands the concept.

    This patch introduces compact_checklock_irqsave() to check if a lock
    is contended or the process needs to be scheduled. If either condition
    is true then async compaction is aborted and the caller is informed.
    The page allocator will fail a THP allocation if compaction failed due
    to contention. This patch also introduces compact_trylock_irqsave()
    which will acquire the lock only if it is not contended and the process
    does not need to schedule.
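
    In outline the lock-taking helper behaves as below (a sketch; the merged
    compact_checklock_irqsave() differs in detail, e.g. it also releases a lock
    it already holds):

    #include <linux/spinlock.h>
    #include <linux/sched.h>

    /* Sketch: take the lock only if doing so will not stall async compaction.
     * Returns false (and flags contention) when the caller should abort. */
    static bool checklock_irqsave_sketch(spinlock_t *lock, unsigned long *flags,
                                         bool sync, bool *contended)
    {
        if (need_resched() || spin_is_contended(lock)) {
            if (!sync) {
                *contended = true;      /* THP allocation will fail fast */
                return false;
            }
            cond_resched();
        }
        spin_lock_irqsave(lock, *flags);
        return true;
    }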

    Reported-by: Jim Schutt
    Tested-by: Jim Schutt
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

01 Aug, 2012

4 commits

  • Change the skb allocation API to indicate RX usage and use this to fall
    back to the PFMEMALLOC reserve when needed. SKBs allocated from the
    reserve are tagged in skb->pfmemalloc. If an SKB is allocated from the
    reserve and the socket is later found to be unrelated to page reclaim, the
    packet is dropped so that the memory remains available for page reclaim.
    Network protocols are expected to recover from this packet loss.
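
    Conceptually, the receive path refuses to hand reserve-backed skbs to
    sockets that are not themselves helping reclaim. A sketch of that check,
    assuming the skb->pfmemalloc tag and the SOCK_MEMALLOC socket flag this
    series introduces:

    #include <linux/skbuff.h>
    #include <net/sock.h>

    /* Sketch: an skb built from the PFMEMALLOC reserves may only be delivered
     * to a socket that is part of memory reclaim (e.g. swap-over-NBD/NFS). */
    static bool may_deliver_skb_sketch(struct sock *sk, struct sk_buff *skb)
    {
        if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
            return false;       /* drop; the protocol is expected to recover */
        return true;
    }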

    [a.p.zijlstra@chello.nl: Ideas taken from various patches]
    [davem@davemloft.net: Use static branches, coding style corrections]
    [sebastian@breakpoint.cc: Avoid unnecessary cast, fix !CONFIG_NET build]
    Signed-off-by: Mel Gorman
    Acked-by: David S. Miller
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When a user or administrator requires swap for their application, they
    create a swap partition and file, format it with mkswap and activate it
    with swapon. Swap over the network is considered as an option in diskless
    systems. The two likely scenarios are when blade servers are used as part
    of a cluster where the form factor or maintenance costs do not allow the
    use of disks and thin clients.

    The Linux Terminal Server Project recommends the use of the Network Block
    Device (NBD) for swap according to the manual at
    https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download
    There is also documentation and tutorials on how to setup swap over NBD at
    places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The
    nbd-client also documents the use of NBD as swap. Despite this, the fact
    is that a machine using NBD for swap can deadlock within minutes if swap
    is used intensively. This patch series addresses the problem.

    The core issue is that network block devices do not use mempools like
    normal block devices do. As the host cannot control where they receive
    packets from, they cannot reliably work out in advance how much memory
    they might need. Some years ago, Peter Zijlstra developed a series of
    patches that supported swap over an NFS that at least one distribution is
    carrying within their kernels. This patch series borrows very heavily
    from Peter's work to support swapping over NBD as a pre-requisite to
    supporting swap-over-NFS. The bulk of the complexity is concerned with
    preserving memory that is allocated from the PFMEMALLOC reserves for use
    by the network layer which is needed for both NBD and NFS.

    Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to
    preserve access to pages allocated under low memory situations
    to callers that are freeing memory.

    Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks

    Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC
    reserves without setting PFMEMALLOC.

    Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves
    for later use by network packet processing.

    Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required

    Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set.

    Patches 7-12 allows network processing to use PFMEMALLOC reserves when
    the socket has been marked as being used by the VM to clean pages. If
    packets are received and stored in pages that were allocated under
    low-memory situations and are unrelated to the VM, the packets
    are dropped.

    Patch 11 reintroduces __skb_alloc_page which the networking
    folk may object to but is needed in some cases to propagate
    pfmemalloc from a newly allocated page to an skb. If there is a
    strong objection, this patch can be dropped with the impact being
    that swap-over-network will be slower in some cases but it should
    not fail.

    Patch 13 is a micro-optimisation to avoid a function call in the
    common case.

    Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use
    PFMEMALLOC if necessary.

    Patch 15 notes that it is still possible for the PFMEMALLOC reserve
    to be depleted. To prevent this, direct reclaimers get throttled on
    a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is
    expected that kswapd and the direct reclaimers already running
    will clean enough pages for the low watermark to be reached and
    the throttled processes are woken up.

    Patch 16 adds a statistic to track how often processes get throttled

    Some basic performance testing was run using kernel builds, netperf on
    loopback for UDP and TCP, hackbench (pipes and sockets), iozone and
    sysbench. Each of them were expected to use the sl*b allocators
    reasonably heavily but there did not appear to be significant performance
    variances.

    For testing swap-over-NBD, a machine was booted with 2G of RAM with a
    swapfile backed by NBD. 8*NUM_CPU processes were started that create
    anonymous memory mappings and read them linearly in a loop. The total
    size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under
    memory pressure.

    Without the patches and using SLUB, the machine locks up within minutes;
    with them applied it runs to completion. With SLAB, the story is different
    as an unpatched kernel runs to completion. However, the patched kernel
    completed the test 45% faster.

    MICRO
                                               3.5.0-rc2    3.5.0-rc2
                                                 vanilla      swapnbd
    Unrecognised test vmscan-anon-mmap-write
    MMTests Statistics: duration
    Sys Time Running Test (seconds)               197.80       173.07
    User+Sys Time Running Test (seconds)          206.96       182.03
    Total Elapsed Time (seconds)                 3240.70      1762.09

    This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages

    Allocations of pages below the min watermark run a risk of the machine
    hanging due to a lack of memory. To prevent this, only callers who have
    PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are
    allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to
    a slab though, nothing prevents other callers consuming free objects
    within those slabs. This patch limits access to slab pages that were
    allocated from the PFMEMALLOC reserves.

    When this patch is applied, pages allocated from below the low watermark
    are returned with page->pfmemalloc set and it is up to the caller to
    determine how the page should be protected. SLAB restricts access to any
    page with page->pfmemalloc set to callers that are known to be able to
    access the PFMEMALLOC reserve. If one is not available, an attempt is
    made to allocate a new page rather than use a reserve. SLUB is a bit more
    relaxed in that it only records if the current per-CPU page was allocated
    from PFMEMALLOC reserve and uses another partial slab if the caller does
    not have the necessary GFP or process flags. This was found to be
    sufficient in tests to avoid hangs due to SLUB generally maintaining
    smaller lists than SLAB.

    In low-memory conditions it does mean that !PFMEMALLOC allocators can fail
    a slab allocation even though free objects are available because they are
    being preserved for callers that are freeing pages.
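
    The gate is conceptually simple, even though SLAB and SLUB implement it
    differently as described above. A sketch, assuming the page->pfmemalloc
    marker and the gfp_pfmemalloc_allowed() helper added by this series:

    #include <linux/gfp.h>
    #include <linux/mm_types.h>

    /* Sketch: may this caller consume objects from a slab page that was
     * allocated out of the PFMEMALLOC reserves? */
    static bool may_use_pfmemalloc_slab_sketch(struct page *slab_page,
                                               gfp_t gfpflags)
    {
        if (!slab_page->pfmemalloc)
            return true;                         /* ordinary slab page */
        return gfp_pfmemalloc_allowed(gfpflags); /* PF_MEMALLOC, TIF_MEMDIE, ... */
    }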

    [a.p.zijlstra@chello.nl: Original implementation]
    [sebastian@breakpoint.cc: Correct order of page flag clearing]
    Signed-off-by: Mel Gorman
    Cc: David Miller
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • On architectures with CONFIG_HUGETLB_PAGE_SIZE_VARIABLE set, such as
    Itanium, pageblock_order is a variable with default value of 0. It's set
    to the right value by set_pageblock_order() in function
    free_area_init_core().

    But pageblock_order may be used by sparse_init() before free_area_init_core()
    is called, along this path:
    sparse_init()
    ->sparse_early_usemaps_alloc_node()
    ->usemap_size()
    ->SECTION_BLOCKFLAGS_BITS
    ->((1UL << (PFN_SECTION_SHIFT - pageblock_order)) *
    NR_PAGEBLOCK_BITS)

    The uninitialized pageblock_order will cause memory to be wasted because
    usemap_size() returns a much bigger value than is really needed.

    For example, on an Itanium platform,
    sparse_init() pageblock_order=0 usemap_size=24576
    free_area_init_core() before pageblock_order=0, usemap_size=24576
    free_area_init_core() after pageblock_order=12, usemap_size=8

    That means 24K memory has been wasted for each section, so fix it by calling
    set_pageblock_order() from sparse_init().

    Signed-off-by: Xishi Qiu
    Signed-off-by: Jiang Liu
    Cc: Tony Luck
    Cc: Yinghai Lu
    Cc: KAMEZAWA Hiroyuki
    Cc: Benjamin Herrenschmidt
    Cc: KOSAKI Motohiro
    Cc: David Rientjes
    Cc: Keping Chen
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
  • Order > 0 compaction stops when enough free pages of the correct page
    order have been coalesced. When doing subsequent higher order
    allocations, it is possible for compaction to be invoked many times.

    However, the compaction code always starts out looking for things to
    compact at the start of the zone, and for free pages to compact things to
    at the end of the zone.

    This can cause quadratic behaviour, with isolate_freepages starting at the
    end of the zone each time, even though previous invocations of the
    compaction code already filled up all free memory on that end of the zone.

    This can cause isolate_freepages to take enormous amounts of CPU with
    certain workloads on larger memory systems.

    The obvious solution is to have isolate_freepages remember where it left
    off last time, and continue at that point the next time it gets invoked
    for an order > 0 compaction. This could cause compaction to fail if
    cc->free_pfn and cc->migrate_pfn are close together initially; in that
    case we restart from the end of the zone and try once more.

    Forced full (order == -1) compactions are left alone.

    [akpm@linux-foundation.org: checkpatch fixes]
    [akpm@linux-foundation.org: s/laste/last/, use 80 cols]
    Signed-off-by: Rik van Riel
    Reported-by: Jim Schutt
    Tested-by: Jim Schutt
    Cc: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

04 Jun, 2012

1 commit

  • This reverts commit 5ceb9ce6fe9462a298bb2cd5c9f1ca6cb80a0199.

    That commit seems to be the cause of the mm compation list corruption
    issues that Dave Jones reported. The locking (or rather, absence thereof)
    is dubious, as is the use of the 'page' variable once it has
    been found to be outside the pageblock range.

    So revert it for now, we can re-visit this for 3.6. If we even need to:
    as Minchan Kim says, "The patch wasn't a bug fix and even test workload
    was very theoretical".

    Reported-and-tested-by: Dave Jones
    Acked-by: Hugh Dickins
    Acked-by: KOSAKI Motohiro
    Acked-by: Minchan Kim
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Kyungmin Park
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

02 Jun, 2012

1 commit

  • Pull vfs changes from Al Viro.
    "A lot of misc stuff. The obvious groups:
    * Miklos' atomic_open series; kills the damn abuse of
    ->d_revalidate() by NFS, which was the major stumbling block for
    all work in that area.
    * ripping security_file_mmap() and dealing with deadlocks in the
    area; sanitizing the neighborhood of vm_mmap()/vm_munmap() in
    general.
    * ->encode_fh() switched to saner API; insane fake dentry in
    mm/cleancache.c gone.
    * assorted annotations in fs (endianness, __user)
    * parts of Artem's ->s_dirty work (jffs2 and reiserfs parts)
    * ->update_time() work from Josef.
    * other bits and pieces all over the place.

    Normally it would've been in two or three pull requests, but
    signal.git stuff had eaten a lot of time during this cycle ;-/"

    Fix up trivial conflicts in Documentation/filesystems/vfs.txt (the
    'truncate_range' inode method was removed by the VM changes, the VFS
    update adds an 'update_time()' method), and in fs/btrfs/ulist.[ch] (due
    to sparse fix added twice, with other changes nearby).

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (95 commits)
    nfs: don't open in ->d_revalidate
    vfs: retry last component if opening stale dentry
    vfs: nameidata_to_filp(): don't throw away file on error
    vfs: nameidata_to_filp(): inline __dentry_open()
    vfs: do_dentry_open(): don't put filp
    vfs: split __dentry_open()
    vfs: do_last() common post lookup
    vfs: do_last(): add audit_inode before open
    vfs: do_last(): only return EISDIR for O_CREAT
    vfs: do_last(): check LOOKUP_DIRECTORY
    vfs: do_last(): make ENOENT exit RCU safe
    vfs: make follow_link check RCU safe
    vfs: do_last(): use inode variable
    vfs: do_last(): inline walk_component()
    vfs: do_last(): make exit RCU safe
    vfs: split do_lookup()
    Btrfs: move over to use ->update_time
    fs: introduce inode operation ->update_time
    reiserfs: get rid of resierfs_sync_super
    reiserfs: mark the superblock as dirty a bit later
    ...

    Linus Torvalds
     

01 Jun, 2012

1 commit

  • Take vm_mmap_pgoff() to mm/util.c, convert vm_mmap() to use it and take
    it to mm/util.c as well, and convert both sys_mmap_pgoff() instances to
    use of vm_mmap_pgoff().

    Signed-off-by: Al Viro

    Al Viro
     

30 May, 2012

2 commits

  • When MIGRATE_UNMOVABLE pages are freed from MIGRATE_UNMOVABLE type
    pageblock (and some MIGRATE_MOVABLE pages are left in it) waiting until an
    allocation takes ownership of the block may take too long. The type of
    the pageblock remains unchanged so the pageblock cannot be used as a
    migration target during compaction.

    Fix it by:

    * Adding enum compact_mode (COMPACT_ASYNC_[MOVABLE,UNMOVABLE], and
    COMPACT_SYNC) and then converting sync field in struct compact_control
    to use it.

    * Adding nr_pageblocks_skipped field to struct compact_control and
    tracking how many destination pageblocks were of MIGRATE_UNMOVABLE type.
    If COMPACT_ASYNC_MOVABLE mode compaction ran fully in
    try_to_compact_pages() (COMPACT_COMPLETE) it implies that there is not a
    suitable page for allocation. In this case, check whether there were
    enough MIGRATE_UNMOVABLE pageblocks to try a second pass in
    COMPACT_ASYNC_UNMOVABLE mode.

    * Scanning the MIGRATE_UNMOVABLE pageblocks (during COMPACT_SYNC and
    COMPACT_ASYNC_UNMOVABLE compaction modes) and building a count based on
    finding PageBuddy pages, page_count(page) == 0 or PageLRU pages. If all
    pages within the MIGRATE_UNMOVABLE pageblock are in one of those three
    sets change the whole pageblock type to MIGRATE_MOVABLE.

    My particular test case (on an ARM EXYNOS4 device with 512 MiB, which means
    131072 standard 4KiB pages in 'Normal' zone) is to:

    - allocate 120000 pages for kernel's usage
    - free every second page (60000 pages) of memory just allocated
    - allocate and use 60000 pages from user space
    - free remaining 60000 pages of kernel memory
    (now we have fragmented memory occupied mostly by user space pages)
    - try to allocate 100 order-9 (2048 KiB) pages for kernel's usage

    The results:
    - with compaction disabled I get 11 successful allocations
    - with compaction enabled - 14 successful allocations
    - with this patch I'm able to get all 100 successful allocations

    NOTE: If we can make kswapd aware of order-0 request during compaction, we
    can enhance kswapd with changing mode to COMPACT_ASYNC_FULL
    (COMPACT_ASYNC_MOVABLE + COMPACT_ASYNC_UNMOVABLE). Please see the
    following thread:

    http://marc.info/?l=linux-mm&m=133552069417068&w=2

    [minchan@kernel.org: minor cleanups]
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Marek Szyprowski
    Signed-off-by: Bartlomiej Zolnierkiewicz
    Signed-off-by: Kyungmin Park
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bartlomiej Zolnierkiewicz
     
  • Andrew pointed out that is_mlocked_vma() is misnamed. A function with a
    name like that would be expected to return bool and have no side-effects.

    Since it is called on the fault path for a new page, rename it in this
    patch.

    Signed-off-by: Ying Han
    Reviewed-by: Rik van Riel
    Acked-by: KOSAKI Motohiro
    Acked-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    [akpm@linux-foundation.org: s/mlock_vma_newpage/mlock_vma_newpage/, per Minchan]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     

21 May, 2012

1 commit

  • This commit exports some of the functions from the compaction.c file by
    adding their declarations to the internal.h header file so that other
    mm-related code can use them.

    This forces compaction.c to always be compiled (as opposed to being
    compiled only if CONFIG_COMPACTION is defined), but to avoid introducing
    code the user did not ask for, part of compaction.c is now wrapped in an
    #ifdef.

    Signed-off-by: Michal Nazarewicz
    Signed-off-by: Marek Szyprowski
    Acked-by: Mel Gorman
    Reviewed-by: KAMEZAWA Hiroyuki
    Tested-by: Rob Clark
    Tested-by: Ohad Ben-Cohen
    Tested-by: Benjamin Gaignard
    Tested-by: Robert Nelson
    Tested-by: Barry Song

    Michal Nazarewicz
     

03 Nov, 2011

1 commit

  • Michel, while working on the working set estimation code, noticed that
    calling get_page_unless_zero() on a random pfn_to_page(random_pfn)
    wasn't safe, if the pfn ended up being a tail page of a transparent
    hugepage under splitting by __split_huge_page_refcount().

    He then found the problem could also theoretically materialize with
    page_cache_get_speculative() during the speculative radix tree lookups
    that uses get_page_unless_zero() in SMP if the radix tree page is freed
    and reallocated and get_user_pages is called on it before
    page_cache_get_speculative has a chance to call get_page_unless_zero().

    So the best way to fix the problem is to keep page_tail->_count zero at
    all times. This will guarantee that get_page_unless_zero() can never
    succeed on any tail page. page_tail->_mapcount is guaranteed zero and
    is unused for all tail pages of a compound page, so we can simply
    account the tail page references there and transfer them to
    tail_page->_count in __split_huge_page_refcount() (in addition to the
    head_page->_mapcount).

    While debugging this s/_count/_mapcount/ change I also noticed get_page is
    called by direct-io.c on pages returned by get_user_pages. That wasn't
    entirely safe because the two atomic_inc in get_page weren't atomic.
    Other get_user_pages users, like the secondary-MMU page fault path that
    establishes the shadow pagetables, would never call any superfluous
    get_page after get_user_pages returns. It's safer to make get_page
    universally safe
    for tail pages and to use get_page_foll() within follow_page (inside
    get_user_pages()). get_page_foll() is safe to do the refcounting for tail
    pages without taking any locks because it is run within PT lock protected
    critical sections (PT lock for pte and page_table_lock for
    pmd_trans_huge).
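
    A sketch of that fast-path distinction, using the struct page fields of
    that era (the real get_page_foll()/__get_page_tail_foll() also carry
    VM_BUG_ON checks and handle the head page count in more detail):

    #include <linux/mm.h>

    /* Sketch: under the PT lock in follow_page()/GUP, a tail page's extra
     * references can be accounted in ->_mapcount; the head keeps ->_count. */
    static void get_page_foll_sketch(struct page *page)
    {
        if (unlikely(PageTail(page))) {
            atomic_inc(&page->first_page->_count);
            atomic_inc(&page->_mapcount);
        } else {
            atomic_inc(&page->_count);
        }
    }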

    The standard get_page() as invoked by direct-io instead will now take
    the compound_lock but still only for tail pages. The direct-io paths
    are usually I/O bound and the compound_lock is per THP so very
    fine-grained, so there's no risk of scalability issues with it. A simple
    direct-io benchmark with all the lockdep prove-locking and spinlock
    debugging infrastructure enabled shows identical performance and no
    overhead. So it's worth it. Ideally direct-io should stop calling
    get_page() on pages returned by get_user_pages(). The spinlock in
    get_page() is already optimized away for no-THP builds but doing
    get_page() on tail pages returned by GUP is generally a rare operation
    and usually only run in I/O paths.

    This new refcounting on page_tail->_mapcount in addition to avoiding new
    RCU critical sections will also allow the working set estimation code to
    work without any further complexity associated to the tail page
    refcounting with THP.

    Signed-off-by: Andrea Arcangeli
    Reported-by: Michel Lespinasse
    Reviewed-by: Michel Lespinasse
    Reviewed-by: Minchan Kim
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Benjamin Herrenschmidt
    Cc: David Gibson
    Cc:
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

25 May, 2011

1 commit

  • When I was reading nommu code, I found that it handles the vma list/tree
    in an unusual way. IIUC, because there can be more than one
    identical/overrapped vmas in the list/tree, it sorts the tree more
    strictly and does a linear search on the tree. But it doesn't applied to
    the list (i.e. the list could be constructed in a different order than
    the tree so that we can't use the list when finding the first vma in that
    order).

    Since inserting/sorting a vma in the tree and link is done at the same
    time, we can easily construct both of them in the same order. And linear
    searching on the tree could be more costly than doing it on the list, it
    can be converted to use the list.

    Also, after the commit 297c5eee3724 ("mm: make the vma list be doubly
    linked") made the list be doubly linked, there were a couple of code need
    to be fixed to construct the list properly.

    Patch 1/6 is a preparation. It maintains the list sorted same as the tree
    and construct doubly-linked list properly. Patch 2/6 is a simple
    optimization for the vma deletion. Patch 3/6 and 4/6 convert tree
    traversal to list traversal and the rest are simple fixes and cleanups.

    This patch:

    @vma added into @mm should be sorted by start addr, end addr and VMA
    struct addr in that order because we may get identical VMAs in the @mm.
    However this was true only for the rbtree, not for the list.

    This patch fixes this by remembering 'rb_prev' during the tree traversal
    like find_vma_prepare() does and linking the @vma via __vma_link_list().
    After this patch, we can iterate the whole VMAs in correct order simply by
    using @mm->mmap list.

    [akpm@linux-foundation.org: avoid duplicating __vma_link_list()]
    Signed-off-by: Namhyung Kim
    Acked-by: Greg Ungerer
    Cc: David Howells
    Cc: Paul Mundt
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
     

31 Mar, 2011

1 commit


18 Mar, 2011

1 commit

  • In most cases, get_user_pages and get_user_pages_fast should be used
    to pin user pages in memory. But sometimes special flags beyond FOLL_GET,
    FOLL_WRITE and FOLL_FORCE are needed; for example, in the following patch
    KVM needs FOLL_HWPOISON. To support these users,
    __get_user_pages is exported directly.

    There were some symbol name conflicts in the infiniband driver; they are
    fixed too.
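
    A sketch of the intended use, against the 2011-era eight-argument
    signature (the wrapper name and flag combination are illustrative):

    #include <linux/mm.h>
    #include <linux/sched.h>

    /* Sketch: pin one user page while asking for hwpoison awareness,
     * roughly what a KVM-style caller wants from the export. */
    static int pin_page_hwpoison_sketch(struct task_struct *tsk,
                                        struct mm_struct *mm,
                                        unsigned long addr, struct page **page)
    {
        return __get_user_pages(tsk, mm, addr, 1,
                                FOLL_GET | FOLL_WRITE | FOLL_HWPOISON,
                                page, NULL, NULL);
    }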

    Signed-off-by: Huang Ying
    CC: Andrew Morton
    CC: Michel Lespinasse
    CC: Roland Dreier
    CC: Ralph Campbell
    Signed-off-by: Marcelo Tosatti

    Huang Ying
     

18 Jan, 2011

1 commit

  • This reverts commit 744ed1442757767ffede5008bb13e0805085902e.

    Chris Mason ended up chasing down some page allocation errors and pages
    stuck waiting on the IO scheduler, and was able to narrow it down to two
    commits: commit 744ed1442757 ("mm: batch activate_page() to reduce lock
    contention") and d8505dee1a87 ("mm: simplify code of swap.c").

    This reverts the first of them.

    Reported-and-debugged-by: Chris Mason
    Cc: Mel Gorman
    Cc: Andrew Morton
    Cc: Jens Axboe
    Cc: linux-mm
    Cc: KAMEZAWA Hiroyuki
    Cc: Andrea Arcangeli
    Cc: Shaohua Li
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

14 Jan, 2011

3 commits

  • The zone->lru_lock is heavily contended in workloads where activate_page()
    is frequently used. We could batch activate_page() to reduce the lock
    contention. The batched pages will be added to the zone list when the pool
    is full or page reclaim is trying to drain them.

    For example, in a 4 socket 64 CPU system, create a sparse file and 64
    processes that share a mapping of the file. Each process reads the whole
    file and then exits. The process exit does unmap_vmas() and causes a lot
    of activate_page() calls. In such a workload we saw about a 58% total time
    reduction with the patch below. Other workloads with a lot of
    activate_page() also benefit.
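
    The batching follows the usual per-CPU pagevec pattern; roughly, as a
    sketch of the shape (the drain helper below is a placeholder for the real
    drain, which moves the queued pages to the active list and drops the
    references taken here):

    #include <linux/mm.h>
    #include <linux/pagevec.h>
    #include <linux/percpu.h>

    static DEFINE_PER_CPU(struct pagevec, activate_pvecs_sketch);

    /* Placeholder drain: the real one activates every queued page under a
     * single zone->lru_lock acquisition and releases the page references. */
    static void drain_activate_pvec_sketch(struct pagevec *pvec)
    {
        pagevec_reinit(pvec);
    }

    /* Sketch: batch activations per CPU so zone->lru_lock is taken once per
     * PAGEVEC_SIZE pages instead of once per page. */
    static void activate_page_sketch(struct page *page)
    {
        if (PageLRU(page) && !PageActive(page)) {
            struct pagevec *pvec = &get_cpu_var(activate_pvecs_sketch);

            get_page(page);
            if (!pagevec_add(pvec, page))   /* returns 0 once the pagevec is full */
                drain_activate_pvec_sketch(pvec);
            put_cpu_var(activate_pvecs_sketch);
        }
    }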

    I tested some microbenchmarks:
    case-anon-cow-rand-mt 0.58%
    case-anon-cow-rand -3.30%
    case-anon-cow-seq-mt -0.51%
    case-anon-cow-seq -5.68%
    case-anon-r-rand-mt 0.23%
    case-anon-r-rand 0.81%
    case-anon-r-seq-mt -0.71%
    case-anon-r-seq -1.99%
    case-anon-rx-rand-mt 2.11%
    case-anon-rx-seq-mt 3.46%
    case-anon-w-rand-mt -0.03%
    case-anon-w-rand -0.50%
    case-anon-w-seq-mt -1.08%
    case-anon-w-seq -0.12%
    case-anon-wx-rand-mt -5.02%
    case-anon-wx-seq-mt -1.43%
    case-fork 1.65%
    case-fork-sleep -0.07%
    case-fork-withmem 1.39%
    case-hugetlb -0.59%
    case-lru-file-mmap-read-mt -0.54%
    case-lru-file-mmap-read 0.61%
    case-lru-file-mmap-read-rand -2.24%
    case-lru-file-readonce -0.64%
    case-lru-file-readtwice -11.69%
    case-lru-memcg -1.35%
    case-mmap-pread-rand-mt 1.88%
    case-mmap-pread-rand -15.26%
    case-mmap-pread-seq-mt 0.89%
    case-mmap-pread-seq -69.72%
    case-mmap-xread-rand-mt 0.71%
    case-mmap-xread-seq-mt 0.38%

    The most significant are:
    case-lru-file-readtwice -11.69%
    case-mmap-pread-rand -15.26%
    case-mmap-pread-seq -69.72%

    which use activate_page a lot. The others are basically variations
    because each run has slight differences.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Shaohua Li
    Cc: Andi Kleen
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • Lately I've been working to make KVM use hugepages transparently without
    the usual restrictions of hugetlbfs. Some of the restrictions I'd like to
    see removed:

    1) hugepages have to be swappable or the guest physical memory remains
    locked in RAM and can't be paged out to swap

    2) if a hugepage allocation fails, regular pages should be allocated
    instead and mixed in the same vma without any failure and without
    userland noticing

    3) if some task quits and more hugepages become available in the
    buddy, guest physical memory backed by regular pages should be
    relocated on hugepages automatically in regions under
    madvise(MADV_HUGEPAGE) (ideally event driven by waking up the
    kernel daemon if the order=HPAGE_PMD_SHIFT-PAGE_SHIFT list becomes
    not null)

    4) avoidance of reservation and maximization of use of hugepages whenever
    possible. Reservation (needed to avoid runtime fatal failures) may be ok for
    1 machine with 1 database with 1 database cache with 1 database cache size
    known at boot time. It's definitely not feasible with a virtualization
    hypervisor usage like RHEV-H that runs an unknown number of virtual machines
    with an unknown size of each virtual machine with an unknown amount of
    pagecache that could be potentially useful in the host for guest not using
    O_DIRECT (aka cache=off).

    hugepages in the virtualization hypervisor (and also in the guest!) are
    much more important than in a regular host not using virtualization,
    because with NPT/EPT they decrease the tlb-miss cacheline accesses from 24
    to 19 in case only the hypervisor uses transparent hugepages, and they
    decrease the tlb-miss cacheline accesses from 19 to 15 in case both the
    linux hypervisor and the linux guest both uses this patch (though the
    guest will limit the additional speedup to anonymous regions only for
    now...). Even more important is that the tlb miss handler is much slower
    on a NPT/EPT guest than for a regular shadow paging or no-virtualization
    scenario. So maximizing the amount of virtual memory cached by the TLB
    pays off significantly more with NPT/EPT than without (even if there would
    be no significant speedup in the tlb-miss runtime).

    The first (and more tedious) part of this work requires allowing the VM to
    handle anonymous hugepages mixed with regular pages transparently on
    regular anonymous vmas. This is what this patch tries to achieve in the
    least intrusive possible way. We want hugepages and hugetlb to be used in
    a way so that all applications can benefit without changes (as usual we
    leverage the KVM virtualization design: by improving the Linux VM at
    large, KVM gets the performance boost too).

    The most important design choice is: always fallback to 4k allocation if
    the hugepage allocation fails! This is the _very_ opposite of some large
    pagecache patches that failed with -EIO back then if a 64k (or similar)
    allocation failed...

    Second important decision (to reduce the impact of the feature on the
    existing pagetable handling code) is that at any time we can split an
    hugepage into 512 regular pages and it has to be done with an operation
    that can't fail. This way the reliability of the swapping isn't decreased
    (no need to allocate memory when we are short on memory to swap) and it's
    trivial to plug a split_huge_page* one-liner where needed without
    polluting the VM. Over time we can teach mprotect, mremap and friends to
    handle pmd_trans_huge natively without calling split_huge_page*. The fact
    it can't fail isn't just for swap: if split_huge_page would return -ENOMEM
    (instead of the current void) we'd need to rollback the mprotect from the
    middle of it (ideally including undoing the split_vma) which would be a
    big change and in the very wrong direction (it'd likely be simpler not to
    call split_huge_page at all and to teach mprotect and friends to handle
    hugepages instead of rolling them back from the middle). In short the
    very value of split_huge_page is that it can't fail.

    The collapsing and madvise(MADV_HUGEPAGE) part will remain separated and
    incremental and it'll just be a "harmless" addition later if this initial
    part is agreed upon. It also should be noted that locking-wise replacing
    regular pages with hugepages is going to be very easy if compared to what
    I'm doing below in split_huge_page, as it will only happen when
    page_count(page) matches page_mapcount(page) if we can take the PG_lock
    and mmap_sem in write mode. collapse_huge_page will be a "best effort"
    that (unlike split_huge_page) can fail at the minimal sign of trouble and
    we can try again later. collapse_huge_page will be similar to how KSM
    works and the madvise(MADV_HUGEPAGE) will work similar to
    madvise(MADV_MERGEABLE).

    The default I like is that transparent hugepages are used at page fault
    time. This can be changed with
    /sys/kernel/mm/transparent_hugepage/enabled. The control knob can be set
    to three values "always", "madvise", "never" which mean respectively that
    hugepages are always used, or only inside madvise(MADV_HUGEPAGE) regions,
    or never used. /sys/kernel/mm/transparent_hugepage/defrag instead
    controls if the hugepage allocation should defrag memory aggressively
    "always", only inside "madvise" regions, or "never".

    The pmd_trans_splitting/pmd_trans_huge locking is very solid. The
    put_page (from get_user_page users that can't use the mmu notifier, like
    O_DIRECT) that runs against a concurrent __split_huge_page_refcount
    instead was a pain to serialize in a way that always results in a
    coherent page count for both tail and head. I think my locking solution,
    with a compound_lock taken only after the page_first is valid and is
    still a PageHead, should be safe, but it surely needs review from an SMP
    race point of view. In short, there is currently no way to serialize the
    final O_DIRECT put_page against __split_huge_page_refcount, so I had to
    invent a new one (O_DIRECT loses knowledge of the mapping status by the
    time gup_fast returns, so...). And I didn't want to impact all
    gup/gup_fast users for now; maybe if we change the gup interface
    substantially we can avoid this locking. I admit I didn't think too much
    about it because changing the gup unpinning interface would be invasive.

    If we ignored O_DIRECT we could stick to the existing compound refcounting
    code, by simply adding a get_user_pages_fast_flags(foll_flags) where KVM
    (and any other mmu notifier user) would call it without FOLL_GET (and if
    FOLL_GET isn't set we'd just BUG_ON if nobody registered itself in the
    current task mmu notifier list yet). But O_DIRECT is fundamental for
    decent performance of virtualized I/O on fast storage, so we can't avoid
    solving the race between put_page and __split_huge_page_refcount if we
    want a complete hugepage feature for KVM.

    Swap and OOM work fine (well, just like with regular pages ;). The MMU
    notifier is handled transparently too, with the exception of the young
    bit on the pmd, which didn't have a range check; but I think KVM will be
    fine because the whole point of hugepages is that EPT/NPT will also use a
    huge pmd when they notice gup returns pages with PageCompound set, so
    they won't care about a range and there's just the pmd young bit to check
    in that case.

    NOTE: in some cases, if the L2 cache is small, this may slow things down
    and waste memory during COWs, because 4M of memory is accessed in a
    single fault instead of 8k (the payoff is that after COW the program can
    run faster). So we might want to switch copy_huge_page (and
    clear_huge_page too) to non-temporal stores. I also extensively
    researched ways to avoid this cache thrashing with a full prefault logic
    that would cow in 8k/16k/32k/64k up to 1M (I can send those patches that
    fully implemented prefault), but I concluded they're not worth it: they
    add huge additional complexity and they remove all tlb benefits until the
    full hugepage has been faulted in, to save a little bit of memory and
    some cache during app startup, and they still don't substantially improve
    the cache thrashing during startup if the prefault happens in >4k chunks.
    One reason is that those copied 4k pte entries are still mapped on a
    perfectly cache-colored hugepage, so the thrashing is the worst one can
    generate in those copies (cows of 4k pages aren't so well colored, so
    they thrash less, but again this results in software running faster after
    the page fault). Those prefault patches allowed things like a pte where
    post-cow pages were local 4k regular anon pages and the not-yet-cowed pte
    entries were pointing into the middle of some hugepage mapped read-only.
    If it doesn't pay off substantially with today's hardware, it will pay
    off even less in the future with larger L2 caches, and the prefault logic
    would bloat the VM a lot.

    If one is embedded, transparent_hugepage can be disabled at runtime with
    sysfs or at boot with the command-line parameter transparent_hugepage=0
    (or transparent_hugepage=2 to restrict hugepages to madvise regions),
    which will ensure not a single hugepage is allocated at boot time. It is
    simple enough to just disable transparent hugepages globally and let
    transparent hugepages be allocated selectively by applications in
    MADV_HUGEPAGE regions (both at page fault time and, if enabled, through
    collapse_huge_page in the kernel daemon too).

    This patch supports only hugepages mapped in the pmd; archs that have
    smaller hugepages won't fit into this patch alone. Also some archs, like
    power, have certain tlb limits that prevent mixing different page sizes
    in the same region, so they won't fit into this framework either, which
    requires a "graceful fallback" to basic PAGE_SIZE in case of physical
    memory fragmentation. hugetlbfs remains a perfect fit for those because
    its software limits happen to match the hardware limits. hugetlbfs also
    remains a perfect fit for hugepage sizes like 1GByte, which can't be
    expected to be found unfragmented after a certain system uptime and would
    be very expensive to defragment with relocation, and so require
    reservation. hugetlbfs is the "reservation way"; the point of transparent
    hugepages is not to have any reservation at all and to maximize the use
    of cache and hugepages at all times automatically.

    Some performance results:

    vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
    memset page fault 1566023
    memset tlb miss 453854
    memset second tlb miss 453321
    random access tlb miss 41635
    random access second tlb miss 41658
    vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
    memset page fault 1566471
    memset tlb miss 453375
    memset second tlb miss 453320
    random access tlb miss 41636
    random access second tlb miss 41637
    vmx andrea # ./largepages3
    memset page fault 1566642
    memset tlb miss 453417
    memset second tlb miss 453313
    random access tlb miss 41630
    random access second tlb miss 41647
    vmx andrea # ./largepages3
    memset page fault 1566872
    memset tlb miss 453418
    memset second tlb miss 453315
    random access tlb miss 41618
    random access second tlb miss 41659
    vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage
    vmx andrea # ./largepages3
    memset page fault 2182476
    memset tlb miss 460305
    memset second tlb miss 460179
    random access tlb miss 44483
    random access second tlb miss 44186
    vmx andrea # ./largepages3
    memset page fault 2182791
    memset tlb miss 460742
    memset second tlb miss 459962
    random access tlb miss 43981
    random access second tlb miss 43988

    ============
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/time.h>

    #define SIZE (3UL*1024*1024*1024)

    int main()
    {
            char *p = malloc(SIZE), *p2;
            struct timeval before, after;

            if (!p)
                    return 1;

            /* first pass: page faults populate the 3G mapping */
            gettimeofday(&before, NULL);
            memset(p, 0, SIZE);
            gettimeofday(&after, NULL);
            printf("memset page fault %lu\n",
                   (after.tv_sec-before.tv_sec)*1000000UL +
                   after.tv_usec-before.tv_usec);

            /* later passes: memory already populated, measures tlb behaviour */
            gettimeofday(&before, NULL);
            memset(p, 0, SIZE);
            gettimeofday(&after, NULL);
            printf("memset tlb miss %lu\n",
                   (after.tv_sec-before.tv_sec)*1000000UL +
                   after.tv_usec-before.tv_usec);

            gettimeofday(&before, NULL);
            memset(p, 0, SIZE);
            gettimeofday(&after, NULL);
            printf("memset second tlb miss %lu\n",
                   (after.tv_sec-before.tv_sec)*1000000UL +
                   after.tv_usec-before.tv_usec);

            /* touch one byte per 4k page to stress the tlb rather than the
             * cache */
            gettimeofday(&before, NULL);
            for (p2 = p; p2 < p+SIZE; p2 += 4096)
                    *p2 = 0;
            gettimeofday(&after, NULL);
            printf("random access tlb miss %lu\n",
                   (after.tv_sec-before.tv_sec)*1000000UL +
                   after.tv_usec-before.tv_usec);

            gettimeofday(&before, NULL);
            for (p2 = p; p2 < p+SIZE; p2 += 4096)
                    *p2 = 0;
            gettimeofday(&after, NULL);
            printf("random access second tlb miss %lu\n",
                   (after.tv_sec-before.tv_sec)*1000000UL +
                   after.tv_usec-before.tv_usec);

            return 0;
    }
    ============

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • __get_user_pages gets a new 'nonblocking' parameter to signal that the
    caller is prepared to re-acquire mmap_sem and retry the operation if
    needed. This is used to split off long operations if they are going to
    block on a disk transfer, or when we detect contention on the mmap_sem.
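
    Schematically, the caller-side pattern looks like the sketch below
    (kernel-style pseudocode with a hypothetical wrapper name and simplified
    arguments, not the exact prototypes touched by this patch):

    /*
     * Sketch of the retry pattern described above; the wrapper name and the
     * simplified argument list are hypothetical.  When __get_user_pages()
     * has to block on a disk transfer it releases mmap_sem, clears
     * *nonblocking and returns how many pages it handled so far, so the
     * caller re-takes the semaphore and continues where it left off.
     */
    static long fault_in_user_pages(struct task_struct *tsk, struct mm_struct *mm,
                                    unsigned long start, long nr_pages,
                                    unsigned int gup_flags, struct page **pages)
    {
            int nonblocking = 1;
            long ret = 0;

            down_read(&mm->mmap_sem);
            while (nr_pages > 0) {
                    ret = __get_user_pages(tsk, mm, start, nr_pages, gup_flags,
                                           pages, NULL, &nonblocking);
                    if (ret <= 0)
                            break;
                    start += ret * PAGE_SIZE;
                    nr_pages -= ret;
                    if (pages)
                            pages += ret;
                    if (!nonblocking && nr_pages > 0) {
                            /* mmap_sem was dropped while blocking: retake it */
                            down_read(&mm->mmap_sem);
                            nonblocking = 1;
                    }
            }
            if (nonblocking)
                    up_read(&mm->mmap_sem);
            return ret;
    }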

    [akpm@linux-foundation.org: remove ref to rwsem_is_contended()]
    Signed-off-by: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: KOSAKI Motohiro
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

27 Oct, 2010

1 commit

  • page_order() is called by memory hotplug's user interface to check
    whether a section is removable or not (is_mem_section_removable()).

    It calls page_order() without holding zone->lock. So, even if the caller
    does

        if (PageBuddy(page))
                ret = page_order(page) ...

    the caller may hit BUG_ON().

    To fix this, there are two choices:
    1. take zone->lock, or
    2. remove the BUG_ON().

    is_mem_section_removable() only provides "advice" and doesn't need to be
    100% accurate, and it can be triggered from a user program; we don't want
    to hold this important lock for long at a user's request. So, this patch
    removes the BUG_ON().
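
    The resulting tolerant pattern is roughly the sketch below (an
    illustrative kernel-style helper with a made-up name and threshold, not
    the actual code being changed):

    /*
     * Illustrative sketch only; the helper name and min_order parameter are
     * made up.  Without zone->lock the PageBuddy flag and the buddy order
     * can be observed in an inconsistent state, so the result is treated as
     * a best-effort hint and never BUG()ed on.
     */
    static bool looks_like_free_chunk(struct page *page, unsigned int min_order)
    {
            if (!PageBuddy(page))
                    return false;

            /* racy read: may be stale, which is acceptable for advisory use */
            return page_order(page) >= min_order;
    }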

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Wu Fengguang
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

16 Dec, 2009

2 commits

  • In some use cases the user doesn't need extra filtering. E.g. a user
    program can inject errors through the madvise syscall into its own pages,
    however it might not know exactly what the page state is or which inode
    the page belongs to.

    So introduce a one-off interface "corrupt-filter-enable".

    Echo 0 to switch off the page filters, and echo 1 to switch them on.
    [AK: changed default to 0]

    Signed-off-by: Haicheng Li
    Signed-off-by: Wu Fengguang
    Signed-off-by: Andi Kleen

    Haicheng Li
     
  • The hwpoison test suite needs to inject hwpoison into a collection of
    selected task pages, and must not touch pages it does not own, which
    could kill important system processes such as init. (But it's OK to
    mis-hwpoison free/unowned pages as well as shared clean pages.
    Mis-hwpoisoning of shared dirty pages will kill all tasks, so the test
    suite will target all or none of such tasks in the first place.)

    The memory cgroup serves this purpose well. We can put the target
    processes under the control of a memory cgroup, and tell the hwpoison
    injection code to only kill pages associated with some active memory
    cgroup.

    The prerequisite for doing hwpoison stress tests with mem_cgroup is that
    the mem_cgroup code tracks task pages _accurately_ (unless the page is
    locked), which we believe is/should be true.

    The benefit is a simpler hwpoison injector code. Also, the mem_cgroup
    code will automatically be exercised by the hwpoison test cases.

    The alternative interfaces pin-pfn/unpin-pfn can also delegate the
    (process and page flags) filtering functions reliably to user space.
    However, a prototype implementation showed that this scheme adds more
    complexity than we wanted.

    Example test case:

    mkdir /cgroup/hwpoison

    usemem -m 100 -s 1000 &
    echo `jobs -p` > /cgroup/hwpoison/tasks

    memcg_ino=$(ls -id /cgroup/hwpoison | cut -f1 -d' ')
    echo $memcg_ino > /debug/hwpoison/corrupt-filter-memcg

    page-types -p `pidof init` --hwpoison # shall do nothing
    page-types -p `pidof usemem` --hwpoison # poison its pages

    [AK: Fix documentation]
    [Add fix for problem noticed by Li Zefan ;
    dentry in the css could be NULL]

    CC: KOSAKI Motohiro
    CC: Hugh Dickins
    CC: Daisuke Nishimura
    CC: Balbir Singh
    CC: KAMEZAWA Hiroyuki
    CC: Li Zefan
    CC: Paul Menage
    CC: Nick Piggin
    CC: Andi Kleen
    Signed-off-by: Wu Fengguang
    Signed-off-by: Andi Kleen

    Andi Kleen