10 Jul, 2013

1 commit


19 Dec, 2012

2 commits

  • When a process tries to allocate a page with the __GFP_KMEMCG flag, the
    page allocator will call the corresponding memcg functions to validate
    the allocation. Tasks in the root memcg can always proceed.

    To avoid adding markers to the page (and a kmem flag that would
    necessarily follow), and to avoid doing page_cgroup lookups for no
    reason, whoever marks its allocations with the __GFP_KMEMCG flag is
    responsible for telling the page allocator that this is such an
    allocation at free_pages() time. This is done by calling
    __free_accounted_pages() or free_accounted_pages(); a usage sketch
    follows this entry.

    Signed-off-by: Glauber Costa
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Acked-by: Kamezawa Hiroyuki
    Acked-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Johannes Weiner
    Cc: Suleiman Souhlal
    Cc: Tejun Heo
    Cc: Frederic Weisbecker
    Cc: Greg Thelen
    Cc: JoonSoo Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
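
    A minimal usage sketch of the pairing this commit describes, assuming the
    kernel-internal names introduced by this series (__GFP_KMEMCG,
    free_accounted_pages()); the example_* helpers are hypothetical:

        #include <linux/gfp.h>
        #include <linux/mm.h>

        /* Hypothetical kmem user: tagging the allocation with __GFP_KMEMCG
         * lets the page allocator charge it to current's memcg (tasks in the
         * root memcg always proceed). */
        static unsigned long example_charged_alloc(unsigned int order)
        {
                return __get_free_pages(GFP_KERNEL | __GFP_KMEMCG, order);
        }

        /* Whoever tagged the allocation must also tell the allocator at free
         * time, so no per-page marker or page_cgroup lookup is needed on the
         * free path. */
        static void example_charged_free(unsigned long addr, unsigned int order)
        {
                free_accounted_pages(addr, order);      /* instead of free_pages() */
        }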
     
  • The __GFP_KMEMCG flag is used to indicate to callees that this allocation
    is a kernel allocation made in process context, and should be accounted to
    the current task's memcg.

    Signed-off-by: Glauber Costa
    Acked-by: Johannes Weiner
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Acked-by: Kamezawa Hiroyuki
    Acked-by: Michal Hocko
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Suleiman Souhlal
    Cc: Tejun Heo
    Cc: David Rientjes
    Cc: Frederic Weisbecker
    Cc: Greg Thelen
    Cc: JoonSoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     

13 Dec, 2012

1 commit


12 Dec, 2012

1 commit


11 Dec, 2012

1 commit

  • This reverts commits a50915394f1fc02c2861d3b7ce7014788aa5066e and
    d7c3b937bdf45f0b844400b7bf6fd3ed50bac604.

    This is a revert of a revert of a revert. In addition, it reverts the
    even older i915 change that stopped using the __GFP_NO_KSWAPD flag, which
    had been made because of the original commits in linux-next.

    It turns out that the original patch really was bogus, and that the
    original revert was the correct thing to do after all. We thought we
    had fixed the problem, and then reverted the revert, but the problem
    really is fundamental: waking up kswapd simply isn't the right thing to
    do, and direct reclaim sometimes simply _is_ the right thing to do.

    When certain allocations fail, we simply should try some direct reclaim,
    and if that fails, fail the allocation. That's the right thing to do
    for THP allocations, which can easily fail, and the GPU allocations want
    to do that too.

    So starting kswapd is sometimes simply wrong, and removing the flag that
    said "don't start kswapd" was a mistake. Let's hope we never revisit
    this mistake again - and certainly not this many times ;)

    Acked-by: Mel Gorman
    Acked-by: Johannes Weiner
    Cc: Rik van Riel
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

01 Dec, 2012

1 commit

  • It appears that this patch was innocent, and we hope that "mm: avoid
    waking kswapd for THP allocations when compaction is deferred or
    contended" will fix the final kswapd-spinning cause.

    Cc: Zdenek Kabelac
    Cc: Seth Jennings
    Cc: Valdis Kletnieks
    Cc: Jiri Slaby
    Cc: Rik van Riel
    Cc: Robert Jennings
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

27 Nov, 2012

1 commit

  • With "mm: vmscan: scale number of pages reclaimed by reclaim/compaction
    based on failures" reverted, Zdenek Kabelac reported the following

    Hmm, so it just took longer to hit the problem and observe
    kswapd0 spinning on my CPU again - it's not as endless as before -
    but it still easily eats minutes - it helps to turn off Firefox
    or TB (memory-hungry apps) so kswapd0 stops soon - and restart
    those apps again. (And I still have like >1GB of cached memory)

    kswapd0 R running task 0 30 2 0x00000000
    Call Trace:
    preempt_schedule+0x42/0x60
    _raw_spin_unlock+0x55/0x60
    put_super+0x31/0x40
    drop_super+0x22/0x30
    prune_super+0x149/0x1b0
    shrink_slab+0xba/0x510

    The sysrq+m indicates the system has no swap so it'll never reclaim
    anonymous pages as part of reclaim/compaction. That is one part of the
    problem but not the root cause as file-backed pages could also be
    reclaimed.

    The likely underlying problem is that kswapd is woken up or kept awake
    for each THP allocation request in the page allocator slow path.

    If compaction fails for the requesting process then compaction will be
    deferred for a time and direct reclaim is avoided. However, if there
    is a storm of THP requests that are simply rejected, it will still be
    the case that kswapd is awake for a prolonged period of time as
    pgdat->kswapd_max_order is updated each time. This is noticed by the
    main kswapd() loop and it will not call kswapd_try_to_sleep(). Instead
    it will loop, shrinking a small number of pages and calling
    shrink_slab() on each iteration.

    The temptation is to supply a patch that checks if kswapd was woken for
    THP and if so ignore pgdat->kswapd_max_order but it'll be a hack and not
    backed up by proper testing. As 3.7 is very close to release and this
    is not a bug we should release with, a safer path is to revert "mm:
    remove __GFP_NO_KSWAPD" for now and revisit it with the view to ironing
    out the balance_pgdat() logic in general.

    Signed-off-by: Mel Gorman
    Cc: Zdenek Kabelac
    Cc: Seth Jennings
    Cc: Valdis Kletnieks
    Cc: Jiri Slaby
    Cc: Rik van Riel
    Cc: Robert Jennings
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

09 Oct, 2012

2 commits

  • There was a general sentiment in a recent discussion (See
    https://lkml.org/lkml/2012/9/18/258) that the __GFP flags should be
    defined unconditionally. Currently, the only offender is GFP_NOTRACK,
    which is conditional on KMEMCHECK.

    Signed-off-by: Glauber Costa
    Acked-by: Christoph Lameter
    Cc: Mel Gorman
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
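
    A sketch of the pattern this commit argues for, with GFP_NOTRACK as the
    example; the before/after shape is paraphrased from the description above
    and the ___GFP_NOTRACK spelling follows the existing ___GFP_* masks:

        /* Before: the flag only has its real value when the feature is
         * compiled in. */
        #ifdef CONFIG_KMEMCHECK
        #define __GFP_NOTRACK   ((__force gfp_t)___GFP_NOTRACK)
        #else
        #define __GFP_NOTRACK   ((__force gfp_t)0)
        #endif

        /* After: defined unconditionally; a disabled feature just ignores it. */
        #define __GFP_NOTRACK   ((__force gfp_t)___GFP_NOTRACK)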
     
  • When transparent huge pages were introduced, memory compaction and swap
    storms were an issue, and the kernel had to be careful to not make THP
    allocations cause pageout or compaction.

    Now that we have working compaction deferral, kswapd is smart enough to
    invoke compaction, and the quadratic behaviour around isolate_free_pages
    has been fixed, it should be safe to remove __GFP_NO_KSWAPD.

    [minchan@kernel.org: Comment fix]
    [mgorman@suse.de: Avoid direct reclaim for deferred compaction]
    Cc: Andrea Arcangeli
    Signed-off-by: Rik van Riel
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

01 Aug, 2012

2 commits

  • Change the skb allocation API to indicate RX usage and use this to fall
    back to the PFMEMALLOC reserve when needed. SKBs allocated from the
    reserve are tagged in skb->pfmemalloc. If an SKB is allocated from the
    reserve and the socket is later found to be unrelated to page reclaim, the
    packet is dropped so that the memory remains available for page reclaim.
    Network protocols are expected to recover from this packet loss.

    [a.p.zijlstra@chello.nl: Ideas taken from various patches]
    [davem@davemloft.net: Use static branches, coding style corrections]
    [sebastian@breakpoint.cc: Avoid unnecessary cast, fix !CONFIG_NET build]
    Signed-off-by: Mel Gorman
    Acked-by: David S. Miller
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • __GFP_MEMALLOC will allow the allocation to disregard the watermarks, much
    like PF_MEMALLOC. It allows one to pass along the memalloc state in
    object-related allocation flags as opposed to task-related flags such as
    sk->sk_allocation. This removes the need for ALLOC_PFMEMALLOC, as callers
    using __GFP_MEMALLOC can get the ALLOC_NO_WATERMARK flag, which is now
    enough to identify allocations related to page reclaim.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Cc: David Miller
    Cc: Neil Brown
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
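
    A hedged sketch of the kind of caller __GFP_MEMALLOC is meant for, per the
    description above: an allocation made so that memory reclaim can make
    progress (e.g. swap over a network socket), which may dip below the
    watermarks. The function name is hypothetical and error handling is
    omitted:

        #include <linux/gfp.h>
        #include <linux/skbuff.h>

        /* Hypothetical receive-path allocation for a socket that services
         * swap I/O: __GFP_MEMALLOC lets this one allocation ignore the
         * watermarks (as PF_MEMALLOC does for a whole task), because failing
         * it could deadlock reclaim. Ordinary sockets must not pass it. */
        static struct sk_buff *example_pfmemalloc_rx_alloc(unsigned int len)
        {
                return alloc_skb(len, GFP_ATOMIC | __GFP_MEMALLOC);
        }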
     

21 May, 2012

3 commits

  • This commit changes various functions that switch the migrate type of
    pages and pageblocks between MIGRATE_ISOLATE and MIGRATE_MOVABLE so that
    they can also work with the MIGRATE_CMA migrate type.

    Signed-off-by: Michal Nazarewicz
    Signed-off-by: Marek Szyprowski
    Reviewed-by: KAMEZAWA Hiroyuki
    Tested-by: Rob Clark
    Tested-by: Ohad Ben-Cohen
    Tested-by: Benjamin Gaignard
    Tested-by: Robert Nelson
    Tested-by: Barry Song

    Michal Nazarewicz
     
  • The MIGRATE_CMA migration type has two main characteristics:
    (i) only movable pages can be allocated from MIGRATE_CMA
    pageblocks, and (ii) the page allocator will never change the
    migration type of MIGRATE_CMA pageblocks.

    This guarantees (to some degree) that a page in a MIGRATE_CMA
    pageblock can always be migrated somewhere else (unless there's no
    memory left in the system).

    It is designed to be used for allocating big chunks (eg. 10MiB)
    of physically contiguous memory. Once a driver requests
    contiguous memory, pages from MIGRATE_CMA pageblocks may be
    migrated away to create a contiguous block.

    To minimise the number of migrations, the MIGRATE_CMA migration
    type is the last type tried when the page allocator falls back to
    other migration types.

    Signed-off-by: Michal Nazarewicz
    Signed-off-by: Marek Szyprowski
    Signed-off-by: Kyungmin Park
    Acked-by: Mel Gorman
    Reviewed-by: KAMEZAWA Hiroyuki
    Tested-by: Rob Clark
    Tested-by: Ohad Ben-Cohen
    Tested-by: Benjamin Gaignard
    Tested-by: Robert Nelson
    Tested-by: Barry Song

    Michal Nazarewicz
     
  • This commit adds the alloc_contig_range() function, which tries to
    allocate a given range of pages. It tries to migrate all already-allocated
    pages that fall in the range, thus freeing them. Once all pages in the
    range are free, they are removed from the buddy system and thus allocated
    for the caller to use; a usage sketch follows this entry.

    Signed-off-by: Michal Nazarewicz
    Signed-off-by: Marek Szyprowski
    Acked-by: Mel Gorman
    Reviewed-by: KAMEZAWA Hiroyuki
    Tested-by: Rob Clark
    Tested-by: Ohad Ben-Cohen
    Tested-by: Benjamin Gaignard
    Tested-by: Robert Nelson
    Tested-by: Barry Song

    Michal Nazarewicz
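
    A hedged sketch of how a CMA-style user might call the helper described
    above, assuming the signatures from this series: alloc_contig_range()
    takes an exclusive PFN range plus a migratetype and is paired with
    free_contig_range(); error handling is abbreviated:

        #include <linux/gfp.h>
        #include <linux/mm.h>

        /* Hypothetical: grab nr_pages physically contiguous pages starting at
         * base_pfn from a MIGRATE_CMA region. Already-allocated movable pages
         * in the range are migrated away before the range is handed to us. */
        static int example_grab_contig(unsigned long base_pfn,
                                       unsigned long nr_pages)
        {
                int ret;

                ret = alloc_contig_range(base_pfn, base_pfn + nr_pages,
                                         MIGRATE_CMA);
                if (ret)
                        return ret;     /* busy pages or failed migration */

                /* ... use pfn_to_page(base_pfn) onwards ... */

                free_contig_range(base_pfn, nr_pages);
                return 0;
        }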
     

11 Jan, 2012

4 commits

  • The maximum number of dirty pages that exist in the system at any time is
    determined by a number of pages considered dirtyable and a user-configured
    percentage of those, or an absolute number in bytes.

    This number of dirtyable pages is the sum of memory provided by all the
    zones in the system minus their lowmem reserves and high watermarks, so
    that the system can retain a healthy number of free pages without having
    to reclaim dirty pages.

    But there is a flaw in that we have a zoned page allocator which does not
    care about the global state but rather the state of individual memory
    zones. And right now there is nothing that prevents one zone from filling
    up with dirty pages while other zones are spared, which frequently leads
    to situations where kswapd, in order to restore the watermark of free
    pages, does indeed have to write pages from that zone's LRU list. This
    can interfere so badly with IO from the flusher threads that major
    filesystems (btrfs, xfs, ext4) mostly ignore write requests from reclaim
    already, taking away the VM's only possibility to keep such a zone
    balanced, aside from hoping the flushers will soon clean pages from that
    zone.

    Enter per-zone dirty limits. They are to a zone's dirtyable memory what
    the global limit is to the global amount of dirtyable memory, and try to
    make sure that no single zone receives more than its fair share of the
    globally allowed dirty pages in the first place. As the number of pages
    considered dirtyable excludes the zones' lowmem reserves and high
    watermarks, the maximum number of dirty pages in a zone is such that the
    zone can always be balanced without requiring page cleaning.

    As this is a placement decision in the page allocator and pages are
    dirtied only after the allocation, this patch allows allocators to pass
    __GFP_WRITE when they know in advance that the page will be written to and
    become dirty soon. The page allocator will then attempt to allocate from
    the first zone of the zonelist - which on NUMA is determined by the task's
    NUMA memory policy - that has not exceeded its dirty limit.

    At first glance, it would appear that the diversion to lower zones can
    increase pressure on them, but this is not the case. With a full high
    zone, allocations will be diverted to lower zones eventually, so it is
    more of a shift in timing of the lower zone allocations. Workloads that
    previously could fit their dirty pages completely in the higher zone may
    be forced to allocate from lower zones, but the number of pages that
    "spill over" is itself limited by the lower zones' dirty constraints, and
    thus unlikely to become a problem.

    For now, the problem of unfair dirty page distribution remains for NUMA
    configurations where the zones allowed for allocation are in sum not big
    enough to trigger the global dirty limits, wake up the flusher threads and
    remedy the situation. Because of this, an allocation that could not
    succeed on any of the considered zones is allowed to ignore the dirty
    limits before going into direct reclaim or even failing the allocation,
    until a future patch changes the global dirty throttling and flusher
    thread activation so that they take individual zone states into account.

    Test results

    15M DMA + 3246M DMA32 + 504M Normal = 3765M memory
    40% dirty ratio
    16G USB thumb drive
    10 runs of dd if=/dev/zero of=disk/zeroes bs=32k count=$((10 << 15))

    elapsed seconds (stddev)   nr_vmscan_write min | median | max
    xfs
    vanilla: 549.747( 3.492) 0.000| 0.000| 0.000
    patched: 550.996( 3.802) 0.000| 0.000| 0.000

    fuse-ntfs
    vanilla: 1183.094(53.178) 54349.000| 59341.000| 65163.000
    patched: 558.049(17.914) 0.000| 0.000| 43.000

    btrfs
    vanilla: 573.679(14.015) 156657.000| 460178.000| 606926.000
    patched: 563.365(11.368) 0.000| 0.000| 1362.000

    ext4
    vanilla: 561.197(15.782) 0.000|2725438.000|4143837.000
    patched: 568.806(17.496) 0.000| 0.000| 0.000

    Signed-off-by: Johannes Weiner
    Reviewed-by: Minchan Kim
    Acked-by: Mel Gorman
    Reviewed-by: Michal Hocko
    Tested-by: Wu Fengguang
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Shaohua Li
    Cc: Rik van Riel
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
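
    A hedged sketch of the placement hint described above: a caller that knows
    the page it is allocating will be dirtied shortly passes __GFP_WRITE so
    the allocator can pick a zone that has not yet used up its share of the
    dirty limit. The helper is hypothetical, not the pagecache code this patch
    actually touches:

        #include <linux/gfp.h>
        #include <linux/pagemap.h>

        /* Hypothetical write-path allocation: the page is about to be filled
         * with data and marked dirty, so let the allocator steer it to a zone
         * that is still under its per-zone dirty limit. */
        static struct page *example_alloc_for_write(struct address_space *mapping)
        {
                return __page_cache_alloc(mapping_gfp_mask(mapping) | __GFP_WRITE);
        }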
     
  • Calling alloc_pages_exact_node() means the allocation only passes the
    zonelist of a single node into the page allocator. If that node isn't
    online, its zonelist may never have been initialized, causing a strange
    oops that may not immediately be clear.

    I recently debugged an issue where node 0 wasn't online and an allocator
    was passing 0 to alloc_pages_exact_node() and it resulted in a NULL
    pointer on zonelist->_zoneref. If CONFIG_DEBUG_VM is enabled, though, it
    would be nice to catch this a bit earlier.

    Signed-off-by: David Rientjes
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
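
    Roughly the shape of the check described above, shown here as a
    hypothetical wrapper (the real change lives in alloc_pages_exact_node() in
    include/linux/gfp.h, and the exact predicate is an assumption):

        #include <linux/gfp.h>
        #include <linux/mm.h>
        #include <linux/nodemask.h>

        /* With CONFIG_DEBUG_VM, a bogus or offline nid is caught here instead
         * of surfacing later as a NULL dereference on an uninitialized
         * zonelist deep inside the allocator. */
        static inline struct page *example_alloc_on_node(int nid, gfp_t gfp_mask,
                                                         unsigned int order)
        {
                VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));

                return alloc_pages_node(nid, gfp_mask, order);
        }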
     
  • Colin Cross reported:

    Under the following conditions, __alloc_pages_slowpath can loop forever:
      - gfp_mask & __GFP_WAIT is true
      - gfp_mask & __GFP_FS is false
      - reclaim and compaction make no progress
      - order <= PAGE_ALLOC_COSTLY_ORDER

    Signed-off-by: Mel Gorman
    Acked-by: David Rientjes
    Cc: Minchan Kim
    Cc: Pekka Enberg
    Cc: KAMEZAWA Hiroyuki
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This patch adds the helper free_hot_cold_page_list() to free a list of
    0-order pages. It frees the pages directly from the list, without a
    temporary page-vector. It also calls trace_mm_pagevec_free() to simulate
    pagevec_free() behaviour; a usage sketch follows this entry.

    bloat-o-meter:

    add/remove: 1/1 grow/shrink: 1/3 up/down: 267/-295 (-28)
    function old new delta
    free_hot_cold_page_list - 264 +264
    get_page_from_freelist 2129 2132 +3
    __pagevec_free 243 239 -4
    split_free_page 380 373 -7
    release_pages 606 510 -96
    free_page_list 188 - -188

    Signed-off-by: Konstantin Khlebnikov
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Acked-by: Minchan Kim
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
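
    A hedged usage sketch of the new helper, assuming the signature
    free_hot_cold_page_list(struct list_head *, int cold) implied above and
    that pages are linked through page->lru; the batching loop is
    hypothetical:

        #include <linux/gfp.h>
        #include <linux/list.h>
        #include <linux/mm.h>

        /* Hypothetical batch free: collect 0-order pages on a plain list_head
         * and release them in one call, with no temporary pagevec. */
        static void example_release_batch(struct page **pages, int nr)
        {
                LIST_HEAD(free_list);
                int i;

                for (i = 0; i < nr; i++)
                        list_add(&pages[i]->lru, &free_list);

                free_hot_cold_page_list(&free_list, 0);   /* 0 == "hot" pages */
        }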
     

04 Aug, 2011

1 commit

  • __GFP_OTHER_NODE is used for NUMA allocations on behalf of other nodes.
    It's supposed to be passed through from the page allocator to
    zone_statistics(), but it never gets there as gfp_allowed_mask is not
    wide enough and masks out the flag early in the allocation path.

    The result is an accounting glitch where successful NUMA allocations
    by-agent are not properly attributed as local.

    Increase __GFP_BITS_SHIFT so that it includes __GFP_OTHER_NODE.

    Signed-off-by: Johannes Weiner
    Acked-by: Andi Kleen
    Reviewed-by: Minchan Kim
    Acked-by: Mel Gorman
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

25 May, 2011

2 commits

  • VM_BUG_ON() is effectively a BUG_ON() under #ifdef CONFIG_DEBUG_VM. That
    is exactly what we have here now, and two different folks have suggested
    doing it this way.

    Signed-off-by: Dave Hansen
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • Running sparse on page_alloc.c today, it errors out:
    include/linux/gfp.h:254:17: error: bad constant expression
    include/linux/gfp.h:254:17: error: cannot size expression

    which is a line in gfp_zone():

    BUILD_BUG_ON((GFP_ZONE_BAD >> bit) & 1);

    That's really unfortunate, because it ends up hiding all of the other
    legitimate sparse messages like this:
    mm/page_alloc.c:5315:59: warning: incorrect type in argument 1 (different base types)
    mm/page_alloc.c:5315:59: expected unsigned long [unsigned] [usertype] size
    mm/page_alloc.c:5315:59: got restricted gfp_t [usertype]
    ...

    Having sparse be able to catch these very oopsable bugs is a lot more
    important than keeping a BUILD_BUG_ON(). Kill the BUILD_BUG_ON().

    Compiles on x86_64 with and without CONFIG_DEBUG_VM=y. defconfig boots
    fine for me.

    Signed-off-by: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     

12 May, 2011

1 commit

  • Add an alloc_pages_exact_nid() that allocates on a specific node.

    The naming is quite broken, but fixing that would need a larger renaming
    action.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: tweak comment]
    Signed-off-by: Andi Kleen
    Cc: Michal Hocko
    Cc: Balbir Singh
    Cc: KOSAKI Motohiro
    Cc: Dave Hansen
    Cc: David Rientjes
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
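
    A hedged sketch of the new helper in use, assuming the signature
    alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask) paired with
    free_pages_exact(); the callers are hypothetical:

        #include <linux/gfp.h>

        /* Hypothetical per-node table: ask for an exact byte size (not a
         * power-of-two number of pages) backed by memory from node 'nid'. */
        static void *example_alloc_node_table(int nid, size_t bytes)
        {
                return alloc_pages_exact_nid(nid, bytes, GFP_KERNEL | __GFP_ZERO);
        }

        static void example_free_node_table(void *table, size_t bytes)
        {
                free_pages_exact(table, bytes);
        }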
     

23 Mar, 2011

1 commit

  • Add a new __GFP_OTHER_NODE flag to tell the low level numa statistics in
    zone_statistics() that an allocation is on behalf of another thread. This
    way the local and remote counters can still be correct, even when
    background daemons like khugepaged are changing memory mappings.

    This only affects the accounting, but I think it's worth doing that right
    to avoid confusing users.

    I first tried to just pass down the right node, but this required a lot of
    changes to pass down this parameter and at least one addition of a 10th
    argument to a 9 argument function. Using the flag is a lot less
    intrusive.

    Open question: should it also be used for migration?

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Andi Kleen
    Cc: Andrea Arcangeli
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

05 Mar, 2011

2 commits

  • Add an alloc_page_vma_node() that allows passing the "local" node in. It
    is used in a follow-on patch; a usage sketch follows this entry.

    Acked-by: Andrea Arcangeli
    Signed-off-by: Andi Kleen
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
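
    A hedged sketch assuming the helper takes the same arguments as
    alloc_page_vma() plus an explicit node, i.e. alloc_page_vma_node(gfp, vma,
    addr, node); the caller is hypothetical:

        #include <linux/gfp.h>
        #include <linux/mm.h>

        /* Hypothetical fault-path allocation that honours the vma's mempolicy
         * but treats 'node' (rather than numa_node_id()) as the "local" node,
         * e.g. when allocating on behalf of another task as khugepaged does. */
        static struct page *example_alloc_for_vma(struct vm_area_struct *vma,
                                                  unsigned long addr, int node)
        {
                return alloc_page_vma_node(GFP_HIGHUSER_MOVABLE, vma, addr, node);
        }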
     
  • Currently alloc_pages_vma() always uses the local node as policy node for
    the LOCAL policy. Pass this node down as an argument instead.

    No behaviour change from this patch, but it will be needed for follow-ons.

    Acked-by: Andrea Arcangeli
    Signed-off-by: Andi Kleen
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

24 Jan, 2011

1 commit


14 Jan, 2011

3 commits

  • It's mostly a matter of replacing alloc_pages with alloc_pages_vma after
    introducing alloc_pages_vma. khugepaged needs special handling as the
    allocation has to happen inside collapse_huge_page where the vma is known
    and an error has to be returned to the outer loop to sleep
    alloc_sleep_millisecs in case of failure. But it retains the more
    efficient logic of handling allocation failures in khugepaged in case of
    CONFIG_NUMA=n.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Lately I've been working to make KVM use hugepages transparently without
    the usual restrictions of hugetlbfs. Some of the restrictions I'd like to
    see removed:

    1) hugepages have to be swappable or the guest physical memory remains
    locked in RAM and can't be paged out to swap

    2) if a hugepage allocation fails, regular pages should be allocated
    instead and mixed in the same vma without any failure and without
    userland noticing

    3) if some task quits and more hugepages become available in the
    buddy, guest physical memory backed by regular pages should be
    relocated on hugepages automatically in regions under
    madvise(MADV_HUGEPAGE) (ideally event driven by waking up the
    kernel daemon if the order=HPAGE_PMD_SHIFT-PAGE_SHIFT list becomes
    not null)

    4) avoidance of reservation and maximization of use of hugepages whenever
    possible. Reservation (needed to avoid runtime fatal failures) may be ok for
    1 machine with 1 database with 1 database cache with 1 database cache size
    known at boot time. It's definitely not feasible with a virtualization
    hypervisor usage like RHEV-H that runs an unknown number of virtual machines
    with an unknown size of each virtual machine with an unknown amount of
    pagecache that could be potentially useful in the host for guest not using
    O_DIRECT (aka cache=off).

    hugepages in the virtualization hypervisor (and also in the guest!) are
    much more important than in a regular host not using virtualization,
    because with NPT/EPT they decrease the tlb-miss cacheline accesses from 24
    to 19 in case only the hypervisor uses transparent hugepages, and they
    decrease the tlb-miss cacheline accesses from 19 to 15 in case both the
    linux hypervisor and the linux guest use this patch (though the
    guest will limit the additional speedup to anonymous regions only for
    now...). Even more important is that the tlb miss handler is much slower
    on a NPT/EPT guest than for a regular shadow paging or no-virtualization
    scenario. So maximizing the amount of virtual memory cached by the TLB
    pays off significantly more with NPT/EPT than without (even if there would
    be no significant speedup in the tlb-miss runtime).

    The first (and more tedious) part of this work requires allowing the VM to
    handle anonymous hugepages mixed with regular pages transparently on
    regular anonymous vmas. This is what this patch tries to achieve in the
    least intrusive possible way. We want hugepages and hugetlb to be used in
    a way so that all applications can benefit without changes (as usual we
    leverage the KVM virtualization design: by improving the Linux VM at
    large, KVM gets the performance boost too).

    The most important design choice is: always fallback to 4k allocation if
    the hugepage allocation fails! This is the _very_ opposite of some large
    pagecache patches that failed with -EIO back then if a 64k (or similar)
    allocation failed...

    Second important decision (to reduce the impact of the feature on the
    existing pagetable handling code) is that at any time we can split a
    hugepage into 512 regular pages and it has to be done with an operation
    that can't fail. This way the reliability of the swapping isn't decreased
    (no need to allocate memory when we are short on memory to swap) and it's
    trivial to plug a split_huge_page* one-liner where needed without
    polluting the VM. Over time we can teach mprotect, mremap and friends to
    handle pmd_trans_huge natively without calling split_huge_page*. The fact
    it can't fail isn't just for swap: if split_huge_page would return -ENOMEM
    (instead of the current void) we'd need to rollback the mprotect from the
    middle of it (ideally including undoing the split_vma) which would be a
    big change and in the very wrong direction (it'd likely be simpler not to
    call split_huge_page at all and to teach mprotect and friends to handle
    hugepages instead of rolling them back from the middle). In short the
    very value of split_huge_page is that it can't fail.

    The collapsing and madvise(MADV_HUGEPAGE) part will remain separated and
    incremental and it'll just be a "harmless" addition later if this initial
    part is agreed upon. It also should be noted that locking-wise replacing
    regular pages with hugepages is going to be very easy if compared to what
    I'm doing below in split_huge_page, as it will only happen when
    page_count(page) matches page_mapcount(page) if we can take the PG_lock
    and mmap_sem in write mode. collapse_huge_page will be a "best effort"
    that (unlike split_huge_page) can fail at the minimal sign of trouble and
    we can try again later. collapse_huge_page will be similar to how KSM
    works and the madvise(MADV_HUGEPAGE) will work similar to
    madvise(MADV_MERGEABLE).

    The default I like is that transparent hugepages are used at page fault
    time. This can be changed with
    /sys/kernel/mm/transparent_hugepage/enabled. The control knob can be set
    to three values "always", "madvise", "never" which mean respectively that
    hugepages are always used, or only inside madvise(MADV_HUGEPAGE) regions,
    or never used. /sys/kernel/mm/transparent_hugepage/defrag instead
    controls if the hugepage allocation should defrag memory aggressively
    "always", only inside "madvise" regions, or "never".

    The pmd_trans_splitting/pmd_trans_huge locking is very solid. The
    put_page (from get_user_page users that can't use mmu notifier like
    O_DIRECT) that runs against a __split_huge_page_refcount instead was a
    pain to serialize in a way that would result always in a coherent page
    count for both tail and head. I think my locking solution with a
    compound_lock taken only after the page_first is valid and is still a
    PageHead should be safe but it surely needs review from SMP race point of
    view. In short there is no current existing way to serialize the O_DIRECT
    final put_page against split_huge_page_refcount so I had to invent a new
    one (O_DIRECT loses knowledge on the mapping status by the time gup_fast
    returns so...). And I didn't want to impact all gup/gup_fast users for
    now, maybe if we change the gup interface substantially we can avoid this
    locking, I admit I didn't think too much about it because changing the gup
    unpinning interface would be invasive.

    If we ignored O_DIRECT we could stick to the existing compound refcounting
    code, by simply adding a get_user_pages_fast_flags(foll_flags) where KVM
    (and any other mmu notifier user) would call it without FOLL_GET (and if
    FOLL_GET isn't set we'd just BUG_ON if nobody registered itself in the
    current task mmu notifier list yet). But O_DIRECT is fundamental for
    decent performance of virtualized I/O on fast storage so we can't avoid it
    to solve the race of put_page against split_huge_page_refcount to achieve
    a complete hugepage feature for KVM.

    Swap and oom works fine (well just like with regular pages ;). MMU
    notifier is handled transparently too, with the exception of the young bit
    on the pmd, that didn't have a range check but I think KVM will be fine
    because the whole point of hugepages is that EPT/NPT will also use a huge
    pmd when they notice gup returns pages with PageCompound set, so they
    won't care of a range and there's just the pmd young bit to check in that
    case.

    NOTE: in some cases if the L2 cache is small, this may slow down and waste
    memory during COWs because 4M of memory are accessed in a single fault
    instead of 8k (the payoff is that after COW the program can run faster).
    So we might want to switch the copy_huge_page (and clear_huge_page too) to
    non-temporal stores. I also extensively researched ways to avoid this
    cache thrashing with a full prefault logic that would cow in 8k/16k/32k/64k
    up to 1M (I can send those patches that fully implemented prefault) but I
    concluded they're not worth it and they add a huge additional complexity
    and they remove all tlb benefits until the full hugepage has been faulted
    in, to save a little bit of memory and some cache during app startup, but
    they still don't substantially improve the cache thrashing during startup
    if the prefault happens in >4k chunks. One reason is that those 4k pte
    entries copied are still mapped on a perfectly cache-colored hugepage, so
    the thrashing is the worst one can generate in those copies (cow of 4k page
    copies aren't so well colored so they thrash less, but again this results
    in software running faster after the page fault). Those prefault patches
    allowed things like a pte where post-cow pages were local 4k regular anon
    pages and the not-yet-cowed pte entries were pointing in the middle of
    some hugepage mapped read-only. If it doesn't pay off substantially with
    today's hardware it will pay off even less in the future with larger l2
    caches, and the prefault logic would bloat the VM a lot. If one is
    embedded, transparent_hugepage can be disabled during boot with sysfs or
    with the boot commandline parameter transparent_hugepage=0 (or
    transparent_hugepage=2 to restrict hugepages inside madvise regions) that
    will ensure not a single hugepage is allocated at boot time. It is simple
    enough to just disable transparent hugepage globally and let transparent
    hugepages be allocated selectively by applications in the MADV_HUGEPAGE
    region (both at page fault time, and if enabled with the
    collapse_huge_page too through the kernel daemon).

    This patch supports only hugepages mapped in the pmd, archs that have
    smaller hugepages will not fit in this patch alone. Also some archs like
    power have certain tlb limits that prevents mixing different page size in
    the same regions so they will not fit in this framework that requires
    "graceful fallback" to basic PAGE_SIZE in case of physical memory
    fragmentation. hugetlbfs remains a perfect fit for those because its
    software limits happen to match the hardware limits. hugetlbfs also
    remains a perfect fit for hugepage sizes like 1GByte that cannot be hoped
    to be found not fragmented after a certain system uptime and that would be
    very expensive to defragment with relocation, so requiring reservation.
    hugetlbfs is the "reservation way", the point of transparent hugepages is
    not to have any reservation at all and maximizing the use of cache and
    hugepages at all times automatically.

    Some performance result:

    vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
    memset page fault 1566023
    memset tlb miss 453854
    memset second tlb miss 453321
    random access tlb miss 41635
    random access second tlb miss 41658
    vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
    memset page fault 1566471
    memset tlb miss 453375
    memset second tlb miss 453320
    random access tlb miss 41636
    random access second tlb miss 41637
    vmx andrea # ./largepages3
    memset page fault 1566642
    memset tlb miss 453417
    memset second tlb miss 453313
    random access tlb miss 41630
    random access second tlb miss 41647
    vmx andrea # ./largepages3
    memset page fault 1566872
    memset tlb miss 453418
    memset second tlb miss 453315
    random access tlb miss 41618
    random access second tlb miss 41659
    vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage
    vmx andrea # ./largepages3
    memset page fault 2182476
    memset tlb miss 460305
    memset second tlb miss 460179
    random access tlb miss 44483
    random access second tlb miss 44186
    vmx andrea # ./largepages3
    memset page fault 2182791
    memset tlb miss 460742
    memset second tlb miss 459962
    random access tlb miss 43981
    random access second tlb miss 43988

    ============
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/time.h>

    #define SIZE (3UL*1024*1024*1024)

    int main()
    {
        char *p = malloc(SIZE), *p2;
        struct timeval before, after;

        gettimeofday(&before, NULL);
        memset(p, 0, SIZE);
        gettimeofday(&after, NULL);
        printf("memset page fault %Lu\n",
               (after.tv_sec-before.tv_sec)*1000000UL +
               after.tv_usec-before.tv_usec);

        gettimeofday(&before, NULL);
        memset(p, 0, SIZE);
        gettimeofday(&after, NULL);
        printf("memset tlb miss %Lu\n",
               (after.tv_sec-before.tv_sec)*1000000UL +
               after.tv_usec-before.tv_usec);

        gettimeofday(&before, NULL);
        memset(p, 0, SIZE);
        gettimeofday(&after, NULL);
        printf("memset second tlb miss %Lu\n",
               (after.tv_sec-before.tv_sec)*1000000UL +
               after.tv_usec-before.tv_usec);

        gettimeofday(&before, NULL);
        for (p2 = p; p2 < p+SIZE; p2 += 4096)
                *p2 = 0;
        gettimeofday(&after, NULL);
        printf("random access tlb miss %Lu\n",
               (after.tv_sec-before.tv_sec)*1000000UL +
               after.tv_usec-before.tv_usec);

        gettimeofday(&before, NULL);
        for (p2 = p; p2 < p+SIZE; p2 += 4096)
                *p2 = 0;
        gettimeofday(&after, NULL);
        printf("random access second tlb miss %Lu\n",
               (after.tv_sec-before.tv_sec)*1000000UL +
               after.tv_usec-before.tv_usec);

        return 0;
    }
    ============

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Transparent hugepage allocations must be allowed not to invoke kswapd or
    any other kind of indirect reclaim (especially when the defrag sysfs
    control is disabled). It's unacceptable to swap out anonymous pages
    (potentially anonymous transparent hugepages) in order to create new
    transparent hugepages. This is true for the MADV_HUGEPAGE areas too
    (swapping out a kvm virtual machine and so having it suffer an unbearable
    slowdown, so another one with guest physical memory marked MADV_HUGEPAGE
    can run 30% faster if it is running memory intensive workloads, makes no
    sense). If a transparent hugepage allocation fails the slowdown is minor
    and there is total fallback, so kswapd should never be asked to swapout
    memory to allow the high order allocation to succeed.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

07 Dec, 2010

1 commit

  • There is a problem that swap pages allocated before the creation of
    a hibernation image can be released and used for storing the contents
    of different memory pages while the image is being saved. Since the
    kernel stored in the image doesn't know of that, it causes memory
    corruption to occur after resume from hibernation, especially on
    systems with relatively small RAM that need to swap often.

    This issue can be addressed by keeping the GFP_IOFS bits clear
    in gfp_allowed_mask during the entire hibernation, including the
    saving of the image, until the system is finally turned off or
    the hibernation is aborted. Unfortunately, for this purpose
    it's necessary to rework the way in which the hibernate and
    suspend code manipulates gfp_allowed_mask.

    This change is based on an earlier patch from Hugh Dickins.

    Signed-off-by: Rafael J. Wysocki
    Reported-by: Ondrej Zary
    Acked-by: Hugh Dickins
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: stable@kernel.org

    Rafael J. Wysocki
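
    A hedged sketch of the gfp_allowed_mask manipulation described above: the
    GFP_IOFS bits (__GFP_IO | __GFP_FS) are masked out before the hibernation
    image is created and only restored once the system resumes or hibernation
    is aborted. The names follow the pm_restrict_gfp_mask()/
    pm_restore_gfp_mask() pair this rework is built around, but treat the
    bodies as an approximation:

        #include <linux/gfp.h>

        static gfp_t saved_gfp_mask;

        /* Called before devices are suspended: from here on, any GFP_KERNEL
         * allocation silently loses __GFP_IO/__GFP_FS, so it cannot wait on
         * I/O to hardware that is already asleep. */
        void pm_restrict_gfp_mask(void)
        {
                saved_gfp_mask = gfp_allowed_mask;
                gfp_allowed_mask &= ~GFP_IOFS;
        }

        /* Called after devices are resumed, or when hibernation is aborted. */
        void pm_restore_gfp_mask(void)
        {
                if (saved_gfp_mask) {
                        gfp_allowed_mask = saved_gfp_mask;
                        saved_gfp_mask = 0;
                }
        }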
     

27 Oct, 2010

1 commit

  • Introduce ___GFP_* masks so that gfp_t values are not mixed with plain
    integers, which causes a lot of warnings like the following:

    warning: restricted gfp_t degrades to integer

    Signed-off-by: Namhyung Kim
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
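
    A sketch of the pattern this commit introduces: plain unsigned ___GFP_*
    masks for bit arithmetic, with the sparse-typed __GFP_* flags defined on
    top of them. The bit values shown are illustrative:

        /* Plain integer masks: safe in preprocessor and integer arithmetic
         * without "restricted gfp_t degrades to integer" warnings. */
        #define ___GFP_DMA              0x01u
        #define ___GFP_HIGHMEM          0x02u
        #define ___GFP_WAIT             0x10u

        /* The public flags keep the __bitwise gfp_t type that sparse checks. */
        #define __GFP_DMA       ((__force gfp_t)___GFP_DMA)
        #define __GFP_HIGHMEM   ((__force gfp_t)___GFP_HIGHMEM)
        #define __GFP_WAIT      ((__force gfp_t)___GFP_WAIT)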
     

25 May, 2010

2 commits


07 Mar, 2010

3 commits

  • __GFP_NOFAIL was deprecated in dab48dab, so add a comment that no new
    users should be added.

    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • There are quite a few GFP_KERNEL memory allocations made during
    suspend/hibernation and resume that may cause the system to hang, because
    the I/O operations they depend on cannot be completed due to the
    underlying devices being suspended.

    Avoid this problem by clearing the __GFP_IO and __GFP_FS bits in
    gfp_allowed_mask before suspend/hibernation and restoring the original
    values of these bits in gfp_allowed_mask during the subsequent resume.

    [akpm@linux-foundation.org: fix CONFIG_PM=n linkage]
    Signed-off-by: Rafael J. Wysocki
    Reported-by: Maxim Levitsky
    Cc: Sebastian Ott
    Cc: Benjamin Herrenschmidt
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • free_hot_page() is just a wrapper around free_hot_cold_page() with
    parameter 'cold = 0'. After adding a clear comment for
    free_hot_cold_page(), it is reasonable to remove a level of call.

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Li Hong
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Cc: Larry Woodman
    Cc: Peter Zijlstra
    Cc: Li Ming Chun
    Cc: KOSAKI Motohiro
    Cc: Americo Wang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Hong
     

23 Sep, 2009

1 commit

  • gcc permitting variable length arrays makes the current construct used for
    BUILD_BUG_ON() useless, as that doesn't produce any diagnostic if the
    controlling expression isn't really constant. Instead, this patch makes
    it so that a bit field gets used here. Consequently, those uses where the
    condition isn't really constant now also need fixing.

    Note that in the gfp.h, kmemcheck.h, and virtio_config.h cases
    MAYBE_BUILD_BUG_ON() really just serves documentation purposes - even if
    the expression is compile time constant (__builtin_constant_p() yields
    true), the array is still deemed of variable length by gcc, and hence the
    whole expression doesn't have the intended effect.

    [akpm@linux-foundation.org: make arch/sparc/include/asm/vio.h compile]
    [akpm@linux-foundation.org: more nonsensical assertions in tpm.c..]
    Signed-off-by: Jan Beulich
    Cc: Andi Kleen
    Cc: Rusty Russell
    Cc: Catalin Marinas
    Cc: "David S. Miller"
    Cc: Rajiv Andrade
    Cc: Mimi Zohar
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Beulich
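
    A sketch of the bit-field trick described above: a negative bit-field
    width is a hard compile error even where gcc would quietly accept a
    variable length array. The macro name is illustrative, not the exact form
    the patch uses:

        /* If 'cond' is a non-zero compile-time constant, the anonymous struct
         * gets a bit-field of negative width, which no compiler accepts; if
         * 'cond' is zero, the whole expression compiles away. */
        #define EXAMPLE_BUILD_BUG_ON(cond) \
                ((void)sizeof(struct { int:-!!(cond); }))

        static inline void example_static_checks(void)
        {
                EXAMPLE_BUILD_BUG_ON(sizeof(long) < sizeof(int)); /* false: ok */
                /* EXAMPLE_BUILD_BUG_ON(1); would fail to compile */
        }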
     

22 Sep, 2009

2 commits