
09 Oct, 2012

21 commits

  • reclaim_clean_pages_from_list() reclaims clean pages before migration, so
    cc.nr_migratepages should be updated accordingly. Currently this causes no
    problem, but the stale value could become a bug if we try to use it in the
    future.

    Signed-off-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Marek Szyprowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Presently CMA cannot migrate mlocked pages, so it ends up failing to
    allocate contiguous memory space.

    This patch makes mlocked pages migratable. Of course, this can affect
    realtime processes, but in the CMA use case a failed contiguous memory
    allocation is far worse than variable access latency to an mlocked page
    while CMA is running. Anyone who wants a realtime system shouldn't enable
    CMA anyway, because stalls can still happen at random times.

    [akpm@linux-foundation.org: tweak comment text, per Mel]
    Signed-off-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Marek Szyprowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • During memory-hotplug, I found that NR_ISOLATED_[ANON|FILE] keeps
    increasing, causing the kernel to hang. When the system doesn't have
    enough free pages, it enters reclaim but never reclaims any pages because
    too_many_isolated() is true, so it loops forever.

    The cause is that when we do a memory hot-add after a memory remove,
    __zone_pcp_update() clears a zone's ZONE_STAT_ITEMS in setup_pageset()
    although the vm_stat_diff of all CPUs still holds values.

    In addition, when we offline all pages of the zone, we reset them in
    zone_pcp_reset() without draining, so we lose some zone stat items.

    Reviewed-by: Wen Congyang
    Signed-off-by: Minchan Kim
    Cc: Kamezawa Hiroyuki
    Cc: Yasuaki Ishimatsu
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
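
    A minimal user-space sketch of the accounting hazard described above, using
    made-up names (zone_nr_isolated, vm_stat_diff, drain_zone_stats) rather than
    the kernel's: if the per-CPU deltas are wiped instead of being folded back
    before the zone counters are reset, the global NR_ISOLATED view stays wrong
    and too_many_isolated() can keep returning true.

    #include <stdio.h>

    #define NR_CPUS 2

    /* Hypothetical model: a zone-wide counter plus batched per-CPU deltas. */
    static long zone_nr_isolated;           /* like the zone's NR_ISOLATED_ANON */
    static long vm_stat_diff[NR_CPUS];      /* per-CPU deltas, folded back later */

    static void mod_state(int cpu, long delta)
    {
        vm_stat_diff[cpu] += delta;
    }

    /* Fold all per-CPU deltas into the zone counter and clear them. */
    static void drain_zone_stats(void)
    {
        for (int cpu = 0; cpu < NR_CPUS; cpu++) {
            zone_nr_isolated += vm_stat_diff[cpu];
            vm_stat_diff[cpu] = 0;
        }
    }

    int main(void)
    {
        zone_nr_isolated = 4;               /* 4 isolated pages, counted globally */
        mod_state(0, -3);                   /* they are put back, but only the   */
        mod_state(1, -1);                   /* per-CPU deltas know about it      */

        /* Buggy teardown: wipe the deltas without folding them in first
         * (this mirrors resetting the pageset/zone stats without a drain). */
        vm_stat_diff[0] = vm_stat_diff[1] = 0;
        printf("without draining: NR_ISOLATED = %ld (should be 0)\n",
               zone_nr_isolated);

        /* Correct teardown: the same sequence again, but drained first. */
        zone_nr_isolated = 4;
        mod_state(0, -3);
        mod_state(1, -1);
        drain_zone_stats();
        printf("with draining:    NR_ISOLATED = %ld\n", zone_nr_isolated);
        return 0;
    }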
     
  • RECLAIM_DISTANCE represents the distance between nodes at which it is
    deemed too costly to allocate from; it's preferred to try to reclaim from
    a local zone before falling back to allocating on a remote node with such
    a distance.

    To do this, zone_reclaim_mode is set if the distance between any two
    nodes on the system is greater than this distance. This, however, ends
    up causing the page allocator to reclaim from every zone regardless of
    its affinity.

    What we really want is to reclaim only from zones that are closer than
    RECLAIM_DISTANCE. This patch adds a nodemask to each node that
    represents the set of nodes that are within this distance. During the
    zone iteration, if the bit for a zone's node is set for the local node,
    then reclaim is attempted; otherwise, the zone is skipped.

    [akpm@linux-foundation.org: fix CONFIG_NUMA=n build]
    Signed-off-by: David Rientjes
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
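
    A rough user-space model of the idea, with invented names (the node_distance
    table, reclaim_nodes mask and should_try_zone_reclaim helper); the real
    kernel keeps a per-node nodemask, but the shape of the check is the same:
    build the mask once from the distance table, then consult it during the
    zone walk.

    #include <stdio.h>
    #include <stdbool.h>

    #define MAX_NODES        4
    #define RECLAIM_DISTANCE 30          /* threshold, as in the entry above */

    /* Hypothetical NUMA distance table, node_distance[a][b]. */
    static const int node_distance[MAX_NODES][MAX_NODES] = {
        { 10, 20, 40, 40 },
        { 20, 10, 40, 40 },
        { 40, 40, 10, 20 },
        { 40, 40, 20, 10 },
    };

    /* reclaim_nodes[n] is the set of nodes close enough to n to reclaim from. */
    static unsigned int reclaim_nodes[MAX_NODES];

    static void build_reclaim_masks(void)
    {
        for (int n = 0; n < MAX_NODES; n++)
            for (int m = 0; m < MAX_NODES; m++)
                if (node_distance[n][m] <= RECLAIM_DISTANCE)
                    reclaim_nodes[n] |= 1u << m;
    }

    /* During the zonelist walk: reclaim only if zone_node is in our mask. */
    static bool should_try_zone_reclaim(int local_node, int zone_node)
    {
        return reclaim_nodes[local_node] & (1u << zone_node);
    }

    int main(void)
    {
        build_reclaim_masks();
        for (int zone_node = 0; zone_node < MAX_NODES; zone_node++)
            printf("local node 0 -> node %d: %s\n", zone_node,
                   should_try_zone_reclaim(0, zone_node) ?
                   "try zone_reclaim" : "skip (too far)");
        return 0;
    }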
     
  • We should not be seeing non-0 unevictable_pgs_mlockfreed any longer. So
    remove free_page_mlock() from the page freeing paths: __PG_MLOCKED is
    already in PAGE_FLAGS_CHECK_AT_FREE, so free_pages_check() will now be
    checking it, reporting "BUG: Bad page state" if it's ever found set.
    Add a comment noting that UNEVICTABLE_MLOCKFREED and
    unevictable_pgs_mlockfreed are now always 0.

    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michel Lespinasse
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • I think zone->present_pages should indicate the pages that the buddy
    system can manage. It should be:

    zone->present_pages = spanned pages - absent pages - bootmem pages,

    but is currently:

    zone->present_pages = spanned pages - absent pages - memmap pages.

    spanned pages: total size, including holes.
    absent pages: holes.
    bootmem pages: pages used during system boot, managed by the bootmem
    allocator.
    memmap pages: pages used by page structs.

    This can make zone->present_pages smaller than it should be. For example,
    if NUMA node 1 has ZONE_NORMAL and ZONE_MOVABLE, its memmap and other
    bootmem data will be allocated from ZONE_MOVABLE, so ZONE_NORMAL's
    present_pages should be spanned pages - absent pages; but
    free_area_init_core() also subtracts the memmap pages, which were
    actually allocated from ZONE_MOVABLE. When all memory of such a zone is
    offlined, zone->present_pages drops below 0, and because present_pages
    is an unsigned long it wraps to a very large integer. That in turn makes
    zone->watermark[WMARK_MIN] a huge value (setup_per_zone_wmarks()), which
    makes totalreserve_pages huge (calculate_totalreserve_pages()), and
    finally causes memory allocation failures when forking a process
    (__vm_enough_memory()).

    [root@localhost ~]# dmesg
    -bash: fork: Cannot allocate memory

    I think the bug described in

    http://marc.info/?l=linux-mm&m=134502182714186&w=2

    is also caused by wrong zone present pages.

    This patch fixes up zone->present_pages when memory is freed to the
    buddy system on the x86_64 and IA64 platforms.

    Signed-off-by: Jianguo Wu
    Signed-off-by: Jiang Liu
    Reported-by: Petr Tesarik
    Tested-by: Petr Tesarik
    Cc: "Luck, Tony"
    Cc: Mel Gorman
    Cc: Yinghai Lu
    Cc: Minchan Kim
    Cc: Johannes Weiner
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianguo Wu
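
    A standalone demonstration of the unsigned wrap-around described above (the
    numbers and the memmap_taken_from_other_zone name are purely illustrative):
    subtracting memmap pages that were never part of this zone can drive
    present_pages below zero, which an unsigned long turns into a huge value.

    #include <stdio.h>

    int main(void)
    {
        /* Toy zone: 1000 spanned pages, 100 holes, so 900 real pages. */
        unsigned long spanned = 1000, absent = 100;

        /* The memmap for this zone was actually allocated from another zone
         * (e.g. ZONE_MOVABLE), yet it is still subtracted here. */
        unsigned long memmap_taken_from_other_zone = 50;

        unsigned long present = spanned - absent - memmap_taken_from_other_zone;

        /* Offline all 900 real pages of the zone... */
        present -= 900;

        /* ...and present_pages wraps around instead of reaching 0, which
         * later inflates watermarks and totalreserve_pages. */
        printf("present_pages after offline: %lu\n", present);
        return 0;
    }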
     
  • __alloc_contig_migrate_alloc() can also be used by memory hotplug, so
    factor it out (move it and rename it to a common name) into
    page_isolation.c.

    [akpm@linux-foundation.org: checkpatch fixes]
    Signed-off-by: Minchan Kim
    Cc: Kamezawa Hiroyuki
    Reviewed-by: Yasuaki Ishimatsu
    Acked-by: Michal Nazarewicz
    Cc: Marek Szyprowski
    Cc: Wen Congyang
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Compaction caches if a pageblock was scanned and no pages were isolated so
    that the pageblocks can be skipped in the future to reduce scanning. This
    information is not cleared by the page allocator based on activity due to
    the impact it would have to the page allocator fast paths. Hence there is
    a requirement that something clear the cache or pageblocks will be skipped
    forever. Currently the cache is cleared if there were a number of recent
    allocation failures and it has not been cleared within the last 5 seconds.
    Time-based decisions like this are terrible as they have no relationship
    to VM activity and are basically a big hammer.

    Unfortunately, accurate heuristics would add cost to some hot paths so
    this patch implements a rough heuristic. There are two cases where the
    cache is cleared.

    1. If a !kswapd process completes a compaction cycle (migrate and free
    scanner meet), the zone is marked compact_blockskip_flush. When kswapd
    goes to sleep, it will clear the cache. This is expected to be the
    common case where the cache is cleared. It does not really matter if
    kswapd happens to be asleep or going to sleep when the flag is set as
    it will be woken on the next allocation request.

    2. If there have been multiple failures recently and compaction just
    finished being deferred then a process will clear the cache and start a
    full scan. This situation happens if there are multiple high-order
    allocation requests under heavy memory pressure.

    The clearing of the PG_migrate_skip bits and other scans is inherently
    racy but the race is harmless. For allocations that can fail such as THP,
    they will simply fail. For requests that cannot fail, they will retry the
    allocation. Tests indicated that scanning rates were roughly similar to
    when the time-based heuristic was used and the allocation success rates
    were similar.

    Signed-off-by: Mel Gorman
    Cc: Rik van Riel
    Cc: Richard Davies
    Cc: Shaohua Li
    Cc: Avi Kivity
    Cc: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When compaction was implemented it was known that scanning could
    potentially be excessive. The ideal was that a counter be maintained for
    each pageblock but maintaining this information would incur a severe
    penalty due to a shared writable cache line. It has reached the point
    where the scanning costs are a serious problem, particularly on
    long-lived systems where a large process starts and allocates a large
    number of THPs at the same time.

    Instead of using a shared counter, this patch adds another bit to the
    pageblock flags called PG_migrate_skip. If a pageblock is scanned by
    either migrate or free scanner and 0 pages were isolated, the pageblock is
    marked to be skipped in the future. When scanning, this bit is checked
    before any scanning takes place and the block skipped if set.

    The main difficulty with a patch like this is "when to ignore the cached
    information?" If it's ignored too often, the scanning rates will still be
    excessive. If the information is too stale then allocations will fail
    that might have otherwise succeeded. In this patch

    o CMA always ignores the information
    o If the migrate and free scanner meet then the cached information will
    be discarded if it's at least 5 seconds since the last time the cache
    was discarded
    o If there are a large number of allocation failures, discard the cache.

    The time-based heuristic is very clumsy but there are few choices for a
    better event. Depending solely on multiple allocation failures still
    allows excessive scanning when THP allocations are failing in quick
    succession due to memory pressure. Waiting until memory pressure is
    relieved would cause compaction to continually fail instead of using
    reclaim/compaction to try to allocate the page. The time-based mechanism
    is clumsy, but a better option is not obvious.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Richard Davies
    Cc: Shaohua Li
    Cc: Avi Kivity
    Acked-by: Rafael Aquini
    Cc: Fengguang Wu
    Cc: Michal Nazarewicz
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Kyungmin Park
    Cc: Mark Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
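
    A condensed user-space model of the skip-bit idea; the bitmap, block count
    and helper names below are invented stand-ins, not the kernel's pageblock
    flags API. One bit per pageblock is set when a scan isolates nothing,
    checked before any scanning takes place, and cleared wholesale when the
    cached information is reset (the subject of the previous entry).

    #include <stdio.h>
    #include <string.h>
    #include <stdbool.h>

    #define NR_PAGEBLOCKS 8

    /* One "skip" bit per pageblock, analogous to PG_migrate_skip. */
    static unsigned char skip_bits[NR_PAGEBLOCKS];

    static bool pageblock_skip(int pb)      { return skip_bits[pb]; }
    static void set_pageblock_skip(int pb)  { skip_bits[pb] = 1; }
    static void clear_all_skip_bits(void)   { memset(skip_bits, 0, sizeof(skip_bits)); }

    /* Pretend scanner: returns how many pages it isolated in this pageblock. */
    static int isolate_pages_in_block(int pb)
    {
        return (pb % 3 == 0) ? 4 : 0;       /* arbitrary toy behaviour */
    }

    static void scan_zone(void)
    {
        for (int pb = 0; pb < NR_PAGEBLOCKS; pb++) {
            if (pageblock_skip(pb)) {
                printf("pageblock %d: skipped (cached)\n", pb);
                continue;
            }
            int nr = isolate_pages_in_block(pb);
            if (nr == 0)
                set_pageblock_skip(pb);     /* nothing here, skip next time */
            printf("pageblock %d: isolated %d\n", pb, nr);
        }
    }

    int main(void)
    {
        scan_zone();                 /* first pass populates the skip bits */
        scan_zone();                 /* second pass skips the empty blocks */
        clear_all_skip_bits();       /* e.g. when the cache is flushed */
        return 0;
    }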
     
  • This reverts commit 7db8889ab05b ("mm: have order > 0 compaction start
    off where it left") and commit de74f1cc ("mm: have order > 0 compaction
    start near a pageblock with free pages"). These patches were a good
    idea and tests confirmed that they massively reduced the amount of
    scanning but the implementation is complex and tricky to understand. A
    later patch will cache which pageblocks should be skipped and reimplement
    the concept of compact_cached_free_pfn on top of it for both the migration
    and free scanners.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Richard Davies
    Cc: Shaohua Li
    Cc: Avi Kivity
    Acked-by: Rafael Aquini
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Commit 0ee332c14518 ("memblock: Kill early_node_map[]") removed
    early_node_map[]. Clean up the comments to comply with that change.

    Signed-off-by: Wanpeng Li
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Cc: Gavin Shan
    Cc: Yinghai Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • If a race between allocation and isolation happens during memory-hotplug
    offline, some pages could end up on the MIGRATE_MOVABLE free_list even
    though the migratetype of the page's pageblock is MIGRATE_ISOLATE.

    The race can be detected by get_freepage_migratetype() in
    __test_page_isolated_in_pageblock(). Currently, when it is detected,
    EBUSY gets bubbled all the way up and the hotplug operation fails.

    A better idea is, instead of returning and failing memory hot-remove, to
    move the free page to the correct list at the time the race is detected.
    This improves the memory hot-remove success ratio, although the race is
    really rare.

    Suggested by Mel Gorman.

    [akpm@linux-foundation.org: small cleanup]
    Signed-off-by: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Reviewed-by: Yasuaki Ishimatsu
    Acked-by: Mel Gorman
    Cc: Xishi Qiu
    Cc: Wen Congyang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • The page allocator caches the pageblock information in page->private
    while the page is on the PCP freelists, but this is overwritten with the
    order of the page when it is freed to the buddy allocator. This patch
    stores the migratetype of the page in the page->index field instead, so
    that it is available at all times while the page remains on a free_list.

    This patch adds a new call site in __free_pages_ok(), which adds a little
    overhead, but that path is only taken for high-order allocations, so I
    believe the cost is negligible.

    Signed-off-by: Minchan Kim
    Acked-by: KAMEZAWA Hiroyuki
    Reviewed-by: Yasuaki Ishimatsu
    Acked-by: Mel Gorman
    Cc: Xishi Qiu
    Cc: Wen Congyang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • The page allocator uses set_page_private() and page_private() for
    handling the migratetype when it frees a page. Let's replace them with
    [set|get]_freepage_migratetype() to make the intent clearer.

    Signed-off-by: Minchan Kim
    Acked-by: KAMEZAWA Hiroyuki
    Reviewed-by: Yasuaki Ishimatsu
    Acked-by: Mel Gorman
    Cc: Xishi Qiu
    Cc: Wen Congyang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
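
    A sketch of what such accessors could look like, assuming the migratetype
    is kept in page->index as described in the previous entry; the struct
    below is a stripped-down stand-in, not the kernel's struct page.

    #include <stdio.h>

    /* Minimal stand-in for struct page; only the field we care about. */
    struct page {
        unsigned long index;   /* reused to hold the migratetype of a free page */
    };

    static inline void set_freepage_migratetype(struct page *page, int migratetype)
    {
        page->index = migratetype;
    }

    static inline int get_freepage_migratetype(struct page *page)
    {
        return page->index;
    }

    int main(void)
    {
        struct page p;
        set_freepage_migratetype(&p, 2);        /* e.g. MIGRATE_MOVABLE */
        printf("migratetype = %d\n", get_freepage_migratetype(&p));
        return 0;
    }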
     
  • * Add ALLOC_CMA alloc flag and pass it to [__]zone_watermark_ok()
    (from Minchan Kim).

    * During the watermark check, decrease the number of available free
    pages by the number of free CMA pages if necessary (unmovable
    allocations cannot use pages from CMA areas).

    Signed-off-by: Bartlomiej Zolnierkiewicz
    Signed-off-by: Kyungmin Park
    Cc: Marek Szyprowski
    Cc: Michal Nazarewicz
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bartlomiej Zolnierkiewicz
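
    A hedged sketch of the watermark adjustment; the flag value and helper
    below are illustrative, not the kernel's. When the caller cannot use CMA
    pageblocks, the free CMA pages are subtracted before the comparison so
    they don't make the zone look healthier than it really is for that caller.

    #include <stdio.h>
    #include <stdbool.h>

    #define ALLOC_CMA 0x1   /* illustrative flag: allocation may use CMA areas */

    static bool zone_watermark_ok(long free_pages, long free_cma_pages,
                                  long watermark, unsigned int alloc_flags)
    {
        /* Unmovable allocations cannot use CMA pages, so don't count them. */
        if (!(alloc_flags & ALLOC_CMA))
            free_pages -= free_cma_pages;

        return free_pages >= watermark;
    }

    int main(void)
    {
        long free = 1200, free_cma = 800, wmark = 1000;

        printf("movable (ALLOC_CMA): %s\n",
               zone_watermark_ok(free, free_cma, wmark, ALLOC_CMA) ? "ok" : "fail");
        printf("unmovable (no CMA):  %s\n",
               zone_watermark_ok(free, free_cma, wmark, 0) ? "ok" : "fail");
        return 0;
    }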
     
  • Add NR_FREE_CMA_PAGES counter to be later used for checking watermark in
    __zone_watermark_ok(). For simplicity and to avoid #ifdef hell make this
    counter always available (not only when CONFIG_CMA=y).

    [akpm@linux-foundation.org: use conventional migratetype naming]
    Signed-off-by: Bartlomiej Zolnierkiewicz
    Signed-off-by: Kyungmin Park
    Cc: Marek Szyprowski
    Cc: Michal Nazarewicz
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bartlomiej Zolnierkiewicz
     
  • Isolated free pages shouldn't be accounted to NR_FREE_PAGES counter. Fix
    it by properly decreasing/increasing NR_FREE_PAGES counter in
    set_migratetype_isolate()/unset_migratetype_isolate() and removing counter
    adjustment for isolated pages from free_one_page() and split_free_page().

    Signed-off-by: Bartlomiej Zolnierkiewicz
    Signed-off-by: Kyungmin Park
    Cc: Marek Szyprowski
    Cc: Michal Nazarewicz
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bartlomiej Zolnierkiewicz
     
  • page->private gets reused in __free_one_page() to store the page order
    (so trace_mm_page_pcpu_drain() may print the order instead of the
    migratetype); thus the migratetype value must be cached locally.

    This fixes a regression introduced in commit a7016235a61d ("mm: fix
    migratetype bug which slowed swapping"), which caused incorrect data to
    be attached to the mm_page_pcpu_drain trace event.

    [akpm@linux-foundation.org: add comment]
    Cc: Marek Szyprowski
    Cc: Michal Nazarewicz
    Acked-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Hugh Dickins
    Signed-off-by: Bartlomiej Zolnierkiewicz
    Signed-off-by: Kyungmin Park
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bartlomiej Zolnierkiewicz
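
    A tiny illustration of the pattern the fix relies on (the names below are
    placeholders): because the same field is reused to hold the order once the
    page enters the buddy allocator, the migratetype has to be read into a
    local variable first and the cached copy used afterwards, e.g. for tracing.

    #include <stdio.h>

    struct fake_page {
        unsigned long private;   /* holds migratetype, later reused for order */
    };

    /* Stand-in for __free_one_page(): reuses 'private' to store the order. */
    static void free_one_page(struct fake_page *page, unsigned int order)
    {
        page->private = order;
    }

    int main(void)
    {
        struct fake_page page = { .private = 2 };     /* 2 == some migratetype */

        int migratetype = (int)page.private;          /* cache before the reuse */
        free_one_page(&page, 9);                      /* 'private' now holds 9 */

        /* Tracing with the cached value reports the migratetype, not the order. */
        printf("trace: migratetype=%d (field now holds order %lu)\n",
               migratetype, page.private);
        return 0;
    }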
     
  • Drop clean cache pages instead of migrating them during
    alloc_contig_range() to minimise allocation latency by reducing the
    amount of migration that is necessary. This is useful for CMA because
    migration latency matters more than evicting a background process's
    working set. In addition, as pages are reclaimed, fewer free pages are
    needed as migration targets, so reclaim to obtain free pages is avoided,
    which is a contributory factor to increased latency.

    I measured the elapsed time of __alloc_contig_migrate_range(), which
    migrates 10M in a 40M movable zone on a QEMU machine.

    Before - 146ms, After - 7ms

    [akpm@linux-foundation.org: fix nommu build]
    Signed-off-by: Mel Gorman
    Signed-off-by: Minchan Kim
    Reviewed-by: Mel Gorman
    Cc: Marek Szyprowski
    Acked-by: Michal Nazarewicz
    Cc: Rik van Riel
    Tested-by: Kyungmin Park
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
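
    A schematic of the flow described above, with hypothetical helpers standing
    in for the real ones: clean file pages collected for the contiguous range
    are simply dropped, and only the remainder is handed to migration, with the
    migration count reduced accordingly (the accounting that the
    reclaim_clean_pages_from_list() entry near the top of this list keeps
    consistent).

    #include <stdio.h>

    struct compact_control {
        unsigned long nr_migratepages;   /* pages queued for migration */
    };

    /* Pretend reclaim: returns how many clean, unmapped cache pages it dropped. */
    static unsigned long reclaim_clean_pages_from_list(unsigned long queued)
    {
        return queued / 2;               /* toy: half of the list was clean */
    }

    static void alloc_contig_migrate_range(struct compact_control *cc)
    {
        unsigned long reclaimed = reclaim_clean_pages_from_list(cc->nr_migratepages);

        /* Dropped pages no longer need migration targets or migration work. */
        cc->nr_migratepages -= reclaimed;

        printf("dropped %lu clean pages, migrating the remaining %lu\n",
               reclaimed, cc->nr_migratepages);
        /* ... migrate_pages() would run on the shortened list here ... */
    }

    int main(void)
    {
        struct compact_control cc = { .nr_migratepages = 2560 };  /* 10M of 4K pages */
        alloc_contig_migrate_range(&cc);
        return 0;
    }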
     
  • While compaction is migrating pages to free up large contiguous blocks
    for allocation it races with other allocation requests that may steal
    these blocks or break them up. This patch alters direct compaction to
    capture a suitable free page as soon as it becomes available to reduce
    this race. It uses similar logic to split_free_page() to ensure that
    watermarks are still obeyed.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When transparent huge pages were introduced, memory compaction and swap
    storms were an issue, and the kernel had to be careful to not make THP
    allocations cause pageout or compaction.

    Now that we have working compaction deferral, kswapd is smart enough to
    invoke compaction, and the quadratic behaviour around isolate_free_pages
    has been fixed, it should be safe to remove __GFP_NO_KSWAPD.

    [minchan@kernel.org: Comment fix]
    [mgorman@suse.de: Avoid direct reclaim for deferred compaction]
    Cc: Andrea Arcangeli
    Signed-off-by: Rik van Riel
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

18 Sep, 2012

1 commit

  • The heuristic for the buddy allocator was introduced in commit
    43506fad21ca ("mm/page_alloc.c: simplify calculation of combined index
    of adjacent buddy lists"), but the page address of the higher page's
    buddy was calculated incorrectly, which makes page_is_buddy() fail
    forever. In other words, the wrong buddy address effectively disables
    the heuristic.

    The page address of the higher page's buddy should be calculated from
    higher_page using the offset between the index of the higher page and
    the index of its buddy.

    Signed-off-by: Haifeng Li
    Signed-off-by: Gavin Shan
    Reviewed-by: Michal Hocko
    Cc: KyongHo Cho
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Johannes Weiner
    Cc: [2.6.38+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Haifeng
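
    A standalone sketch of the index arithmetic involved, using the usual buddy
    relation index ^ (1 << order); the variable names mirror the description
    above, but the code is illustrative, not the kernel source. The point is
    that the higher-order buddy's location must be derived from higher_page
    plus the offset between the combined index and that buddy's index.

    #include <stdio.h>

    /* Buddy of a block at 'idx' with the given order: flip the order bit. */
    static unsigned long find_buddy_index(unsigned long idx, unsigned int order)
    {
        return idx ^ (1UL << order);
    }

    int main(void)
    {
        unsigned int order = 2;
        unsigned long page_idx = 12;                          /* block being freed */
        unsigned long buddy_idx = find_buddy_index(page_idx, order);         /* 8 */
        unsigned long combined_idx = buddy_idx & page_idx;                   /* 8 */

        /* The merged (higher-order) block starts at combined_idx... */
        unsigned long higher_page_idx = combined_idx;
        /* ...and its buddy at the next order is found relative to it. */
        unsigned long higher_buddy_idx = find_buddy_index(combined_idx, order + 1);

        /* The buddy's page address must be computed from higher_page, i.e.
         * higher_page + (higher_buddy_idx - combined_idx); basing it on the
         * original page makes page_is_buddy() check the wrong page. */
        long offset = (long)higher_buddy_idx - (long)combined_idx;
        printf("page_idx=%lu buddy_idx=%lu combined=%lu higher_page=%lu "
               "higher_buddy=%lu (offset %+ld)\n",
               page_idx, buddy_idx, combined_idx, higher_page_idx,
               higher_buddy_idx, offset);
        return 0;
    }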
     

22 Aug, 2012

2 commits

  • Jim Schutt reported a problem that pointed at compaction contending
    heavily on locks. The workload is straightforward and, in his own words:

    The systems in question have 24 SAS drives spread across 3 HBAs,
    running 24 Ceph OSD instances, one per drive. FWIW these servers
    are dual-socket Intel 5675 Xeons w/48 GB memory. I've got ~160
    Ceph Linux clients doing dd simultaneously to a Ceph file system
    backed by 12 of these servers.

    Early in the test everything looks fine

    procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
    r b swpd free buff cache si so bi bo in cs us sy id wa st
    31 15 0 287216 576 38606628 0 0 2 1158 2 14 1 3 95 0 0
    27 15 0 225288 576 38583384 0 0 18 2222016 203357 134876 11 56 17 15 0
    28 17 0 219256 576 38544736 0 0 11 2305932 203141 146296 11 49 23 17 0
    6 18 0 215596 576 38552872 0 0 7 2363207 215264 166502 12 45 22 20 0
    22 18 0 226984 576 38596404 0 0 3 2445741 223114 179527 12 43 23 22 0

    and then it goes to pot

    procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
    r b swpd free buff cache si so bi bo in cs us sy id wa st
    163 8 0 464308 576 36791368 0 0 11 22210 866 536 3 13 79 4 0
    207 14 0 917752 576 36181928 0 0 712 1345376 134598 47367 7 90 1 2 0
    123 12 0 685516 576 36296148 0 0 429 1386615 158494 60077 8 84 5 3 0
    123 12 0 598572 576 36333728 0 0 1107 1233281 147542 62351 7 84 5 4 0
    622 7 0 660768 576 36118264 0 0 557 1345548 151394 59353 7 85 4 3 0
    223 11 0 283960 576 36463868 0 0 46 1107160 121846 33006 6 93 1 1 0

    Note that system CPU usage is very high while the rate of blocks being
    written out has dropped by 42%. He analysed this with perf and found:

    perf record -g -a sleep 10
    perf report --sort symbol --call-graph fractal,5
    34.63% [k] _raw_spin_lock_irqsave
    |
    |--97.30%-- isolate_freepages
    | compaction_alloc
    | unmap_and_move
    | migrate_pages
    | compact_zone
    | compact_zone_order
    | try_to_compact_pages
    | __alloc_pages_direct_compact
    | __alloc_pages_slowpath
    | __alloc_pages_nodemask
    | alloc_pages_vma
    | do_huge_pmd_anonymous_page
    | handle_mm_fault
    | do_page_fault
    | page_fault
    | |
    | |--87.39%-- skb_copy_datagram_iovec
    | | tcp_recvmsg
    | | inet_recvmsg
    | | sock_recvmsg
    | | sys_recvfrom
    | | system_call
    | | __recv
    | | |
    | | --100.00%-- (nil)
    | |
    | --12.61%-- memcpy
    --2.70%-- [...]

    There was other data but primarily it is all showing that compaction is
    contended heavily on the zone->lock and zone->lru_lock.

    commit [b2eef8c0: mm: compaction: minimise the time IRQs are disabled
    while isolating pages for migration] noted that it was possible for
    migration to hold the lru_lock for an excessive amount of time. Very
    broadly speaking this patch expands the concept.

    This patch introduces compact_checklock_irqsave() to check if a lock
    is contended or the process needs to be scheduled. If either condition
    is true then async compaction is aborted and the caller is informed.
    The page allocator will fail a THP allocation if compaction failed due
    to contention. This patch also introduces compact_trylock_irqsave()
    which will acquire the lock only if it is not contended and the process
    does not need to schedule.

    Reported-by: Jim Schutt
    Tested-by: Jim Schutt
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
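
    A rough user-space analogue of the contention check using POSIX mutexes
    (pthread_mutex_trylock stands in for the kernel's contended-lock test, and
    the function name is invented): asynchronous compaction gives up rather
    than spinning, while synchronous compaction is still allowed to block.

    #include <stdio.h>
    #include <stdbool.h>
    #include <pthread.h>

    /* Try to take the lock. In async mode a contended lock aborts compaction
     * instead of waiting; *contended tells the caller why we bailed out. */
    static bool compact_checklock(pthread_mutex_t *lock, bool sync, bool *contended)
    {
        if (pthread_mutex_trylock(lock) == 0)
            return true;              /* uncontended: proceed */

        if (!sync) {
            *contended = true;        /* caller may fail a THP allocation */
            return false;             /* abort async compaction */
        }

        pthread_mutex_lock(lock);     /* sync compaction may block */
        return true;
    }

    int main(void)
    {
        pthread_mutex_t lru_lock = PTHREAD_MUTEX_INITIALIZER;
        bool contended = false;

        if (compact_checklock(&lru_lock, false, &contended)) {
            printf("async compaction: got the lock, scanning\n");
            pthread_mutex_unlock(&lru_lock);
        } else {
            printf("async compaction: aborted, contended=%d\n", contended);
        }
        return 0;
    }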
     
  • Commit cfd19c5a9ecf ("mm: only set page->pfmemalloc when
    ALLOC_NO_WATERMARKS was used") tried to narrow down page->pfmemalloc
    setting, but it missed some places where pfmemalloc should be set.

    So, in __slab_alloc, the mismatch between pfmemalloc and
    ALLOC_NO_WATERMARKS causes incorrect deactivate_slab() calls on our
    Core2 server:

    64.73% fio [kernel.kallsyms] [k] _raw_spin_lock
    |
    --- _raw_spin_lock
    |
    |---0.34%-- deactivate_slab
    | __slab_alloc
    | kmem_cache_alloc
    | |

    That causes a 40% regression in our fio sync write performance.

    Move the check into get_page_from_freelist(), which resolves this issue.

    Signed-off-by: Alex Shi
    Acked-by: Mel Gorman
    Cc: David Miller
    Tested-by: Eric Dumazet
    Tested-by: Sage Weil
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Shi
     

03 Aug, 2012

1 commit

  • Borislav Petkov reports that the new warning added in commit
    88fdf75d1bb5 ("mm: warn if pg_data_t isn't initialized with zero")
    triggers for him, and it is the node_start_pfn field that has already
    been initialized once.

    The call trace looks like this:

    x86_64_start_kernel ->
    x86_64_start_reservations ->
    start_kernel ->
    setup_arch ->
    paging_init ->
    zone_sizes_init ->
    free_area_init_nodes ->
    free_area_init_node

    and (with the warning replaced by debug output), Borislav sees

    On node 0 totalpages: 4193848
    DMA zone: 64 pages used for memmap
    DMA zone: 6 pages reserved
    DMA zone: 3890 pages, LIFO batch:0
    DMA32 zone: 16320 pages used for memmap
    DMA32 zone: 798464 pages, LIFO batch:31
    Normal zone: 52736 pages used for memmap
    Normal zone: 3322368 pages, LIFO batch:31
    free_area_init_node: pgdat->node_start_pfn: 4423680 node_start_pfn: 8617984 node_start_pfn: 12812288
    Cc: Minchan Kim
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

01 Aug, 2012

13 commits

  • pg_data_t is zeroed before reaching free_area_init_core(), so remove the
    now unnecessary initializations.

    Signed-off-by: Minchan Kim
    Cc: Tejun Heo
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Warn if memory-hotplug/boot code doesn't initialize pg_data_t with zero
    when it is allocated. Arch code and memory hotplug already initialize
    pg_data_t, so this warning should never trigger. I selected fields near
    the beginning, middle and end of pg_data_t for the check.

    This patch isn't about performance; it is about removing the
    initialization code that otherwise has to be added whenever we add a new
    field to pg_data_t or zone.

    Originally, Andrew suggested clearing pg_data_t in the MM core, but
    Tejun doesn't like that because in the future some arches may initialize
    some fields in arch code and pass them into the generic MM code, so
    blindly clearing it in the MM core would be very annoying.

    Signed-off-by: Minchan Kim
    Cc: Tejun Heo
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • If swap is backed by network storage such as NBD, there is a risk that a
    large number of reclaimers can hang the system by consuming all
    PF_MEMALLOC reserves. To avoid these hangs, the administrator must tune
    min_free_kbytes in advance, which is a bit fragile.

    This patch throttles direct reclaimers if half the PF_MEMALLOC reserves
    are in use. If the system is routinely getting throttled the system
    administrator can increase min_free_kbytes so degradation is smoother but
    the system will keep running.

    Signed-off-by: Mel Gorman
    Cc: David Miller
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The reserve is proportionally distributed over all !highmem zones in the
    system. So we need to allow an emergency allocation access to all zones.
    In order to do that we need to break out of any mempolicy boundaries we
    might have.

    In my opinion that does not break mempolicies as those are user oriented
    and not system oriented. That is, system allocations are not guaranteed
    to be within mempolicy boundaries. For instance IRQs do not even have a
    mempolicy.

    So breaking out of mempolicy boundaries for 'rare' emergency allocations,
    which are always system allocations (as opposed to user) is ok.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Cc: David Miller
    Cc: Neil Brown
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • __alloc_pages_slowpath() is called when the number of free pages is below
    the low watermark. If the caller is entitled to use ALLOC_NO_WATERMARKS
    then the page will be marked page->pfmemalloc. This protects more pages
    than are strictly necessary as we only need to protect pages allocated
    below the min watermark (the pfmemalloc reserves).

    This patch only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was
    required to allocate the page.

    [rientjes@google.com: David noticed the problem during review]
    Signed-off-by: Mel Gorman
    Cc: David Miller
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This is needed to allow network softirq packet processing to make use of
    PF_MEMALLOC.

    Currently softirq context cannot use PF_MEMALLOC because it is not
    associated with a task and therefore has no task flags to fiddle with;
    thus the gfp-to-alloc-flags mapping ignores the task flags when in
    interrupt (hard or soft) context.

    Allowing softirqs to make use of PF_MEMALLOC therefore requires some
    trickery. This patch borrows the task flags from whatever process
    happens to be preempted by the softirq. It then modifies the
    gfp-to-alloc-flags mapping so that it no longer excludes task flags in
    softirq context, and modifies the softirq code to save, clear and
    restore the PF_MEMALLOC flag.

    The save and clear ensure that the preempted task's PF_MEMALLOC flag
    doesn't leak into the softirq. The restore ensures a softirq's
    PF_MEMALLOC flag cannot leak back into the preempted process. This
    should be safe for the following reasons:

    Softirqs can run on multiple CPUs, but the same task should not be
    executing the same softirq code. Neither should one softirq
    handler be preempted by another softirq handler, so the flags
    should not leak to an unrelated softirq.

    Softirqs re-enable hardware interrupts in __do_softirq(), so they can be
    preempted by hardware interrupts and PF_MEMALLOC is then inherited
    by the hard IRQ. However, this is similar to a process in
    reclaim being preempted by a hardirq. While PF_MEMALLOC is
    set, gfp_to_alloc_flags() distinguishes between hard and
    soft IRQs and avoids giving a hardirq the ALLOC_NO_WATERMARKS
    flag.

    If the softirq is deferred to ksoftirqd, then its flags may be used
    instead of a normal task's, but as the softirq cannot be preempted,
    the PF_MEMALLOC flag does not leak to other code by accident.

    [davem@davemloft.net: Document why PF_MEMALLOC is safe]
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Cc: David Miller
    Cc: Neil Brown
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
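
    A minimal sketch of the save/clear/restore dance on a plain flags word;
    the PF_MEMALLOC bit value and the helpers below are illustrative, not the
    kernel's. The preempted task's bit is saved and cleared before the softirq
    runs and put back exactly as it was afterwards, so it can leak in neither
    direction.

    #include <stdio.h>

    #define PF_MEMALLOC 0x0800       /* illustrative bit value */

    static unsigned long current_flags;      /* stands in for current->flags */

    static void run_softirq(void)
    {
        /* The softirq may temporarily set PF_MEMALLOC for its own allocations;
         * whatever it does here must not survive the restore below. */
        current_flags |= PF_MEMALLOC;
    }

    static void do_softirq(void)
    {
        /* Save the preempted task's PF_MEMALLOC and clear it for the softirq. */
        unsigned long saved = current_flags & PF_MEMALLOC;
        current_flags &= ~PF_MEMALLOC;

        run_softirq();

        /* Restore: drop whatever the softirq left, put back the saved bit. */
        current_flags = (current_flags & ~PF_MEMALLOC) | saved;
    }

    int main(void)
    {
        current_flags = PF_MEMALLOC;         /* preempted task was in reclaim */
        do_softirq();
        printf("PF_MEMALLOC preserved: %s\n",
               (current_flags & PF_MEMALLOC) ? "yes" : "no");

        current_flags = 0;                   /* preempted task was not in reclaim */
        do_softirq();
        printf("PF_MEMALLOC leaked:    %s\n",
               (current_flags & PF_MEMALLOC) ? "yes" : "no");
        return 0;
    }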
     
  • __GFP_MEMALLOC will allow the allocation to disregard the watermarks, much
    like PF_MEMALLOC. It allows one to pass along the memalloc state in
    object related allocation flags as opposed to task related flags, such as
    sk->sk_allocation. This removes the need for ALLOC_PFMEMALLOC as callers
    using __GFP_MEMALLOC can get the ALLOC_NO_WATERMARK flag which is now
    enough to identify allocations related to page reclaim.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Cc: David Miller
    Cc: Neil Brown
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When a user or administrator requires swap for their application, they
    create a swap partition or file, format it with mkswap and activate it
    with swapon. Swap over the network is considered as an option in
    diskless systems. The two likely scenarios are blade servers used as
    part of a cluster, where the form factor or maintenance costs do not
    allow the use of disks, and thin clients.

    The Linux Terminal Server Project recommends the use of the Network Block
    Device (NBD) for swap according to the manual at
    https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download
    There is also documentation and tutorials on how to setup swap over NBD at
    places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The
    nbd-client also documents the use of NBD as swap. Despite this, the fact
    is that a machine using NBD for swap can deadlock within minutes if swap
    is used intensively. This patch series addresses the problem.

    The core issue is that network block devices do not use mempools like
    normal block devices do. As the host cannot control where they receive
    packets from, they cannot reliably work out in advance how much memory
    they might need. Some years ago, Peter Zijlstra developed a series of
    patches that supported swap over an NFS that at least one distribution is
    carrying within their kernels. This patch series borrows very heavily
    from Peter's work to support swapping over NBD as a pre-requisite to
    supporting swap-over-NFS. The bulk of the complexity is concerned with
    preserving memory that is allocated from the PFMEMALLOC reserves for use
    by the network layer which is needed for both NBD and NFS.

    Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to
    preserve access to pages allocated under low memory situations
    to callers that are freeing memory.

    Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks

    Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC
    reserves without setting PFMEMALLOC.

    Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves
    for later use by network packet processing.

    Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required

    Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set.

    Patches 7-12 allows network processing to use PFMEMALLOC reserves when
    the socket has been marked as being used by the VM to clean pages. If
    packets are received and stored in pages that were allocated under
    low-memory situations and are unrelated to the VM, the packets
    are dropped.

    Patch 11 reintroduces __skb_alloc_page which the networking
    folk may object to but is needed in some cases to propagate
    pfmemalloc from a newly allocated page to an skb. If there is a
    strong objection, this patch can be dropped with the impact being
    that swap-over-network will be slower in some cases but it should
    not fail.

    Patch 13 is a micro-optimisation to avoid a function call in the
    common case.

    Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use
    PFMEMALLOC if necessary.

    Patch 15 notes that it is still possible for the PFMEMALLOC reserve
    to be depleted. To prevent this, direct reclaimers get throttled on
    a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is
    expected that kswapd and the direct reclaimers already running
    will clean enough pages for the low watermark to be reached and
    the throttled processes are woken up.

    Patch 16 adds a statistic to track how often processes get throttled

    Some basic performance testing was run using kernel builds, netperf on
    loopback for UDP and TCP, hackbench (pipes and sockets), iozone and
    sysbench. Each of them was expected to use the sl*b allocators
    reasonably heavily, but there did not appear to be significant
    performance variances.

    For testing swap-over-NBD, a machine was booted with 2G of RAM with a
    swapfile backed by NBD. 8*NUM_CPU processes were started that create
    anonymous memory mappings and read them linearly in a loop. The total
    size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under
    memory pressure.

    Without the patches and using SLUB, the machine locks up within minutes;
    with them applied it runs to completion. With SLAB the story is
    different, as an unpatched kernel runs to completion. However, the
    patched kernel completed the test 45% faster.

    MICRO
                                              3.5.0-rc2   3.5.0-rc2
                                                vanilla     swapnbd
    Unrecognised test vmscan-anon-mmap-write
    MMTests Statistics: duration
    Sys Time Running Test (seconds)              197.80      173.07
    User+Sys Time Running Test (seconds)         206.96      182.03
    Total Elapsed Time (seconds)                3240.70     1762.09

    This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages

    Allocations of pages below the min watermark run a risk of the machine
    hanging due to a lack of memory. To prevent this, only callers who have
    PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are
    allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to
    a slab though, nothing prevents other callers consuming free objects
    within those slabs. This patch limits access to slab pages that were
    alloced from the PFMEMALLOC reserves.

    When this patch is applied, pages allocated from below the low watermark
    are returned with page->pfmemalloc set and it is up to the caller to
    determine how the page should be protected. SLAB restricts access to any
    page with page->pfmemalloc set to callers which are known to be able to
    access the PFMEMALLOC reserve. If one is not available, an attempt is
    made to allocate a new page rather than use a reserve. SLUB is a bit more
    relaxed in that it only records if the current per-CPU page was allocated
    from PFMEMALLOC reserve and uses another partial slab if the caller does
    not have the necessary GFP or process flags. This was found to be
    sufficient in tests to avoid hangs due to SLUB generally maintaining
    smaller lists than SLAB.

    In low-memory conditions it does mean that !PFMEMALLOC allocators can fail
    a slab allocation even though free objects are available because they are
    being preserved for callers that are freeing pages.

    [a.p.zijlstra@chello.nl: Original implementation]
    [sebastian@breakpoint.cc: Correct order of page flag clearing]
    Signed-off-by: Mel Gorman
    Cc: David Miller
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When hotplug offlining happens on zone A, it starts marking freed pages
    as MIGRATE_ISOLATE in the buddy allocator to prevent further allocation.
    (MIGRATE_ISOLATE is an ironic type: the pages appear to be in the buddy
    allocator, but we can't allocate them.)

    When a memory shortage happens during hotplug offlining, the current
    task starts to reclaim and then wakes up kswapd. Kswapd checks the
    watermark and then goes to sleep, because the current
    zone_watermark_ok_safe() doesn't take the MIGRATE_ISOLATE freed page
    count into account. The current task continues to reclaim in the direct
    reclaim path without kswapd's help. The problem is that
    zone->all_unreclaimable is set only by kswapd, so the current task ends
    up looping forever like below.

    __alloc_pages_slowpath
    restart:
    wake_all_kswapd
    rebalance:
    __alloc_pages_direct_reclaim
    do_try_to_free_pages
    if global_reclaim && !all_unreclaimable
    return 1; /* It means we did did_some_progress */
    skip __alloc_pages_may_oom
    should_alloc_retry
    goto rebalance;

    If we apply KOSAKI's patch[1], which doesn't depend on kswapd for
    setting zone->all_unreclaimable, we can solve this problem by killing
    some task in the direct reclaim path. But it still doesn't wake up
    kswapd, which could remain a problem if another subsystem needs a
    GFP_ATOMIC request. So kswapd should take MIGRATE_ISOLATE into account
    when it calculates free pages BEFORE going to sleep.

    This patch counts the number of MIGRATE_ISOLATE pageblocks, and
    zone_watermark_ok_safe() takes them into account if the system has such
    blocks (fortunately this is very rare, so there is no overhead concern
    and kswapd is never a hot path).

    Copied/modified from Mel's quote:
    "
    The ideal solution would be "allocating" the pageblock.
    It would keep the free space accounting as it is, but historically
    memory hotplug didn't allocate pages because it would be difficult to
    detect whether a pageblock was isolated or part of some balloon.
    Allocating just full pageblocks would work around this; however,
    it would play very badly with CMA.
    "

    [1] http://lkml.org/lkml/2012/6/14/74

    [akpm@linux-foundation.org: simplify nr_zone_isolate_freepages(), rework zone_watermark_ok_safe() comment, simplify set_pageblock_isolate() and restore_pageblock_isolate()]
    [akpm@linux-foundation.org: fix CONFIG_MEMORY_ISOLATION=n build]
    Signed-off-by: Minchan Kim
    Suggested-by: KOSAKI Motohiro
    Tested-by: Aaditya Kumar
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
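
    A small sketch of the adjusted check; the pageblock size, counts and
    function shape below are invented for illustration. Free pages sitting in
    isolated pageblocks are subtracted before the comparison, so kswapd does
    not conclude the zone is balanced and go to sleep while an offline
    operation is holding those pages.

    #include <stdio.h>
    #include <stdbool.h>

    #define PAGEBLOCK_NR_PAGES 512   /* illustrative pageblock size in pages */

    static bool zone_watermark_ok_safe(long free_pages, long nr_isolate_pageblocks,
                                       long watermark)
    {
        /* Pages in MIGRATE_ISOLATE pageblocks look free but cannot be used. */
        free_pages -= nr_isolate_pageblocks * PAGEBLOCK_NR_PAGES;
        if (free_pages < 0)
            free_pages = 0;

        return free_pages >= watermark;
    }

    int main(void)
    {
        long free = 3000, watermark = 2000;

        /* Ignoring isolation, kswapd would think all is well and go to sleep. */
        printf("ignoring isolation: %s\n",
               zone_watermark_ok_safe(free, 0, watermark) ?
               "balanced" : "keep reclaiming");

        /* With 4 isolated pageblocks (2048 unusable "free" pages), it keeps working. */
        printf("counting isolation: %s\n",
               zone_watermark_ok_safe(free, 4, watermark) ?
               "balanced" : "keep reclaiming");
        return 0;
    }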
     
  • __zone_watermark_ok() currently compares free_pages, which is a signed
    type, with z->lowmem_reserve[classzone_idx], which is unsigned. This
    might lead to sign overflow if free_pages doesn't satisfy the given
    order (or was already negative on entry), and then we rely on the
    following order loop to fix it up (which doesn't work for order-0).
    Let's fix the type conversion and not rely on the given value of
    free_pages or on follow-up fixups.

    This fix matters because "memory-hotplug: fix kswapd looping forever
    problem" depends on it.

    As a benefit, this patch no longer relies on the order loop to exit
    __zone_watermark_ok() in the high-order case and makes the first
    free_pages test effective.
    Tested-by: Aaditya Kumar
    Signed-off-by: Aaditya Kumar
    Signed-off-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
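
    A short demonstration of the conversion pitfall the entry above fixes:
    when a signed value is compared against an unsigned one, the signed
    operand is converted to unsigned, so a negative free_pages count silently
    passes the check unless the comparison is done in a signed type.

    #include <stdio.h>

    int main(void)
    {
        long free_pages = -64;                 /* already negative on entry */
        unsigned long min_plus_reserve = 128;  /* watermark + lowmem_reserve */

        /* The usual arithmetic conversions turn free_pages into a huge
         * unsigned value here, so this wrongly claims the watermark holds. */
        if (free_pages >= min_plus_reserve)
            printf("unsigned compare: watermark OK (wrong!)\n");

        /* Doing the comparison in a signed type gives the intended result. */
        if (free_pages < (long)min_plus_reserve)
            printf("signed compare:   watermark not met (correct)\n");
        return 0;
    }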
     
  • mm/page_alloc.c has some memory isolation functions, but they are used
    only when CONFIG_{CMA|MEMORY_HOTPLUG|MEMORY_FAILURE} is enabled. So
    let's make them configurable via a new CONFIG_MEMORY_ISOLATION option.
    This reduces the binary size, and we can then simply check for
    CONFIG_MEMORY_ISOLATION instead of for
    CONFIG_{CMA|MEMORY_HOTPLUG|MEMORY_FAILURE}.

    Signed-off-by: Minchan Kim
    Cc: Andi Kleen
    Cc: Marek Szyprowski
    Acked-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Mark functions used by both boot and memory hotplug as __meminit to reduce
    memory footprint when memory hotplug is disabled.

    Also guard zone_pcp_update() with CONFIG_MEMORY_HOTPLUG because it's
    only used by memory hotplug code.

    Signed-off-by: Jiang Liu
    Cc: Wei Wang
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rusty Russell
    Cc: Yinghai Lu
    Cc: Tony Luck
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: David Rientjes
    Cc: Keping Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • When a zone becomes empty after memory offlining, free zone->pageset.
    Otherwise it will cause memory leak when adding memory to the empty zone
    again because build_all_zonelists() will allocate zone->pageset for an
    empty zone.

    Signed-off-by: Jiang Liu
    Signed-off-by: Wei Wang
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rusty Russell
    Cc: Yinghai Lu
    Cc: Tony Luck
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: David Rientjes
    Cc: Keping Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu