Commit 78986a678f6ec3759a01976749f4437d8bf2d6c3

Authored by Mel Gorman
Committed by Linus Torvalds
1 parent ceddc3a52d

page-allocator: limit the number of MIGRATE_RESERVE pageblocks per zone

After anti-fragmentation was merged, a bug was reported whereby devices
that depended on high-order atomic allocations were failing.  The solution
was to preserve a property in the buddy allocator which tended to keep the
minimum number of free pages in the zone at the lower physical addresses
and contiguous.  To preserve this property, MIGRATE_RESERVE was introduced
and a number of pageblocks at the start of a zone would be marked
"reserve", the number of which depended on min_free_kbytes.

Anti-fragmentation works by avoiding the mixing of page migratetypes
within the same pageblock.  One way of helping this is to increase
min_free_kbytes because it becomes less likely that it will be necessary
to place pages of different migratetypes in the same pageblock.  However,
because the number of MIGRATE_RESERVE blocks is unbounded, the free memory
is kept there in large contiguous blocks instead of helping
anti-fragmentation as much as it should.  With the page-allocator
tracepoint patches applied, it was found during anti-fragmentation tests
that the number of fragmentation-related events was far higher than
expected even with min_free_kbytes at higher values.

This patch limits the number of MIGRATE_RESERVE blocks that exist per zone
to two.  For example, with a sufficient min_free_kbytes, 4MB of memory
will be kept aside on x86-64 and remain more or less free and
contiguous for the system's uptime.  This should be sufficient for devices
depending on high-order atomic allocations while helping fragmentation
control when min_free_kbytes is tuned appropriately.  A side-effect of
this patch is that the reserve variable is converted to int as unsigned
long was the wrong type to use when ensuring that only the required number
of reserve blocks are created.
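
The arithmetic behind the 4MB figure can be sketched in userspace C.  This
is only an illustration, not the kernel code: it assumes 4 KiB base pages
and a pageblock order of 9 (512 pages, 2 MiB per pageblock), which matches
the x86-64 example above, and the helper names roundup_to(),
uncapped_reserve_blocks(), reserve_blocks() and reserve_bytes() are
hypothetical stand-ins.

```c
/*
 * Assumed values for illustration only: 4 KiB base pages and
 * pageblock order 9, i.e. 512 pages (2 MiB) per pageblock.
 */
#define PAGE_SIZE_BYTES     4096UL
#define PAGEBLOCK_ORDER     9
#define PAGEBLOCK_NR_PAGES  (1UL << PAGEBLOCK_ORDER)
#define MAX_RESERVE_BLOCKS  2

/* Userspace stand-in for roundup(): round x up to a multiple of y. */
static unsigned long roundup_to(unsigned long x, unsigned long y)
{
	return ((x + y - 1) / y) * y;
}

/* Block count before the patch: grows with the min watermark. */
static int uncapped_reserve_blocks(unsigned long min_wmark_pages)
{
	return (int)(roundup_to(min_wmark_pages, PAGEBLOCK_NR_PAGES) >>
							PAGEBLOCK_ORDER);
}

/* Block count after the patch: same calculation, capped at two. */
static int reserve_blocks(unsigned long min_wmark_pages)
{
	int reserve = uncapped_reserve_blocks(min_wmark_pages);

	return reserve < MAX_RESERVE_BLOCKS ? reserve : MAX_RESERVE_BLOCKS;
}

/* Memory held in reserve pageblocks, in bytes. */
static unsigned long reserve_bytes(unsigned long min_wmark_pages)
{
	return (unsigned long)reserve_blocks(min_wmark_pages) *
		PAGEBLOCK_NR_PAGES * PAGE_SIZE_BYTES;
}
```

Under these assumptions, a min watermark of 4096 pages would previously
have marked 8 pageblocks (16 MiB) as reserve; with the cap it marks 2
pageblocks, or 4 MiB.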

With the patches applied, fragmentation-related events as measured by the
page allocator tracepoints were significantly reduced when running some
fragmentation stress-tests on systems with min_free_kbytes tuned to a
value appropriate for hugepage allocations at runtime.  On x86, the events
recorded were reduced by 99.8%, on x86-64 by 99.72% and on ppc64 by
99.83%.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Showing 1 changed file with 11 additions and 1 deletion

@@ -2836,13 +2836,23 @@
 {
 	unsigned long start_pfn, pfn, end_pfn;
 	struct page *page;
-	unsigned long reserve, block_migratetype;
+	unsigned long block_migratetype;
+	int reserve;
 
 	/* Get the start pfn, end pfn and the number of blocks to reserve */
 	start_pfn = zone->zone_start_pfn;
 	end_pfn = start_pfn + zone->spanned_pages;
 	reserve = roundup(min_wmark_pages(zone), pageblock_nr_pages) >>
 							pageblock_order;
+
+	/*
+	 * Reserve blocks are generally in place to help high-order atomic
+	 * allocations that are short-lived. A min_free_kbytes value that
+	 * would result in more than 2 reserve blocks for atomic allocations
+	 * is assumed to be in place to help anti-fragmentation for the
+	 * future allocation of hugepages at runtime.
+	 */
+	reserve = min(2, reserve);
 
 	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
 		if (!pfn_valid(pfn))