page-allocator: limit the number of MIGRATE_RESERVE pageblocks per zone

After anti-fragmentation was merged, a bug was reported whereby devices that depended on high-order atomic allocations were failing. The solution was to preserve a property in the buddy allocator which tended to keep the minimum number of free pages in the zone at the lower physical addresses and contiguous. To preserve this property, MIGRATE_RESERVE was introduced and a number of pageblocks at the start of a zone would be marked "reserve", the number of which depended on min_free_kbytes. Anti-fragmentation works by avoiding the mixing of page migratetypes within the same pageblock. One way of helping this is to increase min_free_kbytes because it becomes less like that it will be necessary to place pages of of MIGRATE_RESERVE is unbounded, the free memory is kept there in large contiguous blocks instead of helping anti-fragmentation as much as it should. With the page-allocator tracepoint patches applied, it was found during anti-fragmentation tests that the number of fragmentation-related events were far higher than expected even with min_free_kbytes at higher values. This patch limits the number of MIGRATE_RESERVE blocks that exist per zone to two. For example, with a sufficient min_free_kbytes, 4MB of memory will be kept aside on an x86-64 and remain more or less free and contiguous for the systems uptime. This should be sufficient for devices depending on high-order atomic allocations while helping fragmentation control when min_free_kbytes is tuned appropriately. As side-effect of this patch is that the reserve variable is converted to int as unsigned long was the wrong type to use when ensuring that only the required number of reserve blocks are created. With the patches applied, fragmentation-related events as measured by the page allocator tracepoints were significantly reduced when running some fragmentation stress-tests on systems with min_free_kbytes tuned to a value appropriate for hugepage allocations at runtime. On x86, the events recorded were reduced by 99.8%, on x86-64 by 99.72% and on ppc64 by 99.83%. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

page-allocator: limit the number of MIGRATE_RESERVE pageblocks per zone
After anti-fragmentation was merged, a bug was reported whereby devices that depended on high-order atomic allocations were failing. The solution was to preserve a property in the buddy allocator which tended to keep the minimum number of free pages in the zone at the lower physical addresses and contiguous. To preserve this property, MIGRATE_RESERVE was introduced and a number of pageblocks at the start of a zone would be marked "reserve", the number of which depended on min_free_kbytes. Anti-fragmentation works by avoiding the mixing of page migratetypes within the same pageblock. One way of helping this is to increase min_free_kbytes because it becomes less like that it will be necessary to place pages of of MIGRATE_RESERVE is unbounded, the free memory is kept there in large contiguous blocks instead of helping anti-fragmentation as much as it should. With the page-allocator tracepoint patches applied, it was found during anti-fragmentation tests that the number of fragmentation-related events were far higher than expected even with min_free_kbytes at higher values. This patch limits the number of MIGRATE_RESERVE blocks that exist per zone to two. For example, with a sufficient min_free_kbytes, 4MB of memory will be kept aside on an x86-64 and remain more or less free and contiguous for the systems uptime. This should be sufficient for devices depending on high-order atomic allocations while helping fragmentation control when min_free_kbytes is tuned appropriately. As side-effect of this patch is that the reserve variable is converted to int as unsigned long was the wrong type to use when ensuring that only the required number of reserve blocks are created. With the patches applied, fragmentation-related events as measured by the page allocator tracepoints were significantly reduced when running some fragmentation stress-tests on systems with min_free_kbytes tuned to a value appropriate for hugepage allocations at runtime. On x86, the events recorded were reduced by 99.8%, on x86-64 by 99.72% and on ppc64 by 99.83%. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mel Gorman · Linus Torvalds
1 parent ceddc3a52d
Showing 1 changed file with 11 additions and 1 deletions Side-by-side Diff
mm/page_alloc.c
@@ -2836,13 +2836,23 @@
 {
 	unsigned long start_pfn, pfn, end_pfn;
 	struct page *page;
-	unsigned long reserve, block_migratetype;
+	unsigned long block_migratetype;
+	int reserve;
  
 	/* Get the start pfn, end pfn and the number of blocks to reserve */
 	start_pfn = zone->zone_start_pfn;
 	end_pfn = start_pfn + zone->spanned_pages;
 	reserve = roundup(min_wmark_pages(zone), pageblock_nr_pages) >>
 							pageblock_order;
+
+	/*
+	 * Reserve blocks are generally in place to help high-order atomic
+	 * allocations that are short-lived. A min_free_kbytes value that
+	 * would result in more than 2 reserve blocks for atomic allocations
+	 * is assumed to be in place to help anti-fragmentation for the
+	 * future allocation of hugepages at runtime.
+	 */
+	reserve = min(2, reserve);
  
 	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
 		if (!pfn_valid(pfn))
...	...	@@ -2836,13 +2836,23 @@
2836	2836	{
2837	2837	unsigned long start_pfn, pfn, end_pfn;
2838	2838	struct page *page;
2839		- unsigned long reserve, block_migratetype;
	2839	+ unsigned long block_migratetype;
	2840	+ int reserve;
2840	2841
2841	2842	/* Get the start pfn, end pfn and the number of blocks to reserve */
2842	2843	start_pfn = zone->zone_start_pfn;
2843	2844	end_pfn = start_pfn + zone->spanned_pages;
2844	2845	reserve = roundup(min_wmark_pages(zone), pageblock_nr_pages) >>
2845	2846	pageblock_order;
	2847	+
	2848	+ /*
	2849	+ * Reserve blocks are generally in place to help high-order atomic
	2850	+ * allocations that are short-lived. A min_free_kbytes value that
	2851	+ * would result in more than 2 reserve blocks for atomic allocations
	2852	+ * is assumed to be in place to help anti-fragmentation for the
	2853	+ * future allocation of hugepages at runtime.
	2854	+ */
	2855	+ reserve = min(2, reserve);
2846	2856
2847	2857	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
2848	2858	if (!pfn_valid(pfn))