Commit fa5e084e43eb14c14942027e1e2e894aeed96097

Authored by Mel Gorman
Committed by Linus Torvalds
1 parent 90afa5de6f

vmscan: do not unconditionally treat zones that fail zone_reclaim() as full

On NUMA machines, the administrator can configure zone_reclaim_mode, which
is a more targeted form of direct reclaim.  On machines with large NUMA
distances, for example, zone_reclaim_mode defaults to 1, meaning that
clean unmapped pages will be reclaimed if the zone watermarks are not
being met.  The problem is that zone_reclaim() failing at all means the
zone gets marked full.

This can cause situations where a zone is usable, but is being skipped
because it has been considered full.  Take a situation where a large tmpfs
mount is occupying a large percentage of memory overall.  The pages do not
get cleaned or reclaimed by zone_reclaim(), but the zone gets marked full
and the zonelist cache considers them not worth trying in the future.

This patch makes zone_reclaim() return more fine-grained information about
what occurred when zone_reclaim() failed.  The zone only gets marked full
if it really is unreclaimable.  If the scan did not occur, or if not
enough pages were reclaimed with the limited reclaim_mode, then the zone
is simply skipped.

There is a side-effect to this patch.  Currently, if zone_reclaim()
successfully reclaimed SWAP_CLUSTER_MAX pages, an allocation attempt would
go ahead.  With this patch applied, zone watermarks are rechecked after
zone_reclaim() does some work.

This bug was introduced by commit 9276b1bc96a132f4068fdee00983c532f43d3a26
("memory page_alloc zonelist caching speedup") way back in 2.6.19 when the
zonelist_cache was introduced.  It was not intended that zone_reclaim()
aggressively consider the zone to be full when it failed as full direct
reclaim can still be an option.  Due to the age of the bug, it should be
considered a -stable candidate.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Showing 3 changed files with 32 additions and 9 deletions

... ... @@ -259,5 +259,9 @@
259 259 unsigned long start, int len, int flags,
260 260 struct page **pages, struct vm_area_struct **vmas);
261 261  
  262 +#define ZONE_RECLAIM_NOSCAN -2
  263 +#define ZONE_RECLAIM_FULL -1
  264 +#define ZONE_RECLAIM_SOME 0
  265 +#define ZONE_RECLAIM_SUCCESS 1
262 266 #endif
... ... @@ -1462,15 +1462,33 @@
1462 1462 BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
1463 1463 if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
1464 1464 unsigned long mark;
  1465 + int ret;
  1466 +
1465 1467 mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
1466   - if (!zone_watermark_ok(zone, order, mark,
1467   - classzone_idx, alloc_flags)) {
1468   - if (!zone_reclaim_mode ||
1469   - !zone_reclaim(zone, gfp_mask, order))
  1468 + if (zone_watermark_ok(zone, order, mark,
  1469 + classzone_idx, alloc_flags))
  1470 + goto try_this_zone;
  1471 +
  1472 + if (zone_reclaim_mode == 0)
  1473 + goto this_zone_full;
  1474 +
  1475 + ret = zone_reclaim(zone, gfp_mask, order);
  1476 + switch (ret) {
  1477 + case ZONE_RECLAIM_NOSCAN:
  1478 + /* did not scan */
  1479 + goto try_next_zone;
  1480 + case ZONE_RECLAIM_FULL:
  1481 + /* scanned but unreclaimable */
  1482 + goto this_zone_full;
  1483 + default:
  1484 + /* did we reclaim enough */
  1485 + if (!zone_watermark_ok(zone, order, mark,
  1486 + classzone_idx, alloc_flags))
1470 1487 goto this_zone_full;
1471 1488 }
1472 1489 }
1473 1490  
  1491 +try_this_zone:
1474 1492 page = buffered_rmqueue(preferred_zone, zone, order,
1475 1493 gfp_mask, migratetype);
1476 1494 if (page)
... ... @@ -2492,16 +2492,16 @@
2492 2492 */
2493 2493 if (zone_pagecache_reclaimable(zone) <= zone->min_unmapped_pages &&
2494 2494 zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
2495   - return 0;
  2495 + return ZONE_RECLAIM_FULL;
2496 2496  
2497 2497 if (zone_is_all_unreclaimable(zone))
2498   - return 0;
  2498 + return ZONE_RECLAIM_FULL;
2499 2499  
2500 2500 /*
2501 2501 * Do not scan if the allocation should not be delayed.
2502 2502 */
2503 2503 if (!(gfp_mask & __GFP_WAIT) || (current->flags & PF_MEMALLOC))
2504   - return 0;
  2504 + return ZONE_RECLAIM_NOSCAN;
2505 2505  
2506 2506 /*
2507 2507 * Only run zone reclaim on the local zone or on zones that do not
2508 2508  
... ... @@ -2511,10 +2511,11 @@
2511 2511 */
2512 2512 node_id = zone_to_nid(zone);
2513 2513 if (node_state(node_id, N_CPU) && node_id != numa_node_id())
2514   - return 0;
  2514 + return ZONE_RECLAIM_NOSCAN;
2515 2515  
2516 2516 if (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
2517   - return 0;
  2517 + return ZONE_RECLAIM_NOSCAN;
  2518 +
2518 2519 ret = __zone_reclaim(zone, gfp_mask, order);
2519 2520 zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);
2520 2521