Commit 92df3a723f84cdf8133560bbff950a7a99e92bc9

Authored by Mel Gorman
Committed by Linus Torvalds
1 parent f84f6e2b08

mm: vmscan: throttle reclaim if encountering too many dirty pages under writeback

Workloads that allocate frequently and write files place a large number
of dirty pages on the LRU.  With use-once logic, it is possible for them
to reach the end of the LRU quickly, requiring the reclaimer to scan
more pages to find clean ones.  Ordinarily, processes that are dirtying
memory get throttled by dirty balancing, but this is a global heuristic
and does not take into account that LRUs are maintained on a per-zone
basis.  This can lead to a situation whereby reclaim scans heavily,
skipping over a large number of pages under writeback and recycling them
around the LRU, consuming CPU.

This patch counts how many of the pages isolated from the LRU were dirty
and how many were under writeback.  If a percentage of them are under
writeback, the process will be throttled when a backing device or the
zone is congested.  Note that this applies whether it is anonymous or
file-backed pages that are under writeback, meaning that swapping is
potentially throttled.  This is intentional: if the swap device is
congested, scanning more pages and dispatching more IO is not going to
help matters.
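
For orientation, here is a condensed sketch of the control flow this
patch adds to shrink_inactive_list(); it is drawn from the diff below
and is not a drop-in (declarations and surrounding context omitted):

	unsigned long nr_dirty = 0, nr_writeback = 0;

	/* shrink_page_list() now reports the counts via out-parameters */
	nr_reclaimed = shrink_page_list(&page_list, zone, sc, priority,
					&nr_dirty, &nr_writeback);

	/* Stall up to 100ms, but only if the zone or device is congested */
	if (nr_writeback && nr_writeback >= (nr_taken >> (DEF_PRIORITY - priority)))
		wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);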

The percentage that must be under writeback depends on the priority.  At
default priority, all of them must be under writeback.  At
DEF_PRIORITY-1, 50% of them must be; at DEF_PRIORITY-2, 25%; and so on.
That is, as pressure increases, the more likely it is that the process
will be throttled, allowing the flusher threads to make some progress.
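
To make the backoff concrete, here is an illustrative helper
(hypothetical; the patch open-codes this check) with the thresholds
worked out for SWAP_CLUSTER_MAX (32) isolated pages, assuming
DEF_PRIORITY is 12:

	static bool writeback_throttle_needed(unsigned long nr_writeback,
					      unsigned long nr_taken,
					      int priority)
	{
		/*
		 * With nr_taken == 32:
		 *   DEF_PRIORITY:   32 >> 0 == 32 pages (100% under writeback)
		 *   DEF_PRIORITY-1: 32 >> 1 == 16 pages (50%)
		 *   DEF_PRIORITY-2: 32 >> 2 ==  8 pages (25%)
		 *   DEF_PRIORITY-6: 32 >> 6 ==  0, so a single writeback
		 *                   page is enough to throttle
		 */
		return nr_writeback &&
		       nr_writeback >= (nr_taken >> (DEF_PRIORITY - priority));
	}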

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Johannes Weiner <jweiner@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Alex Elder <aelder@sgi.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Showing 1 changed file with 39 additions and 3 deletions

@@ -751,7 +751,9 @@
 static unsigned long shrink_page_list(struct list_head *page_list,
				      struct zone *zone,
				      struct scan_control *sc,
-				      int priority)
+				      int priority,
+				      unsigned long *ret_nr_dirty,
+				      unsigned long *ret_nr_writeback)
 {
	LIST_HEAD(ret_pages);
	LIST_HEAD(free_pages);
@@ -759,6 +761,7 @@
	unsigned long nr_dirty = 0;
	unsigned long nr_congested = 0;
	unsigned long nr_reclaimed = 0;
+	unsigned long nr_writeback = 0;

	cond_resched();

@@ -795,6 +798,7 @@
			(PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));

		if (PageWriteback(page)) {
+			nr_writeback++;
			/*
			 * Synchronous reclaim cannot queue pages for
			 * writeback due to the possibility of stack overflow
@@ -1000,6 +1004,8 @@

	list_splice(&ret_pages, page_list);
	count_vm_events(PGACTIVATE, pgactivate);
+	*ret_nr_dirty += nr_dirty;
+	*ret_nr_writeback += nr_writeback;
	return nr_reclaimed;
 }

@@ -1460,6 +1466,8 @@
	unsigned long nr_taken;
	unsigned long nr_anon;
	unsigned long nr_file;
+	unsigned long nr_dirty = 0;
+	unsigned long nr_writeback = 0;
	isolate_mode_t reclaim_mode = ISOLATE_INACTIVE;

	while (unlikely(too_many_isolated(zone, file, sc))) {
@@ -1512,12 +1520,14 @@

	spin_unlock_irq(&zone->lru_lock);

-	nr_reclaimed = shrink_page_list(&page_list, zone, sc, priority);
+	nr_reclaimed = shrink_page_list(&page_list, zone, sc, priority,
+						&nr_dirty, &nr_writeback);

	/* Check if we should syncronously wait for writeback */
	if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
		set_reclaim_mode(priority, sc, true);
-		nr_reclaimed += shrink_page_list(&page_list, zone, sc, priority);
+		nr_reclaimed += shrink_page_list(&page_list, zone, sc,
+					priority, &nr_dirty, &nr_writeback);
	}

	local_irq_disable();
@@ -1526,6 +1536,32 @@
	__count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);

	putback_lru_pages(zone, sc, nr_anon, nr_file, &page_list);
+
+	/*
+	 * If reclaim is isolating dirty pages under writeback, it implies
+	 * that the long-lived page allocation rate is exceeding the page
+	 * laundering rate. Either the global limits are not being effective
+	 * at throttling processes due to the page distribution throughout
+	 * zones or there is heavy usage of a slow backing device. The
+	 * only option is to throttle from reclaim context which is not ideal
+	 * as there is no guarantee the dirtying process is throttled in the
+	 * same way balance_dirty_pages() manages.
+	 *
+	 * This scales the number of dirty pages that must be under writeback
+	 * before throttling depending on priority. It is a simple backoff
+	 * function that has the most effect in the range DEF_PRIORITY to
+	 * DEF_PRIORITY-2 which is the priority range where reclaim is
+	 * considered to be in trouble.
+	 *
+	 * DEF_PRIORITY   100% isolated pages must be PageWriteback to throttle
+	 * DEF_PRIORITY-1 50% must be PageWriteback
+	 * DEF_PRIORITY-2 25% must be PageWriteback, kswapd in trouble
+	 * ...
+	 * DEF_PRIORITY-6 For SWAP_CLUSTER_MAX isolated pages, throttle if any
+	 *                isolated page is PageWriteback
+	 */
+	if (nr_writeback && nr_writeback >= (nr_taken >> (DEF_PRIORITY-priority)))
+		wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);

	trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id,
					    zone_idx(zone),