Commit 5a53748568f79641eaf40e41081a2f4987f005c2

Authored by Maxim Patlasov
Committed by Linus Torvalds
1 parent 4c3bffc272

mm/page-writeback.c: add strictlimit feature

The feature prevents mistrusted filesystems (i.e. FUSE mounts created by
unprivileged users) from growing a large number of dirty pages before
throttling.  For such filesystems balance_dirty_pages always checks bdi
counters against bdi limits.  I.e. even if the global "nr_dirty" is under
"freerun", it is not allowed to skip the bdi checks.  The only use case
for now is fuse: it sets bdi max_ratio to 1% by default and system
administrators are supposed to expect that this limit won't be exceeded.

The feature is on if a BDI is marked by BDI_CAP_STRICTLIMIT flag.  A
filesystem may set the flag when it initializes its BDI.

The problematic scenario comes from the fact that nobody pays attention to
the NR_WRITEBACK_TEMP counter (i.e.  number of pages under fuse
writeback).  The implementation of fuse writeback releases original page
(by calling end_page_writeback) almost immediately.  A fuse request queued
for real processing bears a copy of original page.  Hence, if userspace
fuse daemon doesn't finalize write requests in timely manner, an
aggressive mmap writer can pollute virtually all memory by those temporary
fuse page copies.  They are carefully accounted in NR_WRITEBACK_TEMP, but
nobody cares.

To make further explanations shorter, let me use "NR_WRITEBACK_TEMP
problem" as a shortcut for "a possibility of uncontrolled growth of the
amount of RAM consumed by temporary pages allocated by kernel fuse to
process writeback".

The problem was very easy to reproduce.  There is a trivial example
filesystem implementation in the fuse userspace distribution:
fusexmp_fh.c.  I added "sleep(1);" to the write methods, then recompiled
and mounted it.  Then I created a huge file on the mount point and ran a
simple program which mmap-ed the file to a memory region and wrote data to
the region.  An hour later I observed almost all RAM consumed by fuse
writeback.  Since then some unrelated changes in kernel fuse have made it
more difficult to reproduce, but it is still possible now.

Putting this theoretical happens-in-the-lab thing aside, there is another
thing that really hurts real-world (FUSE) users.  This is the
write-through page cache policy FUSE currently uses.  I.e. when handling
write(2), kernel fuse populates the page cache and flushes user data to
the server synchronously.  This is excessively suboptimal.  Pavel
Emelyanov's patches ("writeback cache policy") solve the problem, but they
also make resolving the NR_WRITEBACK_TEMP problem absolutely necessary.
Otherwise, simply copying a huge file to a fuse mount would result in
memory starvation.  Miklos, the maintainer of FUSE, believes the
strictlimit feature is the way to go.

And eventually, putting FUSE topics aside, there is one more use case for
the strictlimit feature.  Using a slow USB stick (mass storage) in a
machine with a huge amount of RAM installed is a well-known pain.  Let's
make a simple computation.  Assuming 64GB of RAM installed, the existing
implementation of balance_dirty_pages will start throttling only after
9.6GB of RAM becomes dirty (freerun == 15% of total RAM).  So the command
"cp 9GB_file /media/my-usb-storage/" may return in a few seconds, but the
subsequent "umount /media/my-usb-storage/" will take more than two hours
if the effective throughput of the storage is, say, 1MB/sec.

After inclusion of the strictlimit feature, it will be trivial to add a
knob (e.g.  /sys/devices/virtual/bdi/x:y/strictlimit) to enable it on
demand, manually or via a udev rule.  Maybe I'm wrong, but it seems quite
a natural desire to limit the amount of dirty memory for devices we do not
fully trust (in the sense of sustainable throughput).

[akpm@linux-foundation.org: fix warning in page-writeback.c]
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Showing 3 changed files with 206 additions and 62 deletions

fs/fuse/inode.c
... ... @@ -930,7 +930,7 @@
930 930 fc->bdi.name = "fuse";
931 931 fc->bdi.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
932 932 /* fuse does it's own writeback accounting */
933   - fc->bdi.capabilities = BDI_CAP_NO_ACCT_WB;
  933 + fc->bdi.capabilities = BDI_CAP_NO_ACCT_WB | BDI_CAP_STRICTLIMIT;
934 934  
935 935 err = bdi_init(&fc->bdi);
936 936 if (err)
include/linux/backing-dev.h
... ... @@ -243,6 +243,8 @@
243 243 * BDI_CAP_EXEC_MAP: Can be mapped for execution
244 244 *
245 245 * BDI_CAP_SWAP_BACKED: Count shmem/tmpfs objects as swap-backed.
  246 + *
  247 + * BDI_CAP_STRICTLIMIT: Keep number of dirty pages below bdi threshold.
246 248 */
247 249 #define BDI_CAP_NO_ACCT_DIRTY 0x00000001
248 250 #define BDI_CAP_NO_WRITEBACK 0x00000002
... ... @@ -254,6 +256,7 @@
254 256 #define BDI_CAP_NO_ACCT_WB 0x00000080
255 257 #define BDI_CAP_SWAP_BACKED 0x00000100
256 258 #define BDI_CAP_STABLE_WRITES 0x00000200
  259 +#define BDI_CAP_STRICTLIMIT 0x00000400
257 260  
258 261 #define BDI_CAP_VMFLAGS \
259 262 (BDI_CAP_READ_MAP | BDI_CAP_WRITE_MAP | BDI_CAP_EXEC_MAP)
mm/page-writeback.c
... ... @@ -585,6 +585,37 @@
585 585 }
586 586  
587 587 /*
  588 + * setpoint - dirty 3
  589 + * f(dirty) := 1.0 + (----------------)
  590 + * limit - setpoint
  591 + *
  592 + * it's a 3rd order polynomial that subjects to
  593 + *
  594 + * (1) f(freerun) = 2.0 => rampup dirty_ratelimit reasonably fast
  595 + * (2) f(setpoint) = 1.0 => the balance point
  596 + * (3) f(limit) = 0 => the hard limit
  597 + * (4) df/dx <= 0 => negative feedback control
  598 + * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
  599 + * => fast response on large errors; small oscillation near setpoint
  600 + */
  601 +static inline long long pos_ratio_polynom(unsigned long setpoint,
  602 + unsigned long dirty,
  603 + unsigned long limit)
  604 +{
  605 + long long pos_ratio;
  606 + long x;
  607 +
  608 + x = div_s64(((s64)setpoint - (s64)dirty) << RATELIMIT_CALC_SHIFT,
  609 + limit - setpoint + 1);
  610 + pos_ratio = x;
  611 + pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
  612 + pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
  613 + pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
  614 +
  615 + return clamp(pos_ratio, 0LL, 2LL << RATELIMIT_CALC_SHIFT);
  616 +}
  617 +
  618 +/*
588 619 * Dirty position control.
589 620 *
590 621 * (o) global/bdi setpoints
591 622  
592 623  
593 624  
594 625  
... ... @@ -682,27 +713,81 @@
682 713 /*
683 714 * global setpoint
684 715 *
685   - * setpoint - dirty 3
686   - * f(dirty) := 1.0 + (----------------)
687   - * limit - setpoint
  716 + * See comment for pos_ratio_polynom().
  717 + */
  718 + setpoint = (freerun + limit) / 2;
  719 + pos_ratio = pos_ratio_polynom(setpoint, dirty, limit);
  720 +
  721 + /*
  722 + * The strictlimit feature is a tool preventing mistrusted filesystems
  723 + * from growing a large number of dirty pages before throttling. For
  724 + * such filesystems balance_dirty_pages always checks bdi counters
  725 + * against bdi limits. Even if global "nr_dirty" is under "freerun".
  726 + * This is especially important for fuse which sets bdi->max_ratio to
  727 + * 1% by default. Without strictlimit feature, fuse writeback may
  728 + * consume arbitrary amount of RAM because it is accounted in
  729 + * NR_WRITEBACK_TEMP which is not involved in calculating "nr_dirty".
688 730 *
689   - * it's a 3rd order polynomial that subjects to
  731 + * Here, in bdi_position_ratio(), we calculate pos_ratio based on
  732 + * two values: bdi_dirty and bdi_thresh. Let's consider an example:
  733 + * total amount of RAM is 16GB, bdi->max_ratio is equal to 1%, global
  734 + * limits are set by default to 10% and 20% (background and throttle).
  735 + * Then bdi_thresh is 1% of 20% of 16GB. This amounts to ~8K pages.
  736 + * bdi_dirty_limit(bdi, bg_thresh) is about ~4K pages. bdi_setpoint is
  737 + * about ~6K pages (as the average of background and throttle bdi
  738 + * limits). The 3rd order polynomial will provide positive feedback if
  739 + * bdi_dirty is under bdi_setpoint and vice versa.
690 740 *
691   - * (1) f(freerun) = 2.0 => rampup dirty_ratelimit reasonably fast
692   - * (2) f(setpoint) = 1.0 => the balance point
693   - * (3) f(limit) = 0 => the hard limit
694   - * (4) df/dx <= 0 => negative feedback control
695   - * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
696   - * => fast response on large errors; small oscillation near setpoint
  741 + * Note that we cannot use global counters in these calculations
  742 + * because we want to throttle process writing to a strictlimit BDI
  743 + * much earlier than global "freerun" is reached (~23MB vs. ~2.3GB
  744 + * in the example above).
697 745 */
698   - setpoint = (freerun + limit) / 2;
699   - x = div_s64(((s64)setpoint - (s64)dirty) << RATELIMIT_CALC_SHIFT,
700   - limit - setpoint + 1);
701   - pos_ratio = x;
702   - pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
703   - pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
704   - pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
  746 + if (unlikely(bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
  747 + long long bdi_pos_ratio;
  748 + unsigned long bdi_bg_thresh;
705 749  
  750 + if (bdi_dirty < 8)
  751 + return min_t(long long, pos_ratio * 2,
  752 + 2 << RATELIMIT_CALC_SHIFT);
  753 +
  754 + if (bdi_dirty >= bdi_thresh)
  755 + return 0;
  756 +
  757 + bdi_bg_thresh = div_u64((u64)bdi_thresh * bg_thresh, thresh);
  758 + bdi_setpoint = dirty_freerun_ceiling(bdi_thresh,
  759 + bdi_bg_thresh);
  760 +
  761 + if (bdi_setpoint == 0 || bdi_setpoint == bdi_thresh)
  762 + return 0;
  763 +
  764 + bdi_pos_ratio = pos_ratio_polynom(bdi_setpoint, bdi_dirty,
  765 + bdi_thresh);
  766 +
  767 + /*
  768 + * Typically, for strictlimit case, bdi_setpoint << setpoint
  769 + * and pos_ratio >> bdi_pos_ratio. In other words, the global
  770 + * state ("dirty") is not the limiting factor and we have to
  771 + * make decision based on bdi counters. But there is an
  772 + * important case when global pos_ratio should get precedence:
  773 + * global limits are exceeded (e.g. due to activities on other
  774 + * BDIs) while given strictlimit BDI is below limit.
  775 + *
  776 + * "pos_ratio * bdi_pos_ratio" would work for the case above,
  777 + * but it would look too non-natural for the case of all
  778 + * activity in the system coming from a single strictlimit BDI
  779 + * with bdi->max_ratio == 100%.
  780 + *
  781 + * Note that min() below somewhat changes the dynamics of the
  782 + * control system. Normally, pos_ratio value can be well over 3
  783 + * (when globally we are at freerun and bdi is well below bdi
  784 + * setpoint). Now the maximum pos_ratio in the same situation
  785 + * is 2. We might want to tweak this if we observe the control
  786 + * system is too slow to adapt.
  787 + */
  788 + return min(pos_ratio, bdi_pos_ratio);
  789 + }
  790 +
706 791 /*
707 792 * We have computed basic pos_ratio above based on global situation. If
708 793 * the bdi is over/under its share of dirty pages, we want to scale
... ... @@ -994,6 +1079,27 @@
994 1079 * keep that period small to reduce time lags).
995 1080 */
996 1081 step = 0;
  1082 +
  1083 + /*
  1084 + * For strictlimit case, calculations above were based on bdi counters
  1085 + * and limits (starting from pos_ratio = bdi_position_ratio() and up to
  1086 + * balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate).
  1087 + * Hence, to calculate "step" properly, we have to use bdi_dirty as
  1088 + * "dirty" and bdi_setpoint as "setpoint".
  1089 + *
  1090 + * We rampup dirty_ratelimit forcibly if bdi_dirty is low because
  1091 + * it's possible that bdi_thresh is close to zero due to inactivity
  1092 + * of backing device (see the implementation of bdi_dirty_limit()).
  1093 + */
  1094 + if (unlikely(bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
  1095 + dirty = bdi_dirty;
  1096 + if (bdi_dirty < 8)
  1097 + setpoint = bdi_dirty + 1;
  1098 + else
  1099 + setpoint = (bdi_thresh +
  1100 + bdi_dirty_limit(bdi, bg_thresh)) / 2;
  1101 + }
  1102 +
997 1103 if (dirty < setpoint) {
998 1104 x = min(bdi->balanced_dirty_ratelimit,
999 1105 min(balanced_dirty_ratelimit, task_ratelimit));
... ... @@ -1198,6 +1304,56 @@
1198 1304 return pages >= DIRTY_POLL_THRESH ? 1 + t / 2 : t;
1199 1305 }
1200 1306  
  1307 +static inline void bdi_dirty_limits(struct backing_dev_info *bdi,
  1308 + unsigned long dirty_thresh,
  1309 + unsigned long background_thresh,
  1310 + unsigned long *bdi_dirty,
  1311 + unsigned long *bdi_thresh,
  1312 + unsigned long *bdi_bg_thresh)
  1313 +{
  1314 + unsigned long bdi_reclaimable;
  1315 +
  1316 + /*
  1317 + * bdi_thresh is not treated as some limiting factor as
  1318 + * dirty_thresh, due to reasons
  1319 + * - in JBOD setup, bdi_thresh can fluctuate a lot
  1320 + * - in a system with HDD and USB key, the USB key may somehow
  1321 + * go into state (bdi_dirty >> bdi_thresh) either because
  1322 + * bdi_dirty starts high, or because bdi_thresh drops low.
  1323 + * In this case we don't want to hard throttle the USB key
  1324 + * dirtiers for 100 seconds until bdi_dirty drops under
  1325 + * bdi_thresh. Instead the auxiliary bdi control line in
  1326 + * bdi_position_ratio() will let the dirtier task progress
  1327 + * at some rate <= (write_bw / 2) for bringing down bdi_dirty.
  1328 + */
  1329 + *bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
  1330 +
  1331 + if (bdi_bg_thresh)
  1332 + *bdi_bg_thresh = div_u64((u64)*bdi_thresh *
  1333 + background_thresh,
  1334 + dirty_thresh);
  1335 +
  1336 + /*
  1337 + * In order to avoid the stacked BDI deadlock we need
  1338 + * to ensure we accurately count the 'dirty' pages when
  1339 + * the threshold is low.
  1340 + *
  1341 + * Otherwise it would be possible to get thresh+n pages
  1342 + * reported dirty, even though there are thresh-m pages
  1343 + * actually dirty; with m+n sitting in the percpu
  1344 + * deltas.
  1345 + */
  1346 + if (*bdi_thresh < 2 * bdi_stat_error(bdi)) {
  1347 + bdi_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
  1348 + *bdi_dirty = bdi_reclaimable +
  1349 + bdi_stat_sum(bdi, BDI_WRITEBACK);
  1350 + } else {
  1351 + bdi_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
  1352 + *bdi_dirty = bdi_reclaimable +
  1353 + bdi_stat(bdi, BDI_WRITEBACK);
  1354 + }
  1355 +}
  1356 +
1201 1357 /*
1202 1358 * balance_dirty_pages() must be called by processes which are generating dirty
1203 1359 * data. It looks at the number of dirty pages in the machine and will force
1204 1360  
1205 1361  
... ... @@ -1209,13 +1365,9 @@
1209 1365 unsigned long pages_dirtied)
1210 1366 {
1211 1367 unsigned long nr_reclaimable; /* = file_dirty + unstable_nfs */
1212   - unsigned long bdi_reclaimable;
1213 1368 unsigned long nr_dirty; /* = file_dirty + writeback + unstable_nfs */
1214   - unsigned long bdi_dirty;
1215   - unsigned long freerun;
1216 1369 unsigned long background_thresh;
1217 1370 unsigned long dirty_thresh;
1218   - unsigned long bdi_thresh;
1219 1371 long period;
1220 1372 long pause;
1221 1373 long max_pause;
1222 1374  
... ... @@ -1226,10 +1378,16 @@
1226 1378 unsigned long dirty_ratelimit;
1227 1379 unsigned long pos_ratio;
1228 1380 struct backing_dev_info *bdi = mapping->backing_dev_info;
  1381 + bool strictlimit = bdi->capabilities & BDI_CAP_STRICTLIMIT;
1229 1382 unsigned long start_time = jiffies;
1230 1383  
1231 1384 for (;;) {
1232 1385 unsigned long now = jiffies;
  1386 + unsigned long uninitialized_var(bdi_thresh);
  1387 + unsigned long thresh;
  1388 + unsigned long uninitialized_var(bdi_dirty);
  1389 + unsigned long dirty;
  1390 + unsigned long bg_thresh;
1233 1391  
1234 1392 /*
1235 1393 * Unstable writes are a feature of certain networked
1236 1394  
1237 1395  
1238 1396  
1239 1397  
1240 1398  
1241 1399  
... ... @@ -1243,61 +1401,44 @@
1243 1401  
1244 1402 global_dirty_limits(&background_thresh, &dirty_thresh);
1245 1403  
  1404 + if (unlikely(strictlimit)) {
  1405 + bdi_dirty_limits(bdi, dirty_thresh, background_thresh,
  1406 + &bdi_dirty, &bdi_thresh, &bg_thresh);
  1407 +
  1408 + dirty = bdi_dirty;
  1409 + thresh = bdi_thresh;
  1410 + } else {
  1411 + dirty = nr_dirty;
  1412 + thresh = dirty_thresh;
  1413 + bg_thresh = background_thresh;
  1414 + }
  1415 +
1246 1416 /*
1247 1417 * Throttle it only when the background writeback cannot
1248 1418 * catch-up. This avoids (excessively) small writeouts
1249   - * when the bdi limits are ramping up.
  1419 + * when the bdi limits are ramping up in case of !strictlimit.
  1420 + *
  1421 + * In strictlimit case make decision based on the bdi counters
  1422 + * and limits. Small writeouts when the bdi limits are ramping
  1423 + * up are the price we consciously pay for strictlimit-ing.
1250 1424 */
1251   - freerun = dirty_freerun_ceiling(dirty_thresh,
1252   - background_thresh);
1253   - if (nr_dirty <= freerun) {
  1425 + if (dirty <= dirty_freerun_ceiling(thresh, bg_thresh)) {
1254 1426 current->dirty_paused_when = now;
1255 1427 current->nr_dirtied = 0;
1256 1428 current->nr_dirtied_pause =
1257   - dirty_poll_interval(nr_dirty, dirty_thresh);
  1429 + dirty_poll_interval(dirty, thresh);
1258 1430 break;
1259 1431 }
1260 1432  
1261 1433 if (unlikely(!writeback_in_progress(bdi)))
1262 1434 bdi_start_background_writeback(bdi);
1263 1435  
1264   - /*
1265   - * bdi_thresh is not treated as some limiting factor as
1266   - * dirty_thresh, due to reasons
1267   - * - in JBOD setup, bdi_thresh can fluctuate a lot
1268   - * - in a system with HDD and USB key, the USB key may somehow
1269   - * go into state (bdi_dirty >> bdi_thresh) either because
1270   - * bdi_dirty starts high, or because bdi_thresh drops low.
1271   - * In this case we don't want to hard throttle the USB key
1272   - * dirtiers for 100 seconds until bdi_dirty drops under
1273   - * bdi_thresh. Instead the auxiliary bdi control line in
1274   - * bdi_position_ratio() will let the dirtier task progress
1275   - * at some rate <= (write_bw / 2) for bringing down bdi_dirty.
1276   - */
1277   - bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
  1436 + if (!strictlimit)
  1437 + bdi_dirty_limits(bdi, dirty_thresh, background_thresh,
  1438 + &bdi_dirty, &bdi_thresh, NULL);
1278 1439  
1279   - /*
1280   - * In order to avoid the stacked BDI deadlock we need
1281   - * to ensure we accurately count the 'dirty' pages when
1282   - * the threshold is low.
1283   - *
1284   - * Otherwise it would be possible to get thresh+n pages
1285   - * reported dirty, even though there are thresh-m pages
1286   - * actually dirty; with m+n sitting in the percpu
1287   - * deltas.
1288   - */
1289   - if (bdi_thresh < 2 * bdi_stat_error(bdi)) {
1290   - bdi_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
1291   - bdi_dirty = bdi_reclaimable +
1292   - bdi_stat_sum(bdi, BDI_WRITEBACK);
1293   - } else {
1294   - bdi_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
1295   - bdi_dirty = bdi_reclaimable +
1296   - bdi_stat(bdi, BDI_WRITEBACK);
1297   - }
1298   -
1299 1440 dirty_exceeded = (bdi_dirty > bdi_thresh) &&
1300   - (nr_dirty > dirty_thresh);
  1441 + ((nr_dirty > dirty_thresh) || strictlimit);
1301 1442 if (dirty_exceeded && !bdi->dirty_exceeded)
1302 1443 bdi->dirty_exceeded = 1;
1303 1444