Commit 5a53748568f79641eaf40e41081a2f4987f005c2

Authored by Maxim Patlasov
Committed by Linus Torvalds
1 parent 4c3bffc272

mm/page-writeback.c: add strictlimit feature

The feature prevents mistrusted filesystems (i.e. FUSE mounts created by
unprivileged users) from growing a large number of dirty pages before
throttling.  For such filesystems balance_dirty_pages always checks bdi
counters against bdi limits.  I.e. even if the global "nr_dirty" is under
"freerun", it is not allowed to skip the bdi checks.  The only use case
for now is fuse: it sets bdi max_ratio to 1% by default and system
administrators are supposed to expect that this limit won't be exceeded.

The feature is on if a BDI is marked by BDI_CAP_STRICTLIMIT flag.  A
filesystem may set the flag when it initializes its BDI.

The problematic scenario comes from the fact that nobody pays attention to
the NR_WRITEBACK_TEMP counter (i.e.  number of pages under fuse
writeback).  The implementation of fuse writeback releases original page
(by calling end_page_writeback) almost immediately.  A fuse request queued
for real processing bears a copy of original page.  Hence, if userspace
fuse daemon doesn't finalize write requests in timely manner, an
aggressive mmap writer can pollute virtually all memory by those temporary
fuse page copies.  They are carefully accounted in NR_WRITEBACK_TEMP, but
nobody cares.

To make further explanations shorter, let me use "NR_WRITEBACK_TEMP
problem" as a shortcut for "a possibility of uncontrolled growth of the
amount of RAM consumed by temporary pages allocated by kernel fuse to
process writeback".

The problem was very easy to reproduce.  There is a trivial example
filesystem implementation in the fuse userspace distribution:
fusexmp_fh.c.  I added "sleep(1);" to the write methods, then recompiled
and mounted it.  Then I created a huge file on the mount point and ran a
simple program which mmap-ed the file to a memory region and wrote data to
the region.  An hour later I observed almost all RAM consumed by fuse
writeback.  Since then some unrelated changes in kernel fuse have made it
more difficult to reproduce, but it is still possible now.

Putting this theoretical happens-in-the-lab thing aside, there is another
thing that really hurts real-world (FUSE) users.  This is the
write-through page cache policy FUSE currently uses.  I.e. when handling
write(2), kernel fuse populates the page cache and flushes user data to
the server synchronously.  This is excessively suboptimal.  Pavel
Emelyanov's patches ("writeback cache policy") solve the problem, but they
also make resolving the NR_WRITEBACK_TEMP problem absolutely necessary.
Otherwise, simply copying a huge file to a fuse mount would result in
memory starvation.  Miklos, the maintainer of FUSE, believes the
strictlimit feature is the way to go.

And eventually, putting FUSE topics aside, there is one more use case for
the strictlimit feature.  Using a slow USB stick (mass storage) in a
machine with a huge amount of RAM installed is a well-known pain.  Let's
make a simple computation.  Assuming 64GB of RAM installed, the existing
implementation of balance_dirty_pages will start throttling only after
9.6GB of RAM becomes dirty (freerun == 15% of total RAM).  So the command
"cp 9GB_file /media/my-usb-storage/" may return in a few seconds, but the
subsequent "umount /media/my-usb-storage/" will take more than two hours
if the effective throughput of the storage is, say, 1MB/sec.

After inclusion of the strictlimit feature, it will be trivial to add a
knob (e.g.  /sys/devices/virtual/bdi/x:y/strictlimit) to enable it on
demand, manually or via a udev rule.  Maybe I'm wrong, but it seems quite
a natural desire to limit the amount of dirty memory for devices we do not
fully trust (in the sense of sustainable throughput).

[akpm@linux-foundation.org: fix warning in page-writeback.c]
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Showing 3 changed files with 206 additions and 62 deletions

fs/fuse/inode.c
... ... @@ -930,7 +930,7 @@
930 930 fc->bdi.name = "fuse";
931 931 fc->bdi.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
932 932 /* fuse does it's own writeback accounting */
933   - fc->bdi.capabilities = BDI_CAP_NO_ACCT_WB;
  933 + fc->bdi.capabilities = BDI_CAP_NO_ACCT_WB | BDI_CAP_STRICTLIMIT;
934 934  
935 935 err = bdi_init(&fc->bdi);
936 936 if (err)
include/linux/backing-dev.h
... ... @@ -243,6 +243,8 @@
243 243 * BDI_CAP_EXEC_MAP: Can be mapped for execution
244 244 *
245 245 * BDI_CAP_SWAP_BACKED: Count shmem/tmpfs objects as swap-backed.
  246 + *
  247 + * BDI_CAP_STRICTLIMIT: Keep number of dirty pages below bdi threshold.
246 248 */
247 249 #define BDI_CAP_NO_ACCT_DIRTY 0x00000001
248 250 #define BDI_CAP_NO_WRITEBACK 0x00000002
... ... @@ -254,6 +256,7 @@
254 256 #define BDI_CAP_NO_ACCT_WB 0x00000080
255 257 #define BDI_CAP_SWAP_BACKED 0x00000100
256 258 #define BDI_CAP_STABLE_WRITES 0x00000200
  259 +#define BDI_CAP_STRICTLIMIT 0x00000400
257 260  
258 261 #define BDI_CAP_VMFLAGS \
259 262 (BDI_CAP_READ_MAP | BDI_CAP_WRITE_MAP | BDI_CAP_EXEC_MAP)
mm/page-writeback.c
... ... @@ -585,6 +585,37 @@
585 585 }
586 586  
587 587 /*
  588 + * setpoint - dirty 3
  589 + * f(dirty) := 1.0 + (----------------)
  590 + * limit - setpoint
  591 + *
  592 + * it's a 3rd order polynomial that subjects to
  593 + *
  594 + * (1) f(freerun) = 2.0 => rampup dirty_ratelimit reasonably fast
  595 + * (2) f(setpoint) = 1.0 => the balance point
  596 + * (3) f(limit) = 0 => the hard limit
  597 + * (4) df/dx <= 0 => negative feedback control
  598 + * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
  599 + * => fast response on large errors; small oscillation near setpoint
  600 + */
  601 +static inline long long pos_ratio_polynom(unsigned long setpoint,
  602 + unsigned long dirty,
  603 + unsigned long limit)
  604 +{
  605 + long long pos_ratio;
  606 + long x;
  607 +
  608 + x = div_s64(((s64)setpoint - (s64)dirty) << RATELIMIT_CALC_SHIFT,
  609 + limit - setpoint + 1);
  610 + pos_ratio = x;
  611 + pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
  612 + pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
  613 + pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
  614 +
  615 + return clamp(pos_ratio, 0LL, 2LL << RATELIMIT_CALC_SHIFT);
  616 +}
  617 +
  618 +/*
588 619 * Dirty position control.
589 620 *
590 621 * (o) global/bdi setpoints
591 622  
592 623  
593 624  
594 625  
... ... @@ -682,27 +713,81 @@
682 713 /*
683 714 * global setpoint
684 715 *
685   - * setpoint - dirty 3
686   - * f(dirty) := 1.0 + (----------------)
687   - * limit - setpoint
  716 + * See comment for pos_ratio_polynom().
  717 + */
  718 + setpoint = (freerun + limit) / 2;
  719 + pos_ratio = pos_ratio_polynom(setpoint, dirty, limit);
  720 +
  721 + /*
  722 + * The strictlimit feature is a tool preventing mistrusted filesystems
  723 + * from growing a large number of dirty pages before throttling. For
  724 + * such filesystems balance_dirty_pages always checks bdi counters
  725 + * against bdi limits. Even if global "nr_dirty" is under "freerun".
  726 + * This is especially important for fuse which sets bdi->max_ratio to
  727 + * 1% by default. Without strictlimit feature, fuse writeback may
  728 + * consume arbitrary amount of RAM because it is accounted in
  729 + * NR_WRITEBACK_TEMP which is not involved in calculating "nr_dirty".
688 730 *
689   - * it's a 3rd order polynomial that subjects to
  731 + * Here, in bdi_position_ratio(), we calculate pos_ratio based on
  732 + * two values: bdi_dirty and bdi_thresh. Let's consider an example:
  733 + * total amount of RAM is 16GB, bdi->max_ratio is equal to 1%, global
  734 + * limits are set by default to 10% and 20% (background and throttle).
  735 + * Then bdi_thresh is 1% of 20% of 16GB. This amounts to ~8K pages.
  736 + * bdi_dirty_limit(bdi, bg_thresh) is about ~4K pages. bdi_setpoint is
  737 + * about ~6K pages (as the average of background and throttle bdi
  738 + * limits). The 3rd order polynomial will provide positive feedback if
  739 + * bdi_dirty is under bdi_setpoint and vice versa.
690 740 *
691   - * (1) f(freerun) = 2.0 => rampup dirty_ratelimit reasonably fast
692   - * (2) f(setpoint) = 1.0 => the balance point
693   - * (3) f(limit) = 0 => the hard limit
694   - * (4) df/dx <= 0 => negative feedback control
695   - * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
696   - * => fast response on large errors; small oscillation near setpoint
  741 + * Note that we cannot use global counters in these calculations
  742 + * because we want to throttle process writing to a strictlimit BDI
  743 + * much earlier than global "freerun" is reached (~23MB vs. ~2.3GB
  744 + * in the example above).
697 745 */
698   - setpoint = (freerun + limit) / 2;
699   - x = div_s64(((s64)setpoint - (s64)dirty) << RATELIMIT_CALC_SHIFT,
700   - limit - setpoint + 1);
701   - pos_ratio = x;
702   - pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
703   - pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
704   - pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
  746 + if (unlikely(bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
  747 + long long bdi_pos_ratio;
  748 + unsigned long bdi_bg_thresh;
705 749  
  750 + if (bdi_dirty < 8)
  751 + return min_t(long long, pos_ratio * 2,
  752 + 2 << RATELIMIT_CALC_SHIFT);
  753 +
  754 + if (bdi_dirty >= bdi_thresh)
  755 + return 0;
  756 +
  757 + bdi_bg_thresh = div_u64((u64)bdi_thresh * bg_thresh, thresh);
  758 + bdi_setpoint = dirty_freerun_ceiling(bdi_thresh,
  759 + bdi_bg_thresh);
  760 +
  761 + if (bdi_setpoint == 0 || bdi_setpoint == bdi_thresh)
  762 + return 0;
  763 +
  764 + bdi_pos_ratio = pos_ratio_polynom(bdi_setpoint, bdi_dirty,
  765 + bdi_thresh);
  766 +
  767 + /*
  768 + * Typically, for strictlimit case, bdi_setpoint << setpoint
  769 + * and pos_ratio >> bdi_pos_ratio. In other words, the global
  770 + * state ("dirty") is not the limiting factor and we have to
  771 + * make decision based on bdi counters. But there is an
  772 + * important case when global pos_ratio should get precedence:
  773 + * global limits are exceeded (e.g. due to activities on other
  774 + * BDIs) while given strictlimit BDI is below limit.
  775 + *
  776 + * "pos_ratio * bdi_pos_ratio" would work for the case above,
  777 + * but it would look too non-natural for the case of all
  778 + * activity in the system coming from a single strictlimit BDI
  779 + * with bdi->max_ratio == 100%.
  780 + *
  781 + * Note that min() below somewhat changes the dynamics of the
  782 + * control system. Normally, pos_ratio value can be well over 3
  783 + * (when globally we are at freerun and bdi is well below bdi
  784 + * setpoint). Now the maximum pos_ratio in the same situation
  785 + * is 2. We might want to tweak this if we observe the control
  786 + * system is too slow to adapt.
  787 + */
  788 + return min(pos_ratio, bdi_pos_ratio);
  789 + }
  790 +
706 791 /*
707 792 * We have computed basic pos_ratio above based on global situation. If
708 793 * the bdi is over/under its share of dirty pages, we want to scale
... ... @@ -994,6 +1079,27 @@
994 1079 * keep that period small to reduce time lags).
995 1080 */
996 1081 step = 0;
  1082 +
  1083 + /*
  1084 + * For strictlimit case, calculations above were based on bdi counters
  1085 + * and limits (starting from pos_ratio = bdi_position_ratio() and up to
  1086 + * balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate).
  1087 + * Hence, to calculate "step" properly, we have to use bdi_dirty as
  1088 + * "dirty" and bdi_setpoint as "setpoint".
  1089 + *
  1090 + * We rampup dirty_ratelimit forcibly if bdi_dirty is low because
  1091 + * it's possible that bdi_thresh is close to zero due to inactivity
  1092 + * of backing device (see the implementation of bdi_dirty_limit()).
  1093 + */
  1094 + if (unlikely(bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
  1095 + dirty = bdi_dirty;
  1096 + if (bdi_dirty < 8)
  1097 + setpoint = bdi_dirty + 1;
  1098 + else
  1099 + setpoint = (bdi_thresh +
  1100 + bdi_dirty_limit(bdi, bg_thresh)) / 2;
  1101 + }
  1102 +
997 1103 if (dirty < setpoint) {
998 1104 x = min(bdi->balanced_dirty_ratelimit,
999 1105 min(balanced_dirty_ratelimit, task_ratelimit));
... ... @@ -1198,6 +1304,56 @@
1198 1304 return pages >= DIRTY_POLL_THRESH ? 1 + t / 2 : t;
1199 1305 }
1200 1306  
  1307 +static inline void bdi_dirty_limits(struct backing_dev_info *bdi,
  1308 + unsigned long dirty_thresh,
  1309 + unsigned long background_thresh,
  1310 + unsigned long *bdi_dirty,
  1311 + unsigned long *bdi_thresh,
  1312 + unsigned long *bdi_bg_thresh)
  1313 +{
  1314 + unsigned long bdi_reclaimable;
  1315 +
  1316 + /*
  1317 + * bdi_thresh is not treated as some limiting factor as
  1318 + * dirty_thresh, due to reasons
  1319 + * - in JBOD setup, bdi_thresh can fluctuate a lot
  1320 + * - in a system with HDD and USB key, the USB key may somehow
  1321 + * go into state (bdi_dirty >> bdi_thresh) either because
  1322 + * bdi_dirty starts high, or because bdi_thresh drops low.
  1323 + * In this case we don't want to hard throttle the USB key
  1324 + * dirtiers for 100 seconds until bdi_dirty drops under
  1325 + * bdi_thresh. Instead the auxiliary bdi control line in
  1326 + * bdi_position_ratio() will let the dirtier task progress
  1327 + * at some rate <= (write_bw / 2) for bringing down bdi_dirty.
  1328 + */
  1329 + *bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
  1330 +
  1331 + if (bdi_bg_thresh)
  1332 + *bdi_bg_thresh = div_u64((u64)*bdi_thresh *
  1333 + background_thresh,
  1334 + dirty_thresh);
  1335 +
  1336 + /*
  1337 + * In order to avoid the stacked BDI deadlock we need
  1338 + * to ensure we accurately count the 'dirty' pages when
  1339 + * the threshold is low.
  1340 + *
  1341 + * Otherwise it would be possible to get thresh+n pages
  1342 + * reported dirty, even though there are thresh-m pages
  1343 + * actually dirty; with m+n sitting in the percpu
  1344 + * deltas.
  1345 + */
  1346 + if (*bdi_thresh < 2 * bdi_stat_error(bdi)) {
  1347 + bdi_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
  1348 + *bdi_dirty = bdi_reclaimable +
  1349 + bdi_stat_sum(bdi, BDI_WRITEBACK);
  1350 + } else {
  1351 + bdi_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
  1352 + *bdi_dirty = bdi_reclaimable +
  1353 + bdi_stat(bdi, BDI_WRITEBACK);
  1354 + }
  1355 +}
  1356 +
1201 1357 /*
1202 1358 * balance_dirty_pages() must be called by processes which are generating dirty
1203 1359 * data. It looks at the number of dirty pages in the machine and will force
1204 1360  
1205 1361  
... ... @@ -1209,13 +1365,9 @@
1209 1365 unsigned long pages_dirtied)
1210 1366 {
1211 1367 unsigned long nr_reclaimable; /* = file_dirty + unstable_nfs */
1212   - unsigned long bdi_reclaimable;
1213 1368 unsigned long nr_dirty; /* = file_dirty + writeback + unstable_nfs */
1214   - unsigned long bdi_dirty;
1215   - unsigned long freerun;
1216 1369 unsigned long background_thresh;
1217 1370 unsigned long dirty_thresh;
1218   - unsigned long bdi_thresh;
1219 1371 long period;
1220 1372 long pause;
1221 1373 long max_pause;
1222 1374  
... ... @@ -1226,10 +1378,16 @@
1226 1378 unsigned long dirty_ratelimit;
1227 1379 unsigned long pos_ratio;
1228 1380 struct backing_dev_info *bdi = mapping->backing_dev_info;
  1381 + bool strictlimit = bdi->capabilities & BDI_CAP_STRICTLIMIT;
1229 1382 unsigned long start_time = jiffies;
1230 1383  
1231 1384 for (;;) {
1232 1385 unsigned long now = jiffies;
  1386 + unsigned long uninitialized_var(bdi_thresh);
  1387 + unsigned long thresh;
  1388 + unsigned long uninitialized_var(bdi_dirty);
  1389 + unsigned long dirty;
  1390 + unsigned long bg_thresh;
1233 1391  
1234 1392 /*
1235 1393 * Unstable writes are a feature of certain networked
1236 1394  
1237 1395  
1238 1396  
1239 1397  
1240 1398  
1241 1399  
... ... @@ -1243,61 +1401,44 @@
1243 1401  
1244 1402 global_dirty_limits(&background_thresh, &dirty_thresh);
1245 1403  
  1404 + if (unlikely(strictlimit)) {
  1405 + bdi_dirty_limits(bdi, dirty_thresh, background_thresh,
  1406 + &bdi_dirty, &bdi_thresh, &bg_thresh);
  1407 +
  1408 + dirty = bdi_dirty;
  1409 + thresh = bdi_thresh;
  1410 + } else {
  1411 + dirty = nr_dirty;
  1412 + thresh = dirty_thresh;
  1413 + bg_thresh = background_thresh;
  1414 + }
  1415 +
1246 1416 /*
1247 1417 * Throttle it only when the background writeback cannot
1248 1418 * catch-up. This avoids (excessively) small writeouts
1249   - * when the bdi limits are ramping up.
  1419 + * when the bdi limits are ramping up in case of !strictlimit.
  1420 + *
  1421 + * In strictlimit case make decision based on the bdi counters
  1422 + * and limits. Small writeouts when the bdi limits are ramping
  1423 + * up are the price we consciously pay for strictlimit-ing.
1250 1424 */
1251   - freerun = dirty_freerun_ceiling(dirty_thresh,
1252   - background_thresh);
1253   - if (nr_dirty <= freerun) {
  1425 + if (dirty <= dirty_freerun_ceiling(thresh, bg_thresh)) {
1254 1426 current->dirty_paused_when = now;
1255 1427 current->nr_dirtied = 0;
1256 1428 current->nr_dirtied_pause =
1257   - dirty_poll_interval(nr_dirty, dirty_thresh);
  1429 + dirty_poll_interval(dirty, thresh);
1258 1430 break;
1259 1431 }
1260 1432  
1261 1433 if (unlikely(!writeback_in_progress(bdi)))
1262 1434 bdi_start_background_writeback(bdi);
1263 1435  
1264   - /*
1265   - * bdi_thresh is not treated as some limiting factor as
1266   - * dirty_thresh, due to reasons
1267   - * - in JBOD setup, bdi_thresh can fluctuate a lot
1268   - * - in a system with HDD and USB key, the USB key may somehow
1269   - * go into state (bdi_dirty >> bdi_thresh) either because
1270   - * bdi_dirty starts high, or because bdi_thresh drops low.
1271   - * In this case we don't want to hard throttle the USB key
1272   - * dirtiers for 100 seconds until bdi_dirty drops under
1273   - * bdi_thresh. Instead the auxiliary bdi control line in
1274   - * bdi_position_ratio() will let the dirtier task progress
1275   - * at some rate <= (write_bw / 2) for bringing down bdi_dirty.
1276   - */
1277   - bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
  1436 + if (!strictlimit)
  1437 + bdi_dirty_limits(bdi, dirty_thresh, background_thresh,
  1438 + &bdi_dirty, &bdi_thresh, NULL);
1278 1439  
1279   - /*
1280   - * In order to avoid the stacked BDI deadlock we need
1281   - * to ensure we accurately count the 'dirty' pages when
1282   - * the threshold is low.
1283   - *
1284   - * Otherwise it would be possible to get thresh+n pages
1285   - * reported dirty, even though there are thresh-m pages
1286   - * actually dirty; with m+n sitting in the percpu
1287   - * deltas.
1288   - */
1289   - if (bdi_thresh < 2 * bdi_stat_error(bdi)) {
1290   - bdi_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
1291   - bdi_dirty = bdi_reclaimable +
1292   - bdi_stat_sum(bdi, BDI_WRITEBACK);
1293   - } else {
1294   - bdi_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
1295   - bdi_dirty = bdi_reclaimable +
1296   - bdi_stat(bdi, BDI_WRITEBACK);
1297   - }
1298   -
1299 1440 dirty_exceeded = (bdi_dirty > bdi_thresh) &&
1300   - (nr_dirty > dirty_thresh);
  1441 + ((nr_dirty > dirty_thresh) || strictlimit);
1301 1442 if (dirty_exceeded && !bdi->dirty_exceeded)
1302 1443 bdi->dirty_exceeded = 1;
1303 1444