Commit 239567033c38933c4d6f402f9f8a2126df73e4c6

Authored by Dave Chinner
Committed by Ben Myers
1 parent b121099d84

xfs: inode log reservations are too small

We've been seeing occasional problems with log space leaks and
transaction underruns such as this for some time:

 XFS (dm-0): xlog_write: reservation summary:
   trans type  = FSYNC_TS (36)
   unit res    = 2740 bytes
   current res = -4 bytes
   total reg   = 0 bytes (o/flow = 0 bytes)
   ophdrs      = 0 (ophdr space = 0 bytes)
   ophdr + reg = 0 bytes
   num regions = 0

Turns out that xfstests generic/311 is reliably reproducing this
problem with the test it runs at sequence 16 of it execution. It is
a 100% reliable reproducer with the mkfs configuration of "-b
size=1024 -m crc=1" on a 10GB scratch device.

The problem? Inode forks in btree format are logged in memory
format, not disk format (i.e. bmbt format, not bmdr format). That
means there is a btree block header being logged, when such a
structure is never written to the inode fork in bmdr format. The
bmdr header in the inode is only 4 bytes, while the bmbt header is
24 bytes for v4 filesystems and 72 bytes for v5 filesystems.

We currently reserve the inode size plus the rounded up overhead of
a logging a buffer, which is 128 bytes. That means the reservation
for a 512 byte inode is 640 bytes. What we can actually log is:

	inode core, data and attr fork = 512 bytes
	inode log format + log op header = 56 + 12 = 68 bytes
	data fork bmbt hdr = 24/72 bytes
	attr fork bmbt hdr = 24/72 bytes

So, for a v2 inodes we can log at least 628 bytes, but if we split that
inode over the end of the log across log buffers, we need to also
another log op header, which takes us to 640 bytes. If there's
another reservation taken out of this that I haven't taken into
account (perhaps multiple iclog splits?) or I haven't corectly
calculated the bmbt format space used (entirely possible), then
we will overun it.

For v3 inodes the maximum is actually 724 bytes, and even a
single maximally sized btree format fork can blow it (652 bytes).
And that's exactly what is happening with the FSYNC_TS transaction
in the above output - it's consumed 644 bytes of space after the CIL
context took the space reserved for it (2100 bytes).

This problem has always been present in the XFS code - the btree
format inode forks have always been logged in this manner. Hence
there has always been the possibility of an overrun with such a
transaction. The CRC code has just exposed it frequently enough to
be able to debug and understand the root cause....

So, let's fix all the inode log space reservations.

[ I'm so glad we spent the effort to clean up the transaction
  reservation code. This is an easy fix now. ]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

Showing 1 changed file with 53 additions and 19 deletions Side-by-side Diff

fs/xfs/xfs_trans_resv.c
... ... @@ -73,6 +73,39 @@
73 73 }
74 74  
75 75 /*
  76 + * Logging inodes is really tricksy. They are logged in memory format,
  77 + * which means that what we write into the log doesn't directly translate into
  78 + * the amount of space they use on disk.
  79 + *
  80 + * Case in point - btree format forks in memory format use more space than the
  81 + * on-disk format. In memory, the buffer contains a normal btree block header so
  82 + * the btree code can treat it as though it is just another generic buffer.
  83 + * However, when we write it to the inode fork, we don't write all of this
  84 + * header as it isn't needed. e.g. the root is only ever in the inode, so
  85 + * there's no need for sibling pointers which would waste 16 bytes of space.
  86 + *
  87 + * Hence when we have an inode with a maximally sized btree format fork, then
  88 + * amount of information we actually log is greater than the size of the inode
  89 + * on disk. Hence we need an inode reservation function that calculates all this
  90 + * correctly. So, we log:
  91 + *
  92 + * - log op headers for object
  93 + * - inode log format object
  94 + * - the entire inode contents (core + 2 forks)
  95 + * - two bmap btree block headers
  96 + */
  97 +STATIC uint
  98 +xfs_calc_inode_res(
  99 + struct xfs_mount *mp,
  100 + uint ninodes)
  101 +{
  102 + return ninodes * (sizeof(struct xlog_op_header) +
  103 + sizeof(struct xfs_inode_log_format) +
  104 + mp->m_sb.sb_inodesize +
  105 + 2 * XFS_BMBT_BLOCK_LEN(mp));
  106 +}
  107 +
  108 +/*
76 109 * Various log reservation values.
77 110 *
78 111 * These are based on the size of the file system block because that is what
... ... @@ -111,7 +144,7 @@
111 144 struct xfs_mount *mp)
112 145 {
113 146 return XFS_DQUOT_LOGRES(mp) +
114   - MAX((xfs_calc_buf_res(1, mp->m_sb.sb_inodesize) +
  147 + MAX((xfs_calc_inode_res(mp, 1) +
115 148 xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK),
116 149 XFS_FSB_TO_B(mp, 1)) +
117 150 xfs_calc_buf_res(3, mp->m_sb.sb_sectsize) +
... ... @@ -140,7 +173,7 @@
140 173 struct xfs_mount *mp)
141 174 {
142 175 return XFS_DQUOT_LOGRES(mp) +
143   - MAX((xfs_calc_buf_res(1, mp->m_sb.sb_inodesize) +
  176 + MAX((xfs_calc_inode_res(mp, 1) +
144 177 xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK) + 1,
145 178 XFS_FSB_TO_B(mp, 1))),
146 179 (xfs_calc_buf_res(9, mp->m_sb.sb_sectsize) +
... ... @@ -170,7 +203,7 @@
170 203 struct xfs_mount *mp)
171 204 {
172 205 return XFS_DQUOT_LOGRES(mp) +
173   - MAX((xfs_calc_buf_res(4, mp->m_sb.sb_inodesize) +
  206 + MAX((xfs_calc_inode_res(mp, 4) +
174 207 xfs_calc_buf_res(2 * XFS_DIROP_LOG_COUNT(mp),
175 208 XFS_FSB_TO_B(mp, 1))),
176 209 (xfs_calc_buf_res(7, mp->m_sb.sb_sectsize) +
... ... @@ -195,7 +228,7 @@
195 228 struct xfs_mount *mp)
196 229 {
197 230 return XFS_DQUOT_LOGRES(mp) +
198   - MAX((xfs_calc_buf_res(2, mp->m_sb.sb_inodesize) +
  231 + MAX((xfs_calc_inode_res(mp, 2) +
199 232 xfs_calc_buf_res(XFS_DIROP_LOG_COUNT(mp),
200 233 XFS_FSB_TO_B(mp, 1))),
201 234 (xfs_calc_buf_res(3, mp->m_sb.sb_sectsize) +
... ... @@ -220,7 +253,7 @@
220 253 struct xfs_mount *mp)
221 254 {
222 255 return XFS_DQUOT_LOGRES(mp) +
223   - MAX((xfs_calc_buf_res(2, mp->m_sb.sb_inodesize) +
  256 + MAX((xfs_calc_inode_res(mp, 2) +
224 257 xfs_calc_buf_res(XFS_DIROP_LOG_COUNT(mp),
225 258 XFS_FSB_TO_B(mp, 1))),
226 259 (xfs_calc_buf_res(5, mp->m_sb.sb_sectsize) +
... ... @@ -247,7 +280,7 @@
247 280 xfs_calc_create_resv_modify(
248 281 struct xfs_mount *mp)
249 282 {
250   - return xfs_calc_buf_res(2, mp->m_sb.sb_inodesize) +
  283 + return xfs_calc_inode_res(mp, 2) +
251 284 xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
252 285 (uint)XFS_FSB_TO_B(mp, 1) +
253 286 xfs_calc_buf_res(XFS_DIROP_LOG_COUNT(mp), XFS_FSB_TO_B(mp, 1));
... ... @@ -357,7 +390,7 @@
357 390 struct xfs_mount *mp)
358 391 {
359 392 return XFS_DQUOT_LOGRES(mp) +
360   - xfs_calc_buf_res(1, mp->m_sb.sb_inodesize) +
  393 + xfs_calc_inode_res(mp, 1) +
361 394 xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
362 395 xfs_calc_buf_res(1, XFS_FSB_TO_B(mp, 1)) +
363 396 MAX((__uint16_t)XFS_FSB_TO_B(mp, 1),
... ... @@ -378,9 +411,8 @@
378 411 struct xfs_mount *mp)
379 412 {
380 413 return XFS_DQUOT_LOGRES(mp) +
381   - mp->m_sb.sb_inodesize +
382   - mp->m_sb.sb_sectsize +
383   - 512;
  414 + xfs_calc_inode_res(mp, 1) +
  415 + xfs_calc_buf_res(1, mp->m_sb.sb_sectsize);
384 416  
385 417 }
386 418  
... ... @@ -416,7 +448,7 @@
416 448 return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
417 449 xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK),
418 450 XFS_FSB_TO_B(mp, 1)) +
419   - xfs_calc_buf_res(1, mp->m_sb.sb_inodesize) +
  451 + xfs_calc_inode_res(mp, 1) +
420 452 xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 1),
421 453 XFS_FSB_TO_B(mp, 1));
422 454 }
... ... @@ -448,7 +480,7 @@
448 480 struct xfs_mount *mp)
449 481 {
450 482 return xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
451   - xfs_calc_buf_res(2, mp->m_sb.sb_inodesize) +
  483 + xfs_calc_inode_res(mp, 2) +
452 484 xfs_calc_buf_res(1, mp->m_sb.sb_blocksize) +
453 485 xfs_calc_buf_res(1, mp->m_rsumsize);
454 486 }
... ... @@ -461,7 +493,7 @@
461 493 xfs_calc_swrite_reservation(
462 494 struct xfs_mount *mp)
463 495 {
464   - return xfs_calc_buf_res(1, mp->m_sb.sb_inodesize);
  496 + return xfs_calc_inode_res(mp, 1);
465 497 }
466 498  
467 499 /*
468 500  
... ... @@ -469,9 +501,10 @@
469 501 * inode
470 502 */
471 503 STATIC uint
472   -xfs_calc_writeid_reservation(xfs_mount_t *mp)
  504 +xfs_calc_writeid_reservation(
  505 + struct xfs_mount *mp)
473 506 {
474   - return xfs_calc_buf_res(1, mp->m_sb.sb_inodesize);
  507 + return xfs_calc_inode_res(mp, 1);
475 508 }
476 509  
477 510 /*
... ... @@ -487,7 +520,7 @@
487 520 struct xfs_mount *mp)
488 521 {
489 522 return XFS_DQUOT_LOGRES(mp) +
490   - xfs_calc_buf_res(1, mp->m_sb.sb_inodesize) +
  523 + xfs_calc_inode_res(mp, 1) +
491 524 xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
492 525 xfs_calc_buf_res(1, mp->m_dirblksize) +
493 526 xfs_calc_buf_res(XFS_DAENTER_BMAP1B(mp, XFS_DATA_FORK) + 1,
... ... @@ -511,7 +544,7 @@
511 544 xfs_calc_attrinval_reservation(
512 545 struct xfs_mount *mp)
513 546 {
514   - return MAX((xfs_calc_buf_res(1, mp->m_sb.sb_inodesize) +
  547 + return MAX((xfs_calc_inode_res(mp, 1) +
515 548 xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK),
516 549 XFS_FSB_TO_B(mp, 1))),
517 550 (xfs_calc_buf_res(9, mp->m_sb.sb_sectsize) +
... ... @@ -535,7 +568,7 @@
535 568 struct xfs_mount *mp)
536 569 {
537 570 return XFS_DQUOT_LOGRES(mp) +
538   - xfs_calc_buf_res(1, mp->m_sb.sb_inodesize) +
  571 + xfs_calc_inode_res(mp, 1) +
539 572 xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
540 573 xfs_calc_buf_res(XFS_DA_NODE_MAXDEPTH, XFS_FSB_TO_B(mp, 1));
541 574 }
... ... @@ -575,7 +608,7 @@
575 608 struct xfs_mount *mp)
576 609 {
577 610 return XFS_DQUOT_LOGRES(mp) +
578   - MAX((xfs_calc_buf_res(1, mp->m_sb.sb_inodesize) +
  611 + MAX((xfs_calc_inode_res(mp, 1) +
579 612 xfs_calc_buf_res(XFS_DA_NODE_MAXDEPTH,
580 613 XFS_FSB_TO_B(mp, 1)) +
581 614 (uint)XFS_FSB_TO_B(mp,
... ... @@ -627,6 +660,7 @@
627 660 xfs_calc_qm_dqalloc_reservation(
628 661 struct xfs_mount *mp)
629 662 {
  663 + ASSERT(M_RES(mp)->tr_write.tr_logres);
630 664 return M_RES(mp)->tr_write.tr_logres +
631 665 xfs_calc_buf_res(1,
632 666 XFS_FSB_TO_B(mp, XFS_DQUOT_CLUSTER_SIZE_FSB) - 1);