Commit bfab36e81611e60573b84eb4e4b4c8d8545b2320

Authored by Anton Altaparmakov
Committed by Linus Torvalds
1 parent f26e51f67a

NTFS: Fix a mount time deadlock.

Big thanks go to Mathias Kolehmainen for reporting the bug, providing
debug output and testing the patches I sent him to get it working.

The fix was to stop calling ntfs_attr_set() at mount time.  ntfs_attr_set()
causes balance_dirty_pages_ratelimited() to be called, which on systems with
little memory actually tries to go and balance the dirty pages.  That in turn
tries to take the s_umount semaphore, but because we are still in
fill_super(), across which the VFS holds s_umount for writing, the result is
a deadlock.

We now do the dirty work by hand by submitting individual buffers.  This has
the annoying "feature" that mounting can take a few seconds if the journal is
large as we have to clear it all.  One day someone should improve on this by
deferring the journal clearing to a helper kernel thread so it can be done in
the background, but I don't have time for this at the moment and the current
solution works fine, so I am leaving it like this for now.

Signed-off-by: Anton Altaparmakov <aia21@cantab.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
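
In outline, the by-hand buffer submission described above boils down to a
loop like the sketch below.  This is a simplified illustration, not the patch
itself: the helper name and parameters are made up for the example, and the
runlist walking, hole skipping and most error handling are left out (see the
rewritten ntfs_empty_logfile() in fs/ntfs/logfile.c further down for the real
code).

#include <linux/types.h>
#include <linux/errno.h>
#include <linux/string.h>
#include <linux/buffer_head.h>

/*
 * Fill each block in [block, end_block) with 0xff bytes and write it out
 * directly, bypassing the page cache and the dirty page accounting.  In the
 * patch this logic lives inside ntfs_empty_logfile() and the block range is
 * derived from the $LogFile runlist.
 */
static int zero_blocks_by_hand(struct super_block *sb, sector_t block,
		sector_t end_block, unsigned block_size)
{
	bool should_wait = true;

	do {
		struct buffer_head *bh;

		/* Get the buffer; its old contents do not matter. */
		bh = sb_getblk(sb, block);
		if (unlikely(!bh))
			return -ENOMEM;
		/* Set up the buffer for write i/o submission. */
		lock_buffer(bh);
		bh->b_end_io = end_buffer_write_sync;
		get_bh(bh);
		/* Set the entire contents of the buffer to 0xff. */
		memset(bh->b_data, -1, block_size);
		set_buffer_uptodate(bh);
		clear_buffer_dirty(bh);
		/* Submit directly; nothing is ever marked dirty. */
		submit_bh(WRITE, bh);
		/* Only wait for the first write to catch serious i/o errors. */
		if (should_wait) {
			should_wait = false;
			wait_on_buffer(bh);
			if (unlikely(!buffer_uptodate(bh))) {
				brelse(bh);
				return -EIO;
			}
		}
		brelse(bh);
	} while (++block < end_block);
	return 0;
}

Because the buffers go straight to submit_bh() and are never accounted as
dirty pages, the VM is never asked to balance dirty pages while fill_super()
still holds s_umount, which is what breaks the deadlock.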

Showing 9 changed files with 181 additions and 53 deletions

Documentation/filesystems/ntfs.txt
... ... @@ -407,7 +407,7 @@
407 407 device /dev/hda5
408 408 raid-disk 0
409 409 device /dev/hdb1
410   - raid-disl 1
  410 + raid-disk 1
411 411  
412 412 For linear raid, just change the raid-level above to "raid-level linear", for
413 413 mirrors, change it to "raid-level 1", and for stripe sets with parity, change
... ... @@ -457,6 +457,8 @@
457 457  
458 458 Note, a technical ChangeLog aimed at kernel hackers is in fs/ntfs/ChangeLog.
459 459  
  460 +2.1.29:
  461 + - Fix a deadlock when mounting read-write.
460 462 2.1.28:
461 463 - Fix a deadlock.
462 464 2.1.27:

fs/ntfs/ChangeLog
... ... @@ -17,6 +17,18 @@
17 17 happen is unclear however so it is worth waiting until someone hits
18 18 the problem.
19 19  
  20 +2.1.29 - Fix a deadlock at mount time.
  21 +
  22 + - During mount the VFS holds s_umount lock on the superblock. So when
  23 + we try to empty the journal $LogFile contents by calling
  24 + ntfs_attr_set() when the machine does not have much memory and the
  25 + journal is large ntfs_attr_set() results in the VM trying to balance
  26 + dirty pages which in turn tries to take the s_umount lock and thus we
  27 + get a deadlock. The solution is to not use ntfs_attr_set() and
  28 + instead do the zeroing by hand at the block level rather than page
  29 + cache level.
  30 + - Fix sparse warnings.
  31 +
20 32 2.1.28 - Fix a deadlock.
21 33  
22 34 - Fix deadlock in fs/ntfs/inode.c::ntfs_put_inode(). Thanks to Sergey

fs/ntfs/Makefile
... ... @@ -6,7 +6,7 @@
6 6 index.o inode.o mft.o mst.o namei.o runlist.o super.o sysctl.o \
7 7 unistr.o upcase.o
8 8  
9   -EXTRA_CFLAGS = -DNTFS_VERSION=\"2.1.28\"
  9 +EXTRA_CFLAGS = -DNTFS_VERSION=\"2.1.29\"
10 10  
11 11 ifeq ($(CONFIG_NTFS_DEBUG),y)
12 12 EXTRA_CFLAGS += -DDEBUG

fs/ntfs/aops.c
... ... @@ -2,7 +2,7 @@
2 2 * aops.c - NTFS kernel address space operations and page cache handling.
3 3 * Part of the Linux-NTFS project.
4 4 *
5   - * Copyright (c) 2001-2006 Anton Altaparmakov
  5 + * Copyright (c) 2001-2007 Anton Altaparmakov
6 6 * Copyright (c) 2002 Richard Russon
7 7 *
8 8 * This program/include file is free software; you can redistribute it and/or
... ... @@ -396,7 +396,7 @@
396 396 loff_t i_size;
397 397 struct inode *vi;
398 398 ntfs_inode *ni, *base_ni;
399   - u8 *kaddr;
  399 + u8 *addr;
400 400 ntfs_attr_search_ctx *ctx;
401 401 MFT_RECORD *mrec;
402 402 unsigned long flags;
... ... @@ -491,15 +491,15 @@
491 491 /* Race with shrinking truncate. */
492 492 attr_len = i_size;
493 493 }
494   - kaddr = kmap_atomic(page, KM_USER0);
  494 + addr = kmap_atomic(page, KM_USER0);
495 495 /* Copy the data to the page. */
496   - memcpy(kaddr, (u8*)ctx->attr +
  496 + memcpy(addr, (u8*)ctx->attr +
497 497 le16_to_cpu(ctx->attr->data.resident.value_offset),
498 498 attr_len);
499 499 /* Zero the remainder of the page. */
500   - memset(kaddr + attr_len, 0, PAGE_CACHE_SIZE - attr_len);
  500 + memset(addr + attr_len, 0, PAGE_CACHE_SIZE - attr_len);
501 501 flush_dcache_page(page);
502   - kunmap_atomic(kaddr, KM_USER0);
  502 + kunmap_atomic(addr, KM_USER0);
503 503 put_unm_err_out:
504 504 ntfs_attr_put_search_ctx(ctx);
505 505 unm_err_out:
... ... @@ -1344,7 +1344,7 @@
1344 1344 loff_t i_size;
1345 1345 struct inode *vi = page->mapping->host;
1346 1346 ntfs_inode *base_ni = NULL, *ni = NTFS_I(vi);
1347   - char *kaddr;
  1347 + char *addr;
1348 1348 ntfs_attr_search_ctx *ctx = NULL;
1349 1349 MFT_RECORD *m = NULL;
1350 1350 u32 attr_len;
... ... @@ -1484,14 +1484,14 @@
1484 1484 /* Shrinking cannot fail. */
1485 1485 BUG_ON(err);
1486 1486 }
1487   - kaddr = kmap_atomic(page, KM_USER0);
  1487 + addr = kmap_atomic(page, KM_USER0);
1488 1488 /* Copy the data from the page to the mft record. */
1489 1489 memcpy((u8*)ctx->attr +
1490 1490 le16_to_cpu(ctx->attr->data.resident.value_offset),
1491   - kaddr, attr_len);
  1491 + addr, attr_len);
1492 1492 /* Zero out of bounds area in the page cache page. */
1493   - memset(kaddr + attr_len, 0, PAGE_CACHE_SIZE - attr_len);
1494   - kunmap_atomic(kaddr, KM_USER0);
  1493 + memset(addr + attr_len, 0, PAGE_CACHE_SIZE - attr_len);
  1494 + kunmap_atomic(addr, KM_USER0);
1495 1495 flush_dcache_page(page);
1496 1496 flush_dcache_mft_record_page(ctx->ntfs_ino);
1497 1497 /* We are done with the page. */

fs/ntfs/attrib.c
1 1 /**
2 2 * attrib.c - NTFS attribute operations. Part of the Linux-NTFS project.
3 3 *
4   - * Copyright (c) 2001-2006 Anton Altaparmakov
  4 + * Copyright (c) 2001-2007 Anton Altaparmakov
5 5 * Copyright (c) 2002 Richard Russon
6 6 *
7 7 * This program/include file is free software; you can redistribute it and/or
... ... @@ -2500,7 +2500,7 @@
2500 2500 struct page *page;
2501 2501 u8 *kaddr;
2502 2502 pgoff_t idx, end;
2503   - unsigned int start_ofs, end_ofs, size;
  2503 + unsigned start_ofs, end_ofs, size;
2504 2504  
2505 2505 ntfs_debug("Entering for ofs 0x%llx, cnt 0x%llx, val 0x%hx.",
2506 2506 (long long)ofs, (long long)cnt, val);
... ... @@ -2548,6 +2548,8 @@
2548 2548 kunmap_atomic(kaddr, KM_USER0);
2549 2549 set_page_dirty(page);
2550 2550 page_cache_release(page);
  2551 + balance_dirty_pages_ratelimited(mapping);
  2552 + cond_resched();
2551 2553 if (idx == end)
2552 2554 goto done;
2553 2555 idx++;
... ... @@ -2604,6 +2606,8 @@
2604 2606 kunmap_atomic(kaddr, KM_USER0);
2605 2607 set_page_dirty(page);
2606 2608 page_cache_release(page);
  2609 + balance_dirty_pages_ratelimited(mapping);
  2610 + cond_resched();
2607 2611 }
2608 2612 done:
2609 2613 ntfs_debug("Done.");

fs/ntfs/file.c
1 1 /*
2 2 * file.c - NTFS kernel file operations. Part of the Linux-NTFS project.
3 3 *
4   - * Copyright (c) 2001-2006 Anton Altaparmakov
  4 + * Copyright (c) 2001-2007 Anton Altaparmakov
5 5 *
6 6 * This program/include file is free software; you can redistribute it and/or
7 7 * modify it under the terms of the GNU General Public License as published
... ... @@ -26,7 +26,6 @@
26 26 #include <linux/swap.h>
27 27 #include <linux/uio.h>
28 28 #include <linux/writeback.h>
29   -#include <linux/sched.h>
30 29  
31 30 #include <asm/page.h>
32 31 #include <asm/uaccess.h>
... ... @@ -362,7 +361,7 @@
362 361 volatile char c;
363 362  
364 363 /* Set @end to the first byte outside the last page we care about. */
365   - end = (const char __user*)PAGE_ALIGN((ptrdiff_t __user)uaddr + bytes);
  364 + end = (const char __user*)PAGE_ALIGN((unsigned long)uaddr + bytes);
366 365  
367 366 while (!__get_user(c, uaddr) && (uaddr += PAGE_SIZE, uaddr < end))
368 367 ;
... ... @@ -532,7 +531,8 @@
532 531 blocksize_bits = vol->sb->s_blocksize_bits;
533 532 u = 0;
534 533 do {
535   - struct page *page = pages[u];
  534 + page = pages[u];
  535 + BUG_ON(!page);
536 536 /*
537 537 * create_empty_buffers() will create uptodate/dirty buffers if
538 538 * the page is uptodate/dirty.
... ... @@ -1291,7 +1291,7 @@
1291 1291 size_t bytes)
1292 1292 {
1293 1293 struct page **last_page = pages + nr_pages;
1294   - char *kaddr;
  1294 + char *addr;
1295 1295 size_t total = 0;
1296 1296 unsigned len;
1297 1297 int left;
... ... @@ -1300,13 +1300,13 @@
1300 1300 len = PAGE_CACHE_SIZE - ofs;
1301 1301 if (len > bytes)
1302 1302 len = bytes;
1303   - kaddr = kmap_atomic(*pages, KM_USER0);
1304   - left = __copy_from_user_inatomic(kaddr + ofs, buf, len);
1305   - kunmap_atomic(kaddr, KM_USER0);
  1303 + addr = kmap_atomic(*pages, KM_USER0);
  1304 + left = __copy_from_user_inatomic(addr + ofs, buf, len);
  1305 + kunmap_atomic(addr, KM_USER0);
1306 1306 if (unlikely(left)) {
1307 1307 /* Do it the slow way. */
1308   - kaddr = kmap(*pages);
1309   - left = __copy_from_user(kaddr + ofs, buf, len);
  1308 + addr = kmap(*pages);
  1309 + left = __copy_from_user(addr + ofs, buf, len);
1310 1310 kunmap(*pages);
1311 1311 if (unlikely(left))
1312 1312 goto err_out;
... ... @@ -1408,26 +1408,26 @@
1408 1408 size_t *iov_ofs, size_t bytes)
1409 1409 {
1410 1410 struct page **last_page = pages + nr_pages;
1411   - char *kaddr;
  1411 + char *addr;
1412 1412 size_t copied, len, total = 0;
1413 1413  
1414 1414 do {
1415 1415 len = PAGE_CACHE_SIZE - ofs;
1416 1416 if (len > bytes)
1417 1417 len = bytes;
1418   - kaddr = kmap_atomic(*pages, KM_USER0);
1419   - copied = __ntfs_copy_from_user_iovec_inatomic(kaddr + ofs,
  1418 + addr = kmap_atomic(*pages, KM_USER0);
  1419 + copied = __ntfs_copy_from_user_iovec_inatomic(addr + ofs,
1420 1420 *iov, *iov_ofs, len);
1421   - kunmap_atomic(kaddr, KM_USER0);
  1421 + kunmap_atomic(addr, KM_USER0);
1422 1422 if (unlikely(copied != len)) {
1423 1423 /* Do it the slow way. */
1424   - kaddr = kmap(*pages);
1425   - copied = __ntfs_copy_from_user_iovec_inatomic(kaddr + ofs,
  1424 + addr = kmap(*pages);
  1425 + copied = __ntfs_copy_from_user_iovec_inatomic(addr + ofs,
1426 1426 *iov, *iov_ofs, len);
1427 1427 /*
1428 1428 * Zero the rest of the target like __copy_from_user().
1429 1429 */
1430   - memset(kaddr + ofs + copied, 0, len - copied);
  1430 + memset(addr + ofs + copied, 0, len - copied);
1431 1431 kunmap(*pages);
1432 1432 if (unlikely(copied != len))
1433 1433 goto err_out;
... ... @@ -1735,8 +1735,6 @@
1735 1735 read_unlock_irqrestore(&ni->size_lock, flags);
1736 1736 BUG_ON(initialized_size != i_size);
1737 1737 if (end > initialized_size) {
1738   - unsigned long flags;
1739   -
1740 1738 write_lock_irqsave(&ni->size_lock, flags);
1741 1739 ni->initialized_size = end;
1742 1740 i_size_write(vi, end);

fs/ntfs/inode.c
... ... @@ -34,7 +34,6 @@
34 34 #include "dir.h"
35 35 #include "debug.h"
36 36 #include "inode.h"
37   -#include "attrib.h"
38 37 #include "lcnalloc.h"
39 38 #include "malloc.h"
40 39 #include "mft.h"
... ... @@ -2500,8 +2499,6 @@
2500 2499 /* Resize the attribute record to best fit the new attribute size. */
2501 2500 if (new_size < vol->mft_record_size &&
2502 2501 !ntfs_resident_attr_value_resize(m, a, new_size)) {
2503   - unsigned long flags;
2504   -
2505 2502 /* The resize succeeded! */
2506 2503 flush_dcache_mft_record_page(ctx->ntfs_ino);
2507 2504 mark_mft_record_dirty(ctx->ntfs_ino);

fs/ntfs/logfile.c
1 1 /*
2 2 * logfile.c - NTFS kernel journal handling. Part of the Linux-NTFS project.
3 3 *
4   - * Copyright (c) 2002-2005 Anton Altaparmakov
  4 + * Copyright (c) 2002-2007 Anton Altaparmakov
5 5 *
6 6 * This program/include file is free software; you can redistribute it and/or
7 7 * modify it under the terms of the GNU General Public License as published
... ... @@ -724,24 +724,139 @@
724 724 */
725 725 bool ntfs_empty_logfile(struct inode *log_vi)
726 726 {
727   - ntfs_volume *vol = NTFS_SB(log_vi->i_sb);
  727 + VCN vcn, end_vcn;
  728 + ntfs_inode *log_ni = NTFS_I(log_vi);
  729 + ntfs_volume *vol = log_ni->vol;
  730 + struct super_block *sb = vol->sb;
  731 + runlist_element *rl;
  732 + unsigned long flags;
  733 + unsigned block_size, block_size_bits;
  734 + int err;
  735 + bool should_wait = true;
728 736  
729 737 ntfs_debug("Entering.");
730   - if (!NVolLogFileEmpty(vol)) {
731   - int err;
732   -
733   - err = ntfs_attr_set(NTFS_I(log_vi), 0, i_size_read(log_vi),
734   - 0xff);
735   - if (unlikely(err)) {
736   - ntfs_error(vol->sb, "Failed to fill $LogFile with "
737   - "0xff bytes (error code %i).", err);
738   - return false;
  738 + if (NVolLogFileEmpty(vol)) {
  739 + ntfs_debug("Done.");
  740 + return true;
  741 + }
  742 + /*
  743 + * We cannot use ntfs_attr_set() because we may be still in the middle
  744 + * of a mount operation. Thus we do the emptying by hand by first
  745 + * zapping the page cache pages for the $LogFile/$DATA attribute and
  746 + * then emptying each of the buffers in each of the clusters specified
  747 + * by the runlist by hand.
  748 + */
  749 + block_size = sb->s_blocksize;
  750 + block_size_bits = sb->s_blocksize_bits;
  751 + vcn = 0;
  752 + read_lock_irqsave(&log_ni->size_lock, flags);
  753 + end_vcn = (log_ni->initialized_size + vol->cluster_size_mask) >>
  754 + vol->cluster_size_bits;
  755 + read_unlock_irqrestore(&log_ni->size_lock, flags);
  756 + truncate_inode_pages(log_vi->i_mapping, 0);
  757 + down_write(&log_ni->runlist.lock);
  758 + rl = log_ni->runlist.rl;
  759 + if (unlikely(!rl || vcn < rl->vcn || !rl->length)) {
  760 +map_vcn:
  761 + err = ntfs_map_runlist_nolock(log_ni, vcn, NULL);
  762 + if (err) {
  763 + ntfs_error(sb, "Failed to map runlist fragment (error "
  764 + "%d).", -err);
  765 + goto err;
739 766 }
740   - /* Set the flag so we do not have to do it again on remount. */
741   - NVolSetLogFileEmpty(vol);
  767 + rl = log_ni->runlist.rl;
  768 + BUG_ON(!rl || vcn < rl->vcn || !rl->length);
742 769 }
  770 + /* Seek to the runlist element containing @vcn. */
  771 + while (rl->length && vcn >= rl[1].vcn)
  772 + rl++;
  773 + do {
  774 + LCN lcn;
  775 + sector_t block, end_block;
  776 + s64 len;
  777 +
  778 + /*
  779 + * If this run is not mapped map it now and start again as the
  780 + * runlist will have been updated.
  781 + */
  782 + lcn = rl->lcn;
  783 + if (unlikely(lcn == LCN_RL_NOT_MAPPED)) {
  784 + vcn = rl->vcn;
  785 + goto map_vcn;
  786 + }
  787 + /* If this run is not valid abort with an error. */
  788 + if (unlikely(!rl->length || lcn < LCN_HOLE))
  789 + goto rl_err;
  790 + /* Skip holes. */
  791 + if (lcn == LCN_HOLE)
  792 + continue;
  793 + block = lcn << vol->cluster_size_bits >> block_size_bits;
  794 + len = rl->length;
  795 + if (rl[1].vcn > end_vcn)
  796 + len = end_vcn - rl->vcn;
  797 + end_block = (lcn + len) << vol->cluster_size_bits >>
  798 + block_size_bits;
  799 + /* Iterate over the blocks in the run and empty them. */
  800 + do {
  801 + struct buffer_head *bh;
  802 +
  803 + /* Obtain the buffer, possibly not uptodate. */
  804 + bh = sb_getblk(sb, block);
  805 + BUG_ON(!bh);
  806 + /* Setup buffer i/o submission. */
  807 + lock_buffer(bh);
  808 + bh->b_end_io = end_buffer_write_sync;
  809 + get_bh(bh);
  810 + /* Set the entire contents of the buffer to 0xff. */
  811 + memset(bh->b_data, -1, block_size);
  812 + if (!buffer_uptodate(bh))
  813 + set_buffer_uptodate(bh);
  814 + if (buffer_dirty(bh))
  815 + clear_buffer_dirty(bh);
  816 + /*
  817 + * Submit the buffer and wait for i/o to complete but
  818 + * only for the first buffer so we do not miss really
  819 + * serious i/o errors. Once the first buffer has
  820 + * completed ignore errors afterwards as we can assume
  821 + * that if one buffer worked all of them will work.
  822 + */
  823 + submit_bh(WRITE, bh);
  824 + if (should_wait) {
  825 + should_wait = false;
  826 + wait_on_buffer(bh);
  827 + if (unlikely(!buffer_uptodate(bh)))
  828 + goto io_err;
  829 + }
  830 + brelse(bh);
  831 + } while (++block < end_block);
  832 + } while ((++rl)->vcn < end_vcn);
  833 + up_write(&log_ni->runlist.lock);
  834 + /*
  835 + * Zap the pages again just in case any got instantiated whilst we were
  836 + * emptying the blocks by hand. FIXME: We may not have completed
  837 + * writing to all the buffer heads yet so this may happen too early.
  838 + * We really should use a kernel thread to do the emptying
  839 + * asynchronously and then we can also set the volume dirty and output
  840 + * an error message if emptying should fail.
  841 + */
  842 + truncate_inode_pages(log_vi->i_mapping, 0);
  843 + /* Set the flag so we do not have to do it again on remount. */
  844 + NVolSetLogFileEmpty(vol);
743 845 ntfs_debug("Done.");
744 846 return true;
  847 +io_err:
  848 + ntfs_error(sb, "Failed to write buffer. Unmount and run chkdsk.");
  849 + goto dirty_err;
  850 +rl_err:
  851 + ntfs_error(sb, "Runlist is corrupt. Unmount and run chkdsk.");
  852 +dirty_err:
  853 + NVolSetErrors(vol);
  854 + err = -EIO;
  855 +err:
  856 + up_write(&log_ni->runlist.lock);
  857 + ntfs_error(sb, "Failed to fill $LogFile with 0xff bytes (error %d).",
  858 + -err);
  859 + return false;
745 860 }
746 861  
747 862 #endif /* NTFS_RW */

fs/ntfs/runlist.c
1 1 /**
2 2 * runlist.c - NTFS runlist handling code. Part of the Linux-NTFS project.
3 3 *
4   - * Copyright (c) 2001-2005 Anton Altaparmakov
  4 + * Copyright (c) 2001-2007 Anton Altaparmakov
5 5 * Copyright (c) 2002-2005 Richard Russon
6 6 *
7 7 * This program/include file is free software; you can redistribute it and/or
... ... @@ -1714,7 +1714,7 @@
1714 1714 sizeof(*rl));
1715 1715 /* Adjust the beginning of the tail if necessary. */
1716 1716 if (end > rl->vcn) {
1717   - s64 delta = end - rl->vcn;
  1717 + delta = end - rl->vcn;
1718 1718 rl->vcn = end;
1719 1719 rl->length -= delta;
1720 1720 /* Only adjust the lcn if it is real. */