27 Aug, 2011

1 commit


26 Aug, 2011

1 commit

  • Purely in-memory filesystems do not use the inode hash, as the dcache
    tells us if an entry already exists. As a result, they do not call
    unlock_new_inode, and thus directory inodes do not get put into a
    different lockdep class for i_mutex.

    We need the different lockdep classes, because the locking order for
    i_mutex is different for directory inodes and regular inodes. Directory
    inodes can do "readdir()", which takes i_mutex *before* possibly taking
    mm->mmap_sem (due to a page fault while copying the directory entry to
    user space).

    In contrast, regular inodes can be mmap'ed, which takes mm->mmap_sem
    before accessing i_mutex.

    The two cases can never happen for the same inode, so no real deadlock
    can occur, but without the different lockdep classes, lockdep cannot
    understand that. As a result, if CONFIG_DEBUG_LOCK_ALLOC is set, this
    can lead to false positives from lockdep like below:

    find/645 is trying to acquire lock:
    (&mm->mmap_sem){++++++}, at: [] might_fault+0x5c/0xac

    but task is already holding lock:
    (&sb->s_type->i_mutex_key#15){+.+.+.}, at: []
    vfs_readdir+0x5b/0xb4

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (&sb->s_type->i_mutex_key#15){+.+.+.}:
    [] lock_acquire+0xbf/0x103
    [] __mutex_lock_common+0x4c/0x361
    [] mutex_lock_nested+0x40/0x45
    [] hugetlbfs_file_mmap+0x82/0x110
    [] mmap_region+0x258/0x432
    [] do_mmap_pgoff+0x2ac/0x306
    [] sys_mmap_pgoff+0x118/0x16a
    [] sys_mmap+0x22/0x24
    [] system_call_fastpath+0x16/0x1b

    -> #0 (&mm->mmap_sem){++++++}:
    [] __lock_acquire+0xa1a/0xcf7
    [] lock_acquire+0xbf/0x103
    [] might_fault+0x89/0xac
    [] filldir+0x6f/0xc7
    [] dcache_readdir+0x67/0x205
    [] vfs_readdir+0x7b/0xb4
    [] sys_getdents+0x7e/0xd1
    [] system_call_fastpath+0x16/0x1b

    This patch moves the directory vs file lockdep annotation into a helper
    function that can be called by in-memory filesystems and has hugetlbfs
    call it.

    Signed-off-by: Josh Boyer
    Acked-by: Peter Zijlstra
    Signed-off-by: Linus Torvalds

    Josh Boyer
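
The lock-ordering argument above can be illustrated with a small userspace sketch. It is not the kernel's lockdep; the class names, `record_order` and `replay` are hypothetical and only model "which class was taken while holding which". With a single i_mutex class, the mmap and readdir orders look like an inversion; with a separate directory class they do not.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical lock classes; the real lockdep keys are kernel-internal. */
enum lock_class { MMAP_SEM, I_MUTEX, I_MUTEX_DIR, NCLASS };

/* dep[a][b] is true once "b was taken while holding a" has been seen. */
struct lock_graph { bool dep[NCLASS][NCLASS]; };

/* Depth-first search: is `to` reachable from `from` along recorded edges? */
static bool reaches(const struct lock_graph *g, int from, int to, bool *seen)
{
    if (from == to)
        return true;
    seen[from] = true;
    for (int next = 0; next < NCLASS; next++)
        if (g->dep[from][next] && !seen[next] && reaches(g, next, to, seen))
            return true;
    return false;
}

/* Record "held, then taken"; return false if that would close a cycle. */
static bool record_order(struct lock_graph *g, int held, int taken)
{
    bool seen[NCLASS] = { false };

    if (reaches(g, taken, held, seen))
        return false;   /* the point where a lockdep-style checker warns */
    g->dep[held][taken] = true;
    return true;
}

/* Replay mmap on a regular file, then readdir on a directory. */
static bool replay(bool split_dir_class)
{
    struct lock_graph g;
    int dir_class = split_dir_class ? I_MUTEX_DIR : I_MUTEX;
    bool ok;

    memset(&g, 0, sizeof(g));
    ok = record_order(&g, MMAP_SEM, I_MUTEX);         /* mmap: mmap_sem, then i_mutex */
    ok = record_order(&g, dir_class, MMAP_SEM) && ok; /* readdir: i_mutex, then mmap_sem */
    return ok;
}
```

Calling `replay(false)` models the false positive (both inode kinds share one class), while `replay(true)` models the annotated case where directories get their own class.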
     

25 Aug, 2011

1 commit


24 Aug, 2011

2 commits


23 Aug, 2011

1 commit

  • The code really requires the current source directory to be in the
    header search path. We already do this if building with an object
    tree separate from the source, but it needs to be added manually
    if building inside the source. The cflags addition for it accidentally
    got removed when collapsing the xfs directory structure.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     

21 Aug, 2011

2 commits

  • This fixes a regression introduced by commit cdcb725c05fe ("Btrfs: check
    if there is enough space for balancing smarter"). We can't do 64-bit
    divides on 32-bit architectures.

    In cases where we need to multiply or divide by 2, we should just left-
    or right-shift, respectively, and in cases where there are N devices,
    use do_div. Also make the counters u64 to match rw_devices.
    Thanks,

    Signed-off-by: Josef Bacik
    Acked-and-tested-by: Ingo Molnar
    Signed-off-by: Linus Torvalds

    Josef Bacik
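
A minimal userspace sketch of the two techniques named above. `do_div_sketch` is a hypothetical stand-in that only mirrors the calling convention of the kernel's `do_div()` (quotient stored back into the dividend, remainder returned); it uses the plain C operators internally, which is fine in userspace but is exactly what the kernel avoids on 32-bit architectures. The helper names are assumptions for illustration.

```c
#include <stdint.h>

/*
 * Stand-in with the same shape as do_div(): divide a 64-bit dividend by a
 * 32-bit divisor, store the quotient in *n, and return the remainder.
 */
static uint32_t do_div_sketch(uint64_t *n, uint32_t base)
{
    uint32_t rem = (uint32_t)(*n % base);

    *n /= base;
    return rem;
}

/* Dividing or multiplying by 2 needs no division at all. */
static uint64_t half(uint64_t bytes)  { return bytes >> 1; }
static uint64_t twice(uint64_t bytes) { return bytes << 1; }

/* Splitting a 64-bit byte count across N devices goes through the helper. */
static uint64_t per_device(uint64_t bytes, uint32_t ndevs)
{
    do_div_sketch(&bytes, ndevs);   /* bytes now holds the quotient */
    return bytes;
}
```

For example, `per_device(10ULL << 32, 4)` yields the same value as the 64-bit division, while keeping the divisor 32 bits wide as `do_div()` requires.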
     
  • * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    ext4: flush any pending end_io requests before DIO reads w/dioread_nolock
    ext4: fix nomblk_io_submit option so it correctly converts uninit blocks
    ext4: Resolve the hang of direct i/o read in handling EXT4_IO_END_UNWRITTEN.
    ext4: call ext4_ioend_wait and ext4_flush_completed_IO in ext4_evict_inode
    ext4: Fix ext4_should_writeback_data() for no-journal mode

    Linus Torvalds
     

20 Aug, 2011

1 commit

  • There is a race between ext4 buffer write and direct_IO read with
    dioread_nolock mount option enabled. The problem is that we clear
    PageWriteback flag during end_io time but will do
    uninitialized-to-initialized extent conversion later with dioread_nolock.
    If an O_direct read request comes in during this period, ext4 will return
    zero instead of the recently written data.

    This patch checks whether there are any pending uninitialized-to-initialized
    extent conversion requests before doing O_direct read to close the race.
    Note that this is just a bandaid fix. The fundamental issue is that we
    clear PageWriteback flag before we really complete an IO, which is
    problem-prone. To fix the fundamental issue, we may need to implement an
    extent tree cache that we can use to look up pending to-be-converted extents.

    Signed-off-by: Jiaying Zhang
    Signed-off-by: "Theodore Ts'o"
    Cc: stable@kernel.org

    Jiaying Zhang
     

19 Aug, 2011

6 commits


18 Aug, 2011

5 commits


17 Aug, 2011

13 commits

  • FAT16 supports a maximum volume/file size of 4GB with a 64KB cluster size.

    Win NT/XP/7 increased the maximum cluster size to 64KB, and the maximum
    file/vol size increased to 4GB as well. Despite this increase, the file
    size on Linux FAT is still limited to 2GB.

    I found that it is limited by sb->s_maxbytes (0x7fffffff) when the
    partition is formatted as FAT16. sb->s_maxbytes in fill_super should be
    set to 0xffffffff, as for FAT32.

    Signed-off-by: Namjae Jeon
    Signed-off-by: OGAWA Hirofumi

    Namjae Jeon
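
To make the two limits concrete, here is a small sketch of the size check. The macro and function names are hypothetical; only the two constants come from the commit message.

```c
#include <stdint.h>

#define OLD_FAT16_MAXBYTES 0x7fffffffULL   /* 2 GiB - 1, the previous limit */
#define NEW_FAT16_MAXBYTES 0xffffffffULL   /* 4 GiB - 1, the same as FAT32 */

/* Hypothetical check: does a file of this size fit under s_maxbytes? */
static int size_allowed(uint64_t file_size, uint64_t s_maxbytes)
{
    return file_size <= s_maxbytes;
}
```

A 3 GiB file (`3ULL << 30`) fails the check against the old limit and passes against the new one.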
     
  • The fat_msg function already formats the given message and appends
    a newline to it - we don't need to do this in the passed message
    string as well, or we will end up with a blank line printed in the
    kernel log ring buffer.

    Also change the loglevel from error to warning.

    Signed-off-by: Mihai Moldovan
    Signed-off-by: OGAWA Hirofumi

    Mihai Moldovan
     
  • This fixes a compile warning (uninitialized variable) in
    the fat filesystem code.

    Signed-off-by: Jonas Aberg
    Signed-off-by: Lee Jones
    Signed-off-by: Linus Walleij
    Signed-off-by: OGAWA Hirofumi

    Jonas Aberg
     
  • We need to truncate page cache pages for the clone ioctl target range or
    else we'll confuse ourselves to no end. If the old data was cached, we
    used to still see it (until remount). If the page was partially updated
    we used to get a mix of old and new data.

    Signed-off-by: Sage Weil
    Signed-off-by: Chris Mason

    Sage Weil
     
  • sync_pending is used before it is initialized; fix it.

    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie
     
  • Btrfs subtracted the size of the allocated space twice when it allocated
    the space from the bitmap in the cluster. This corrupted the free space
    information and eventually led to an oops.

    This patch also fixes a bug where ctl->free_space was decremented
    without holding the lock.

    Reported-by: Liu Bo
    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie
     
  • We don't use the defrag struct on this path.

    Signed-off-by: Dan Carpenter
    Signed-off-by: Chris Mason

    Dan Carpenter
     
  • We've stopped using highmem for extent buffers.

    Signed-off-by: Li Zefan
    Signed-off-by: Chris Mason

    Li Zefan
     
  • The filesystem now turns read-only instead of returning the error to the
    caller when an error is detected in btrfs_drop_snapshot().
    Because the caller doesn't check the error, the function's return type
    is changed to 'void'.

    Signed-off-by: Tsutomu Itoh
    Signed-off-by: Chris Mason

    Tsutomu Itoh
     
  • When checking if there is enough space for balancing a block group,
    we do not take raid types into consideration, so we do not account for
    the correct amount of space that we need. This makes us do some extra
    work before we get ENOSPC.

    Signed-off-by: Liu Bo
    Signed-off-by: Chris Mason

    liubo
     
  • When balancing, we'll first try to shrink devices for some space,
    but if it is working on a full multi-disk partition with raid protection,
    we may encounter a bug: while shrinking, total_bytes may be less
    than bytes_used, and btrfs may allocate a dev extent that extends beyond
    the device's bounds.

    Then we will not be able to write or read the data stored at the end
    of the device, and get the following:

    device fsid 0939f071-7ea3-46c8-95df-f176d773bfb6 devid 1 transid 10 /dev/sdb5
    Btrfs detected SSD devices, enabling SSD mode
    btrfs: relocating block group 476315648 flags 9
    btrfs: found 4 extents
    attempt to access beyond end of device
    sdb5: rw=145, want=546176, limit=546147
    attempt to access beyond end of device
    sdb5: rw=145, want=546304, limit=546147
    attempt to access beyond end of device
    sdb5: rw=145, want=546432, limit=546147
    attempt to access beyond end of device
    sdb5: rw=145, want=546560, limit=546147
    attempt to access beyond end of device

    Signed-off-by: Liu Bo
    Signed-off-by: Chris Mason

    liubo
     
  • When btrfs recovers from a crash, it may hit the oops below:

    ------------[ cut here ]------------
    kernel BUG at fs/btrfs/inode.c:4580!
    [...]
    RIP: 0010:[] [] btrfs_add_link+0x161/0x1c0 [btrfs]
    [...]
    Call Trace:
    [] ? btrfs_inode_ref_index+0x31/0x80 [btrfs]
    [] add_inode_ref+0x319/0x3f0 [btrfs]
    [] replay_one_buffer+0x2c7/0x390 [btrfs]
    [] walk_down_log_tree+0x32a/0x480 [btrfs]
    [] walk_log_tree+0xf5/0x240 [btrfs]
    [] btrfs_recover_log_trees+0x250/0x350 [btrfs]
    [] ? btrfs_recover_log_trees+0x350/0x350 [btrfs]
    [] open_ctree+0x1442/0x17d0 [btrfs]
    [...]

    This happens because, while replaying an inode ref item, we forget to
    check for old conflicting DIR_ITEM and DIR_INDEX items in the fs/file
    tree, so we hit conflicting cases that lead to BUG_ON().

    Signed-off-by: Liu Bo
    Tested-by: Andy Lutomirski
    Signed-off-by: Chris Mason

    liubo
     
  • We have a problem where if a user specifies discard but the device doesn't
    actually support it we will return EOPNOTSUPP from btrfs_discard_extent. This
    is a problem because this gets called (in a fashion) from the tree log recovery
    code, which has a nice little BUG_ON(ret) after it, which causes us to fail the
    tree log replay. So instead detect whether our devices support discard when
    we're adding them and then don't issue discards if we know that the device
    doesn't support it. And just for good measure set ret = 0 in btrfs_issue_discard
    just in case we still get EOPNOTSUPP so we don't screw anybody up like this
    again. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
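
The two-layer approach described above (probe once, skip unsupported devices, and still tolerate EOPNOTSUPP) might look roughly like this. The struct and function names are hypothetical and do not match the real btrfs code; `lowlevel_discard` merely stands in for the block layer.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical device record; the real btrfs structures differ. */
struct dev_sketch {
    bool can_discard;    /* probed once, when the device is added */
    int  discards_sent;  /* counts requests that reached the device */
};

/* Stand-in for the block layer: fails on devices without discard support. */
static int lowlevel_discard(struct dev_sketch *dev)
{
    if (!dev->can_discard)
        return -EOPNOTSUPP;
    dev->discards_sent++;
    return 0;
}

/* Skip devices known not to support discard; never report EOPNOTSUPP. */
static int issue_discard(struct dev_sketch *dev)
{
    int ret;

    if (!dev->can_discard)
        return 0;               /* nothing to do, and not an error */
    ret = lowlevel_discard(dev);
    if (ret == -EOPNOTSUPP)
        ret = 0;                /* belt and braces: treat as success */
    return ret;
}
```

With this shape, a caller that asserts on a non-zero return (as the tree log replay does with BUG_ON(ret)) is no longer tripped by a device that simply lacks discard.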
     

16 Aug, 2011

1 commit

  • Running the cthon tests on a recent kernel caused this message to pop
    up occasionally:

    CIFS VFS: did not end path lookup where expected namelen is 0

    Some added debugging showed that namelen and dfsplen were both 0 when
    this occurred. That means that the read_seqretry returned true.

    Assuming that the comment inside the if statement is true, this should
    be harmless and just means that we raced with a rename. If that is the
    case, then there's no need for alarm and we can demote this to cFYI.

    While we're at it, print the dfsplen too so that we can see what
    happened here if the message pops during debugging.

    Cc: stable@kernel.org
    Cc: Al Viro
    Signed-off-by: Jeff Layton
    Signed-off-by: Steve French

    Jeff Layton
     

15 Aug, 2011

2 commits


14 Aug, 2011

3 commits

  • Bug discovered by Jan Kara:

    Finally, commit 1449032be17abb69116dbc393f67ceb8bd034f92 returned back
    the old IO submission code but apparently it forgot to return the old
    handling of uninitialized buffers so we unconditionally call
    block_write_full_page() without specifying end_io function. So AFAICS
    we never convert unwritten extents to written in some cases. For
    example when I mount the fs as: mount -t ext4 -o
    nomblk_io_submit,dioread_nolock /dev/ubdb /mnt and do
    int fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0600);
    char buf[1024];
    memset(buf, 'a', sizeof(buf));
    fallocate(fd, 0, 0, 16384);
    write(fd, buf, sizeof(buf));

    I get a file full of zeros (after remounting the filesystem so that
    pagecache is dropped) instead of seeing the first KB contain 'a's.

    Signed-off-by: "Theodore Ts'o"
    Cc: stable@kernel.org

    Theodore Ts'o
     
  • Setting the EXT4_IO_END_UNWRITTEN flag and increasing i_aiodio_unwritten
    should be done together, since ext4_end_io_nolock always clears
    the flag and decreases the counter at the same time.

    We don't increase i_aiodio_unwritten when setting
    EXT4_IO_END_UNWRITTEN, so it will go negative and cause some processes
    to wait forever.

    Part of the patch came from Eric in his e-mail, but it doesn't actually
    fix the problem Michael hit.

    http://marc.info/?l=linux-ext4&m=131316851417460&w=2

    Reported-and-Tested-by: Michael Tokarev
    Signed-off-by: Eric Sandeen
    Signed-off-by: Tao Ma
    Signed-off-by: "Theodore Ts'o"
    Cc: stable@kernel.org

    Tao Ma
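
The pairing rule above can be sketched in userspace with C11 atomics. The structures, the flag value and the function names are hypothetical; the point is only that the flag and the counter always change together, so the counter cannot go negative.

```c
#include <stdatomic.h>

#define IO_END_UNWRITTEN 0x1   /* hypothetical flag value */

struct inode_sketch  { atomic_int aiodio_unwritten; };
struct io_end_sketch { unsigned int flag; struct inode_sketch *inode; };

/* Set the flag and bump the counter as one step, at most once per io_end. */
static void mark_unwritten(struct io_end_sketch *io)
{
    if (!(io->flag & IO_END_UNWRITTEN)) {
        io->flag |= IO_END_UNWRITTEN;
        atomic_fetch_add(&io->inode->aiodio_unwritten, 1);
    }
}

/* Completion side: clear the flag and drop the counter together. */
static void end_io(struct io_end_sketch *io)
{
    if (io->flag & IO_END_UNWRITTEN) {
        io->flag &= ~IO_END_UNWRITTEN;
        atomic_fetch_sub(&io->inode->aiodio_unwritten, 1);
    }
}
```

Calling `mark_unwritten()` twice followed by one `end_io()` leaves the counter at zero, whereas setting the flag without the increment would leave it at -1 after completion.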
     
  • Flush inode's i_completed_io_list before calling ext4_ioend_wait to
    prevent the following deadlock scenario: A page fault happens while
    some process is writing inode A. During page fault,
    shrink_icache_memory is called that in turn evicts another inode
    B. Inode B has some pending io_end work so it calls ext4_ioend_wait()
    that waits for inode B's i_ioend_count to become zero. However, inode
    B's ioend work was queued behind some of inode A's ioend work on the
    same cpu's ext4-dio-unwritten workqueue. As the ext4-dio-unwritten
    thread on that cpu is processing inode A's ioend work, it tries to
    grab inode A's i_mutex lock. Since the i_mutex lock of inode A is
    still held from before the page fault, we enter a deadlock.

    This patch also moves ext4_flush_completed_IO and ext4_ioend_wait from
    ext4_destroy_inode() to ext4_evict_inode(). During inode deletion,
    ext4_evict_inode() is called before ext4_destroy_inode() and in
    ext4_evict_inode(), we may call ext4_truncate() without holding
    i_mutex lock. As a result, there is a race between flush_completed_IO
    that is called from ext4_ext_truncate() and ext4_end_io_work, which
    may cause corruption on an io_end structure. This change moves
    ext4_flush_completed_IO and ext4_ioend_wait from ext4_destroy_inode()
    to ext4_evict_inode() to resolve the race between ext4_truncate() and
    ext4_end_io_work during inode deletion.

    Signed-off-by: Jiaying Zhang
    Signed-off-by: "Theodore Ts'o"
    Cc: stable@kernel.org

    Jiaying Zhang
     

13 Aug, 2011

1 commit

  • ext4_should_writeback_data() had an incorrect sequence of
    tests to determine if it should return 0 or 1: in
    particular, even in no-journal mode, 0 was being returned
    for a non-regular-file inode.

    This meant that, in non-journal mode, we would use
    ext4_journalled_aops for directories, symlinks, and other
    non-regular files. However, calling journalled aop
    callbacks when there is no valid handle can cause problems.

    This would cause a kernel crash with Jan Kara's commit
    2d859db3e4 ("ext4: fix data corruption in inodes with
    journalled data"), because we now dereference 'handle' in
    ext4_journalled_write_end().

    I also added BUG_ONs to check for a valid handle in the
    obviously journal-only aops callbacks.

    I tested this running xfstests with a scratch device in
    these modes:

    - no-journal
    - data=ordered
    - data=writeback
    - data=journal

    All work fine; the data=journal run has many failures and a
    crash in xfstests 074, but this is no different from a
    vanilla kernel.

    Signed-off-by: Curt Wohlgemuth
    Signed-off-by: "Theodore Ts'o"
    Cc: stable@kernel.org

    Curt Wohlgemuth
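
The ordering problem described above can be shown with a reduced sketch. The struct is hypothetical, and every check after the first two is collapsed into a single `writeback_mode` field, so this is not the real ext4 function; it only shows why the no-journal test has to come first.

```c
#include <stdbool.h>

/* Hypothetical, reduced view of the inode state the checks look at. */
struct wb_inode_sketch {
    bool has_journal;     /* false in no-journal mode */
    bool is_regular;      /* false for directories, symlinks, ... */
    bool writeback_mode;  /* stands in for all remaining checks */
};

/* Old order: non-regular files get 0 even when there is no journal. */
static int should_writeback_buggy(const struct wb_inode_sketch *i)
{
    if (!i->is_regular)
        return 0;
    if (!i->has_journal)
        return 1;
    return i->writeback_mode;
}

/* Fixed order: without a journal, always report writeback. */
static int should_writeback_fixed(const struct wb_inode_sketch *i)
{
    if (!i->has_journal)
        return 1;
    if (!i->is_regular)
        return 0;
    return i->writeback_mode;
}
```

For a directory on a no-journal filesystem, the buggy order returns 0 (selecting the journalled aops), while the fixed order returns 1.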