13 Sep, 2011

3 commits

  • * 'for-linus' of git://github.com/chrismason/linux:
    Btrfs: add dummy extent if dst offset excceeds file end in
    Btrfs: calc file extent num_bytes correctly in file clone
    btrfs: xattr: fix attribute removal
    Btrfs: fix wrong nbytes information of the inode
    Btrfs: fix the file extent gap when doing direct IO
    Btrfs: fix unclosed transaction handle in btrfs_cont_expand
    Btrfs: fix misuse of trans block rsv
    Btrfs: reset to appropriate block rsv after orphan operations
    Btrfs: skip locking if searching the commit root in csum lookup
    btrfs: fix warning in iput for bad-inode
    Btrfs: fix an oops when deleting snapshots

    Linus Torvalds
     
  • kmemleak is reporting that 32 bytes are being leaked by FUSE:

    unreferenced object 0xe373b270 (size 32):
    comm "fusermount", pid 1207, jiffies 4294707026 (age 2675.187s)
    hex dump (first 32 bytes):
    01 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    backtrace:
    [] kmemleak_alloc+0x27/0x50
    [] kmem_cache_alloc+0xc5/0x180
    [] fuse_alloc_forget+0x1e/0x20
    [] fuse_alloc_inode+0xb0/0xd0
    [] alloc_inode+0x1c/0x80
    [] iget5_locked+0x8f/0x1a0
    [] fuse_iget+0x72/0x1a0
    [] fuse_get_root_inode+0x8a/0x90
    [] fuse_fill_super+0x3ef/0x590
    [] mount_nodev+0x3f/0x90
    [] fuse_mount+0x15/0x20
    [] mount_fs+0x1c/0xc0
    [] vfs_kern_mount+0x41/0x90
    [] do_kern_mount+0x39/0xd0
    [] do_mount+0x2e5/0x660
    [] sys_mount+0x66/0xa0

    This leak report is consistent and happens once per boot on
    3.1.0-rc5-dirty.

    This happens if a FORGET request is queued after the fuse device was
    released.

    Reported-by: Sitsofe Wheeler
    Signed-off-by: Miklos Szeredi
    Tested-by: Sitsofe Wheeler
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Commit 37fb3a30b4 ("fuse: fix flock") added in 3.1-rc4 caused flock() to
    fail with ENOSYS with the kernel ABI version 7.16 or earlier.

    Fix by falling back to testing FUSE_POSIX_LOCKS for ABI versions 7.16
    and earlier.

    Reported-by: Martin Ziegler
    Signed-off-by: Miklos Szeredi
    Tested-by: Martin Ziegler
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     

11 Sep, 2011

12 commits

  • You can see there's no file extent with range [0, 4096]. Check this by
    btrfsck:

    # btrfsck /dev/sda7
    root 5 inode 258 errors 100
    ...

    Signed-off-by: Li Zefan
    Signed-off-by: Chris Mason

    Li Zefan
     
  • num_bytes should be 4096 not 12288.

    Signed-off-by: Li Zefan
    Signed-off-by: Chris Mason

    Li Zefan
     
  • An attribute is not removed by 'setfattr -x attr file' and remains
    visible in attr list. This makes xfstests/062 pass again.

    Signed-off-by: David Sterba
    Signed-off-by: Chris Mason

    David Sterba
     
  • If we write some data into the data hole of the file(no preallocation for this
    hole), Btrfs will allocate some disk space, and update nbytes of the inode, but
    the other element--disk_i_size needn't be updated. At this condition, we must
    update inode metadata though disk_i_size is not changed(btrfs_ordered_update_i_size()
    return 1).

    # mkfs.btrfs /dev/sdb1
    # mount /dev/sdb1 /mnt
    # touch /mnt/a
    # truncate -s 856002 /mnt/a
    # dd if=/dev/zero of=/mnt/a bs=4K count=1 conv=nocreat,notrunc
    # umount /mnt
    # btrfsck /dev/sdb1
    root 5 inode 257 errors 400
    found 32768 bytes used err is 1

    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie
     
  • When we write some data to the place that is beyond the end of the file
    in direct I/O mode, a data hole will be created. And Btrfs should insert
    a file extent item that point to this hole into the fs tree. But unfortunately
    Btrfs forgets doing it.

    The following is a simple way to reproduce it:
    # mkfs.btrfs /dev/sdc2
    # mount /dev/sdc2 /test4
    # touch /test4/a
    # dd if=/dev/zero of=/test4/a seek=8 count=1 bs=4K oflag=direct conv=nocreat,notrunc
    # umount /test4
    # btrfsck /dev/sdc2
    root 5 inode 257 errors 100

    Reported-by: Tsutomu Itoh
    Signed-off-by: Miao Xie
    Tested-by: Tsutomu Itoh
    Signed-off-by: Chris Mason

    Miao Xie
     
  • The function - btrfs_cont_expand() forgot to close the transaction handle before
    it jump out the while loop. Fix it.

    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie
     
  • At the beginning of create_pending_snapshot, trans->block_rsv is set
    to pending->block_rsv and is used for snapshot things, however, when
    it is done, we do not recover it as will.

    Signed-off-by: Liu Bo
    Signed-off-by: Chris Mason

    Liu Bo
     
  • While truncating free space cache, we forget to change trans->block_rsv
    back to the original one, but leave it with the orphan_block_rsv, and
    then with option inode_cache enable, it leads to countless warnings of
    btrfs_alloc_free_block and btrfs_orphan_commit_root:

    WARNING: at fs/btrfs/extent-tree.c:5711 btrfs_alloc_free_block+0x180/0x350 [btrfs]()
    ...
    WARNING: at fs/btrfs/inode.c:2193 btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]()

    Signed-off-by: Liu Bo
    Signed-off-by: Chris Mason

    Liu Bo
     
  • It's not enough to just search the commit root, since we could be cow'ing the
    very block we need to search through, which would mean that its locked and we'll
    still deadlock. So use path->skip_locking as well. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • iput() shouldn't be called for inodes in I_NEW state.
    We need to mark inode as constructed first.

    WARNING: at fs/inode.c:1309 iput+0x20b/0x210()
    Call Trace:
    [] warn_slowpath_common+0x7a/0xb0
    [] warn_slowpath_null+0x15/0x20
    [] iput+0x20b/0x210
    [] btrfs_iget+0x1eb/0x4a0
    [] btrfs_run_defrag_inodes+0x136/0x210
    [] cleaner_kthread+0x17f/0x1a0
    [] ? sub_preempt_count+0x9d/0xd0
    [] ? transaction_kthread+0x280/0x280
    [] kthread+0x96/0xa0
    [] kernel_thread_helper+0x4/0x10
    [] ? kthread_worker_fn+0x190/0x190
    [] ? gs_change+0xb/0xb

    Signed-off-by: Sergei Trofimovich
    CC: Konstantin Khlebnikov
    Tested-by: David Sterba
    CC: Josef Bacik
    CC: Chris Mason
    Signed-off-by: Chris Mason

    Sergei Trofimovich
     
  • We can reproduce this oops via the following steps:

    $ mkfs.btrfs /dev/sdb7
    $ mount /dev/sdb7 /mnt/btrfs
    $ for ((i=0; ii_ino
    to BTRFS_EMPTY_SUBVOL_DIR_OBJECTID instead of BTRFS_FIRST_FREE_OBJECTID,
    while the snapshot's location.objectid remains unchanged.

    However, btrfs_ino() does not take this into account, and returns a wrong ino,
    and causes the oops.

    Signed-off-by: Liu Bo
    Signed-off-by: Chris Mason

    Liu Bo
     
  • * 'for-linus' of git://neil.brown.name/md:
    md: Fix handling for devices from 2TB to 4TB in 0.90 metadata.
    md/raid1,10: Remove use-after-free bug in make_request.
    md/raid10: unify handling of write completion.
    Avoid dereferencing a 'request_queue' after last close.

    Linus Torvalds
     

10 Sep, 2011

3 commits

  • On the last close of an 'md' device which as been stopped, the device
    is destroyed and in particular the request_queue is freed. The free
    is done in a separate thread so it might happen a short time later.

    __blkdev_put calls bdev_inode_switch_bdi *after* ->release has been
    called.

    Since commit f758eeabeb96f878c860e8f110f94ec8820822a9
    bdev_inode_switch_bdi will dereference the 'old' bdi, which lives
    inside a request_queue, to get a spin lock. This causes the last
    close on an md device to sometime take a spin_lock which lives in
    freed memory - which results in an oops.

    So move the called to bdev_inode_switch_bdi before the call to
    ->release.

    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Cc: Andrew Morton
    Cc: Wu Fengguang
    Acked-by: Wu Fengguang
    Cc: stable@kernel.org
    Signed-off-by: NeilBrown

    NeilBrown
     
  • * 'for-linus' of git://ceph.newdream.net/git/ceph-client:
    libceph: fix leak of osd structs during shutdown
    ceph: fix memory leak
    ceph: fix encoding of ino only (not relative) paths
    libceph: fix msgpool

    Linus Torvalds
     
  • Prior to 2.6.38 automount would not trigger on either stat(2) or
    lstat(2) on the automount point.

    After 2.6.38, with the introduction of the ->d_automount()
    infrastructure, stat(2) and others would start triggering automount
    while lstat(2), etc. still would not. This is a regression and a
    userspace ABI change.

    Problem originally reported here:

    http://thread.gmane.org/gmane.linux.kernel.autofs/6098

    It appears that there was an attempt at fixing various userspace tools
    to not trigger the automount. But since the stat system call is
    rather common it is impossible to "fix" all userspace.

    This patch reverts the original behavior, which is to not trigger on
    stat(2) and other symlink following syscalls.

    [ It's not really clear what the right behavior is. Apparently Solaris
    does the "automount on stat, leave alone on lstat". And some programs
    can get unhappy when "stat+open+fstat" ends up giving a different
    result from the fstat than from the initial stat.

    But the change in 2.6.38 resulted in problems for some people, so
    we're going back to old behavior. Maybe we can re-visit this
    discussion at some future date - Linus ]

    Reported-by: Leonardo Chiquitto
    Signed-off-by: Miklos Szeredi
    Acked-by: Ian Kent
    Cc: David Howells
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     

08 Sep, 2011

1 commit


06 Sep, 2011

5 commits


02 Sep, 2011

1 commit


01 Sep, 2011

3 commits

  • Currently we always redirty an inode that was attempted to be written out
    synchronously but has been cleaned by an AIL pushed internall, which is
    rather bogus. Fix that by doing the i_update_core check early on and
    return 0 for it. Also include async calls for it, as doing any work for
    those is just as pointless. While we're at it also fix the sign for the
    EIO return in case of a filesystem shutdown, and fix the completely
    non-sensical locking around xfs_log_inode.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder
    (cherry picked from commit 297db93bb74cf687510313eb235a7aec14d67e97)

    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • During umount we do not add a dirty inode to the lru and wait for it to
    become clean first, but force writeback of data and metadata with
    I_WILL_FREE set. Currently there is no way for XFS to detect that the
    inode has been redirtied for metadata operations, as we skip the
    mark_inode_dirty call during teardown. Fix this by setting i_update_core
    nanually in that case, so that the inode gets flushed during inode reclaim.

    Alternatively we could enable calling mark_inode_dirty for inodes in
    I_WILL_FREE state, and let the VFS dirty tracking handle this. I decided
    against this as we will get better I/O patterns from reclaim compared to
    the synchronous writeout in write_inode_now, and always marking the inode
    dirty in some way from xfs_mark_inode_dirty is a better safetly net in
    either case.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder
    (cherry picked from commit da6742a5a4cc844a9982fdd936ddb537c0747856)

    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • * tag 'for_linus-20110831' of git://github.com/tytso/ext4:
    ext4: remove i_mutex lock in ext4_evict_inode to fix lockdep complaining

    Linus Torvalds
     

31 Aug, 2011

1 commit

  • The i_mutex lock and flush_completed_IO() added by commit 2581fdc810
    in ext4_evict_inode() causes lockdep complaining about potential
    deadlock in several places. In most/all of these LOCKDEP complaints
    it looks like it's a false positive, since many of the potential
    circular locking cases can't take place by the time the
    ext4_evict_inode() is called; but since at the very least it may mask
    real problems, we need to address this.

    This change removes the flush_completed_IO() and i_mutex lock in
    ext4_evict_inode(). Instead, we take a different approach to resolve
    the software lockup that commit 2581fdc810 intends to fix. Rather
    than having ext4-dio-unwritten thread wait for grabing the i_mutex
    lock of an inode, we use mutex_trylock() instead, and simply requeue
    the work item if we fail to grab the inode's i_mutex lock.

    This should speed up work queue processing in general and also
    prevents the following deadlock scenario: During page fault,
    shrink_icache_memory is called that in turn evicts another inode B.
    Inode B has some pending io_end work so it calls ext4_ioend_wait()
    that waits for inode B's i_ioend_count to become zero. However, inode
    B's ioend work was queued behind some of inode A's ioend work on the
    same cpu's ext4-dio-unwritten workqueue. As the ext4-dio-unwritten
    thread on that cpu is processing inode A's ioend work, it tries to
    grab inode A's i_mutex lock. Since the i_mutex lock of inode A is
    still hold before the page fault happened, we enter a deadlock.

    Signed-off-by: Jiaying Zhang
    Signed-off-by: "Theodore Ts'o"

    Jiaying Zhang
     

27 Aug, 2011

1 commit


26 Aug, 2011

1 commit

  • Purely in-memory filesystems do not use the inode hash as the dcache
    tells us if an entry already exists. As a result, they do not call
    unlock_new_inode, and thus directory inodes do not get put into a
    different lockdep class for i_sem.

    We need the different lockdep classes, because the locking order for
    i_mutex is different for directory inodes and regular inodes. Directory
    inodes can do "readdir()", which takes i_mutex *before* possibly taking
    mm->mmap_sem (due to a page fault while copying the directory entry to
    user space).

    In contrast, regular inodes can be mmap'ed, which takes mm->mmap_sem
    before accessing i_mutex.

    The two cases can never happen for the same inode, so no real deadlock
    can occur, but without the different lockdep classes, lockdep cannot
    understand that. As a result, if CONFIG_DEBUG_LOCK_ALLOC is set, this
    can lead to false positives from lockdep like below:

    find/645 is trying to acquire lock:
    (&mm->mmap_sem){++++++}, at: [] might_fault+0x5c/0xac

    but task is already holding lock:
    (&sb->s_type->i_mutex_key#15){+.+.+.}, at: []
    vfs_readdir+0x5b/0xb4

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (&sb->s_type->i_mutex_key#15){+.+.+.}:
    [] lock_acquire+0xbf/0x103
    [] __mutex_lock_common+0x4c/0x361
    [] mutex_lock_nested+0x40/0x45
    [] hugetlbfs_file_mmap+0x82/0x110
    [] mmap_region+0x258/0x432
    [] do_mmap_pgoff+0x2ac/0x306
    [] sys_mmap_pgoff+0x118/0x16a
    [] sys_mmap+0x22/0x24
    [] system_call_fastpath+0x16/0x1b

    -> #0 (&mm->mmap_sem){++++++}:
    [] __lock_acquire+0xa1a/0xcf7
    [] lock_acquire+0xbf/0x103
    [] might_fault+0x89/0xac
    [] filldir+0x6f/0xc7
    [] dcache_readdir+0x67/0x205
    [] vfs_readdir+0x7b/0xb4
    [] sys_getdents+0x7e/0xd1
    [] system_call_fastpath+0x16/0x1b

    This patch moves the directory vs file lockdep annotation into a helper
    function that can be called by in-memory filesystems and has hugetlbfs
    call it.

    Signed-off-by: Josh Boyer
    Acked-by: Peter Zijlstra
    Signed-off-by: Linus Torvalds

    Josh Boyer
     

25 Aug, 2011

2 commits


24 Aug, 2011

2 commits


23 Aug, 2011

2 commits

  • The code really requires the current source directory to be in the
    header search path. We already do this if building with an object
    tree separate from the source, but it needs to be added manually
    if building inside the source. The cflags addition for it accidentally
    got removed when collapsing the xfs directory structure.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • kfree does not clean up indirect allocations in
    ceph_fs_client and ceph_options (e.g. snapdir_name).

    Signed-off-by: Noah Watkins
    Signed-off-by: Sage Weil

    Noah Watkins
     

21 Aug, 2011

2 commits

  • This fixes a regression introduced by commit cdcb725c05fe ("Btrfs: check
    if there is enough space for balancing smarter"). We can't do 64-bit
    divides on 32-bit architectures.

    In cases where we need to divide/multiply by 2 we should just left/right
    shift respectively, and in cases where theres N number of devices use
    do_div. Also make the counters u64 to match up with rw_devices.
    Thanks,

    Signed-off-by: Josef Bacik
    Acked-and-tested-by: Ingo Molnar
    Signed-off-by: Linus Torvalds

    Josef Bacik
     
  • * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    ext4: flush any pending end_io requests before DIO reads w/dioread_nolock
    ext4: fix nomblk_io_submit option so it correctly converts uninit blocks
    ext4: Resolve the hang of direct i/o read in handling EXT4_IO_END_UNWRITTEN.
    ext4: call ext4_ioend_wait and ext4_flush_completed_IO in ext4_evict_inode
    ext4: Fix ext4_should_writeback_data() for no-journal mode

    Linus Torvalds
     

20 Aug, 2011

1 commit

  • There is a race between ext4 buffer write and direct_IO read with
    dioread_nolock mount option enabled. The problem is that we clear
    PageWriteback flag during end_io time but will do
    uninitialized-to-initialized extent conversion later with dioread_nolock.
    If an O_direct read request comes in during this period, ext4 will return
    zero instead of the recently written data.

    This patch checks whether there are any pending uninitialized-to-initialized
    extent conversion requests before doing O_direct read to close the race.
    Note that this is just a bandaid fix. The fundamental issue is that we
    clear PageWriteback flag before we really complete an IO, which is
    problem-prone. To fix the fundamental issue, we may need to implement an
    extent tree cache that we can use to look up pending to-be-converted extents.

    Signed-off-by: Jiaying Zhang
    Signed-off-by: "Theodore Ts'o"
    Cc: stable@kernel.org

    Jiaying Zhang