12 Nov, 2009

18 commits

  • Because of an integer overflow on start_blk, various kinds of wrong results
    would be returned by the generic_block_fiemap() handler, such as no
    extents when there is a 4GB+ hole at the beginning of the file, or wrong
    fe_logical when an extent starts after the first 4GB.

    Signed-off-by: Mike Hommey
    Cc: Alexander Viro
    Cc: Steven Whitehouse
    Cc: Theodore Ts'o
    Cc: Eric Sandeen
    Cc: Josef Bacik
    Cc: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Hommey
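
A minimal sketch of the kind of overflow described above, assuming the block number is converted to a byte offset with a shift. The helper names are hypothetical and are not taken from fs/ioctl.c; they only contrast 32-bit and 64-bit arithmetic.

```c
#include <stdint.h>

/* Hypothetical helper: block number -> byte offset using 32-bit arithmetic.
 * The shift wraps modulo 2^32, so any offset at or beyond 4GB is lost. */
uint64_t blk_to_bytes_narrow(uint32_t blk, unsigned int blkbits)
{
    uint32_t bytes = blk << blkbits;    /* result truncated to 32 bits */
    return bytes;
}

/* Same conversion with the block number widened before the shift. */
uint64_t blk_to_bytes_wide(uint32_t blk, unsigned int blkbits)
{
    return (uint64_t)blk << blkbits;
}
```

With 4k blocks (blkbits = 12), block 1048576 sits exactly at the 4GB mark: the narrow version yields 0, while the wide version yields 4294967296.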
     
  • In setup_arg_pages we work hard to assign a value to ret, but on exit we
    always return 0.

    Also remove a now duplicated exit path and branch to out_unlock instead.

    Signed-off-by: Anton Blanchard
    Acked-by: Serge Hallyn
    Reviewed-by: WANG Cong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anton Blanchard
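
The pattern being fixed can be sketched as follows. This is an illustration only: step(), setup_buggy() and setup_fixed() are hypothetical names, and the real setup_arg_pages() does far more work between the assignment and the exit path.

```c
#include <errno.h>

/* Hypothetical stand-in for a setup step that can fail. */
int step(int should_fail)
{
    return should_fail ? -ENOMEM : 0;
}

/* Buggy shape: ret is assigned, but the exit path returns a constant 0. */
int setup_buggy(int should_fail)
{
    int ret;

    ret = step(should_fail);
    if (ret)
        goto out_unlock;
    /* ... further work would go here ... */
out_unlock:
    (void)ret;      /* the error code is dropped */
    return 0;
}

/* Fixed shape: a single exit path that returns whatever ret holds. */
int setup_fixed(int should_fail)
{
    int ret;

    ret = step(should_fail);
    if (ret)
        goto out_unlock;
    ret = 0;
out_unlock:
    return ret;
}
```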
     
  • For FS_IOC_RESVSP and FS_IOC_RESVSP64 compat_sys_ioctl() uses its
    arg argument as a pointer to userspace. However, it is missing a
    call to compat_ptr(), which does the proper pointer conversion.

    This was introduced with 3e63cbb1 "fs: Add new pre-allocation ioctls
    to vfs for compatibility with legacy xfs ioctls".

    Signed-off-by: Heiko Carstens
    Cc: Ankit Jain
    Acked-by: Christoph Hellwig
    Cc: Al Viro
    Acked-by: Arnd Bergmann
    Acked-by: David S. Miller
    Cc: [2.6.31.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
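
A rough userspace sketch of why the conversion matters, assuming an architecture whose compat tasks use 31-bit addresses, so the top bit of the 32-bit value is not part of the address. sketch_compat_ptr() is a hypothetical stand-in, not the kernel's compat_ptr(); on other architectures the conversion may be a plain widening cast.

```c
#include <stdint.h>

/* Hypothetical model of a compat pointer conversion for a 31-bit
 * address space: mask off the top bit, then widen to 64 bits. */
uint64_t sketch_compat_ptr(uint32_t uptr)
{
    return (uint64_t)(uptr & 0x7fffffffu);
}
```

Under this assumption, using the raw value 0x80001000 as a pointer would address the wrong location, while the converted value is 0x1000.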
     
  • Daniel Lezcano reported a leak in 'struct pid' and 'struct pid_namespace'
    that is discussed in:

    http://lkml.org/lkml/2009/10/2/159.

    To summarize the thread, when container-init is terminated, it sets the
    PF_EXITING flag, zaps other processes in the container and waits to reap
    them. As a part of reaping, the container-init should flush any /proc
    dentries associated with the processes. But because the container-init is
    itself exiting, the PF_EXITING check skips the flush, so the dentries are
    not flushed, resulting in a leak of /proc inodes and dentries.

    This fix reverts the commit 7766755a2f249e7e0 ("Fix /proc dcache deadlock
    in do_exit") which introduced the check for PF_EXITING. At the time of
    the commit, shrink_dcache_parent() flushed dentries from other filesystems
    also and could have caused a deadlock which the commit fixed. But as
    pointed out by Eric Biederman, after commit 0feae5c47aabdde59,
    shrink_dcache_parent() no longer affects other filesystems. So reverting
    the commit is now safe.

    As pointed out by Jan Kara, the leak is not as critical since the
    unclaimed space will be reclaimed under memory pressure or by:

    echo 3 > /proc/sys/vm/drop_caches

    But since this check is no longer required, it's best to remove it.

    Signed-off-by: Sukadev Bhattiprolu
    Reported-by: Daniel Lezcano
    Acked-by: Eric W. Biederman
    Acked-by: Jan Kara
    Cc: Andrea Arcangeli
    Cc: Serge Hallyn
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     
  • This fixes:
    ERROR: "log_start_commit" [fs/ext3/ext3.ko] undefined!

    Signed-off-by: Stefan Schmidt

    Stefan Schmidt
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
    Btrfs: fix panic when trying to destroy a newly allocated
    Btrfs: allow more metadata chunk preallocation
    Btrfs: fallback on uncompressed io if compressed io fails
    Btrfs: find ideal block group for caching
    Btrfs: avoid null deref in unpin_extent_cache()
    Btrfs: skip btrfs_release_path in btrfs_update_root and btrfs_del_root
    Btrfs: fix some metadata enospc issues
    Btrfs: fix how we set max_size for free space clusters
    Btrfs: cleanup transaction starting and fix journal_info usage
    Btrfs: fix data allocation hint start

    Linus Torvalds
     
  • There is a problem where iget5_locked will look for an inode, not find it, and
    then subsequently try to allocate it. Another CPU will have raced in and
    allocated the inode instead, so when iget5_locked gets the inode spin lock again
    and does a search, it finds the new inode. So it goes ahead and calls
    destroy_inode on the inode it just allocated. The problem is we don't set
    BTRFS_I(inode)->root until the new inode is completely initialized. This patch
    makes us set root to NULL when alloc'ing a new inode, so when we get to
    btrfs_destroy_inode and we see that root is NULL we can just free up the memory
    and continue on. This fixes the panic

    http://www.kerneloops.org/submitresult.php?number=812690

    Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6:
    JBD/JBD2: free j_wbuf if journal init fails.
    ext3: Wait for proper transaction commit on fsync
    ext3: retry failed direct IO allocations

    Linus Torvalds
     
  • * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    ext4: partial revert to fix double brelse WARNING()
    ext4: Fix return value of ext4_split_unwritten_extents() to fix direct I/O
    ext4: code clean up for dio fallocate handling
    ext4: skip conversion of uninit extents after direct IO if there isn't any
    ext4: fix ext4_ext_direct_IO()'s return value after converting uninit extents
    ext4: discard preallocation when restarting a transaction during truncate

    Linus Torvalds
     
  • On an FS where all of the space has not been allocated into chunks yet,
    the enospc code can return ENOSPC just because the existing metadata chunks
    are full.

    We get around this by allowing more metadata chunks to be allocated up
    to a certain limit, and finding the right limit is a little fuzzy. The
    problem is the reservations for delalloc would preallocate way too much
    of the FS as metadata. We need to start saying no and just force some
    IO to happen.

    But we also need to let a reasonable amount of the FS become metadata.
    This bumps the hard limit up; later releases will have a better system.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • Currently compressed IO does not deal with not having its entire extent able to
    be allocated. So if we have enough free space to allocate for the extent, but
    it's not contiguous, it will fail spectacularly. This patch fixes this by
    falling back on uncompressed IO, which lets us spread the delalloc extent across
    multiple extents. I tested this by making us randomly think the reservation had
    failed, forcing the fallback to the uncompressed IO path, and it seemed to work
    fine. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • This patch changes a few things. Hopefully the comments are helpful, but
    I'll try to be verbose here as well.

    Problem:

    My fedora box was taking 1 minute and 21 seconds to boot with btrfs as root.
    Part of this problem was that we pick the first block group we can find and
    start caching it, even if it may not have enough free space. The other
    problem is that we only search for cached block groups the first time
    around, and we won't find any because this is a newly mounted fs. So we end
    up caching several block groups during bootup, which with a lot of
    fragmentation takes around 30-45 seconds to complete and bogs down the
    system.

    Solution:

    1) Don't cache block groups willy-nilly at first. Instead try and figure out
    which block group has the most free space, and therefore will take the least amount
    of time to cache.

    2) Don't be so picky about cached block groups. The other problem is once
    we've filled up a cluster, if the block group isn't finished caching the next
    time we try and do the allocation we'll completely ignore the cluster and
    start searching from the beginning of the space, which makes us cache more
    block groups, which slows us down even more. So instead of skipping block
    groups that are not finished caching when we have a hint, only skip the block
    group if it hasn't started caching yet.

    There is one other tweak in here. Before if we allocated a chunk and still
    couldn't find new space, we'd end up switching the space info to force another
    chunk allocation. This could make us end up with way too many chunks, so keep
    track of this particular case.

    With this patch and my previous cluster fixes my fedora box now boots in 43
    seconds, and according to the bootchart is not held up by our block group
    caching at all.

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
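
Point 1 of the solution can be sketched as a simple scan for the group with the most free space. This is a simplified illustration: struct bg_sketch and pick_ideal_group() are hypothetical, and the actual heuristic in the btrfs allocator may weigh other factors.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, stripped-down view of a block group. */
struct bg_sketch {
    uint64_t free_bytes;
};

/* Return the index of the group with the most free space,
 * or -1 when there are no groups to choose from. */
int pick_ideal_group(const struct bg_sketch *groups, size_t n)
{
    int best = -1;

    for (size_t i = 0; i < n; i++) {
        if (best < 0 || groups[i].free_bytes > groups[best].free_bytes)
            best = (int)i;
    }
    return best;
}
```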
     
  • I reordered the checks to avoid dereferencing "em" if it was null.

    Found by smatch static checker.

    Signed-off-by: Dan Carpenter
    Signed-off-by: Chris Mason

    Dan Carpenter
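
The general shape of such a reordering, as a sketch: the null test comes first, so the short-circuit && guarantees the pointer is only dereferenced when it is valid. struct em_sketch and em_matches() are hypothetical and do not mirror the actual code in unpin_extent_cache().

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, stripped-down extent map. */
struct em_sketch {
    uint64_t start;
};

/* Returns 1 when em exists and begins at 'start', 0 otherwise.
 * The null check must be the left operand of &&. */
int em_matches(const struct em_sketch *em, uint64_t start)
{
    return em != NULL && em->start == start;
}
```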
     
  • We don't need to call btrfs_release_path because btrfs_free_path will do
    that for us.

    Signed-off-by: Li Dongyang
    Signed-off-by: Chris Mason

    Li Dongyang
     
  • We weren't reserving metadata space for rename, rmdir and unlink, which could
    cause problems.

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • This patch fixes a problem where max_size can be set to 0 even though we
    filled the cluster properly. We set max_size to 0 if we restart the cluster
    window, but if the new start entry is big enough to be our new cluster then we
    could return with a max_size set to 0, which will mean the next time we try to
    allocate from this cluster it will fail. So set max_extent to the entry's
    size. Tested this on my box and now we actually allocate from the cluster
    after we fill it. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • We use journal_info to tell if we're in a nested transaction to make sure we
    don't commit the transaction within a nested transaction. We use another
    method to see if there are any outstanding ioctl trans handles, so if we're
    starting one do not set current->journal_info, since it will screw with other
    filesystems. This patch also cleans up the starting stuff so there aren't any
    magic numbers.

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • Sometimes our start allocation hint when we cow a file can be either
    EXTENT_HOLE or some other such place holder, which is not optimal. So if we
    find that our em->block_start is one of these special values, check to see
    where the first block of the inode is stored, and use that as a hint. If that
    block is also a special value, just fall back on a hint of 0 and let the
    allocator figure out a good place to put the data.

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

11 Nov, 2009

3 commits

  • If journal init fails, we need to free j_wbuf.

    Cc: Andrew Morton
    Cc: Jan Kara
    Signed-off-by: Tao Ma
    Signed-off-by: Jan Kara

    Tao Ma
     
  • We cannot rely on buffer dirty bits during fsync because pdflush can come
    before fsync is called and clear dirty bits without forcing a transaction
    commit. Instead, we track which transaction last changed the inode and
    which transaction last changed its allocation, and force the relevant
    transaction to disk on fsync.

    Signed-off-by: Jan Kara
    Reviewed-by: Aneesh Kumar K.V

    Jan Kara
     
  • On a 256M 4k block filesystem, doing this in a loop:

    dd if=/dev/zero of=test oflag=direct bs=1M count=64
    rm -f test

    eventually leads to spurious ENOSPC:

    dd: writing `test': No space left on device

    As with other block allocation callers, it looks like we need to
    potentially retry the allocations on the initial ENOSPC.

    A similar patch went into ext4 (commit
    fbbf69456619de5d251cb9f1df609069178c62d5)

    Signed-off-by: Eric Sandeen
    Signed-off-by: Jan Kara

    Eric Sandeen
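
The retry idea can be sketched in userspace as below. Everything here is hypothetical: try_alloc() fakes an allocator whose space only becomes available after a few attempts, standing in for freed blocks that are not reusable until a commit, and alloc_with_retry() shows the bounded goto-retry loop. It is not the ext3 code.

```c
#include <errno.h>

/* Hypothetical allocator: fails with -ENOSPC while *pending > 0.
 * Each failed attempt decrements *pending, as if a commit had
 * released some of the recently freed blocks. */
int try_alloc(int *pending)
{
    if (*pending > 0) {
        (*pending)--;
        return -ENOSPC;
    }
    return 0;
}

/* Retry the allocation on ENOSPC, at most max_retries extra times. */
int alloc_with_retry(int *pending, int max_retries)
{
    int retries = 0;
    int ret;

retry:
    ret = try_alloc(pending);
    if (ret == -ENOSPC && retries++ < max_retries)
        goto retry;
    return ret;
}
```

The retry count is bounded so that a genuinely full filesystem still reports ENOSPC instead of looping forever.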
     

10 Nov, 2009

2 commits


09 Nov, 2009

1 commit

  • This is a partial revert of commit 6487a9d (only the changes made to
    fs/ext4/namei.c), since it is causing the following brelse()
    double-free warning when running fsstress on a file system with 1k
    blocksize and we run into a block allocation failure while converting
    a single-block directory to a multi-block hash-tree indexed directory.

    WARNING: at fs/buffer.c:1197 __brelse+0x2e/0x33()
    Hardware name:
    VFS: brelse: Trying to free free buffer
    Modules linked in:
    Pid: 2226, comm: jbd2/sdd-8 Not tainted 2.6.32-rc6-00577-g0003f55 #101
    Call Trace:
    [] warn_slowpath_common+0x65/0x95
    [] warn_slowpath_fmt+0x29/0x2c
    [] __brelse+0x2e/0x33
    [] jbd2_journal_refile_buffer+0x67/0x6c
    [] jbd2_journal_commit_transaction+0x319/0x14d8
    [] ? try_to_del_timer_sync+0x58/0x60
    [] ? sched_clock_cpu+0x12a/0x13e
    [] ? trace_hardirqs_off+0xb/0xd
    [] ? cpu_clock+0x3f/0x5b
    [] ? lock_release_holdtime+0x36/0x137
    [] ? _spin_unlock_irqrestore+0x44/0x51
    [] ? trace_hardirqs_on_caller+0x103/0x124
    [] ? trace_hardirqs_on+0xb/0xd
    [] ? try_to_del_timer_sync+0x58/0x60
    [] kjournald2+0x11a/0x310
    [] ? autoremove_wake_function+0x0/0x38
    [] ? kjournald2+0x0/0x310
    [] kthread+0x66/0x6b
    [] ? kthread+0x0/0x6b
    [] kernel_thread_helper+0x7/0x10
    ---[ end trace 5579351b86af61e3 ]---

    Commit 6487a9d was an attempt to fix some buffer head leaks in an ENOSPC
    error path, but in some cases it actually results in an excess brelse(),
    as shown above. Fixing this means moving the responsibility for
    releasing the buffer heads from the callee to the caller of
    add_dirent_to_buf().

    Since that's a relatively complex change, and we're late in the rcX
    development cycle, I'm reverting this now, and holding back a more
    complete fix until after 2.6.32 ships. We've lived with this
    buffer_head leak on ENOSPC in ext3 and ext4 for a very long time; a
    few more months won't kill us.

    Signed-off-by: "Theodore Ts'o"
    Cc: Curt Wohlgemuth

    Theodore Ts'o
     

08 Nov, 2009

2 commits

  • This fixes an -rc1 regression brought by the commit:
    1cf58fa840472ec7df6bf2312885949ebb308853 ("nilfs2: shorten freeze
    period due to GC in write operation v3").

    Although the patch moved the call to nilfs_ioctl_move_blocks() out of
    nilfs_ioctl_prepare_clean_segments() and into
    nilfs_ioctl_clean_segments(), it didn't move the corresponding cleanup
    job needed for the error case.

    This moves the missing cleanup job to the destination function.

    Signed-off-by: Ryusuke Konishi
    Acked-by: Jiro SEKIBA

    Ryusuke Konishi
     
  • This fixes a kernel oops reported by Markus Trippelsdorf in the email
    titled "[NILFS users] kernel Oops while running nilfs_cleanerd".

    The oops was caused by a bug in the error path of the
    nilfs_ioctl_move_blocks() function, which was inlined in
    nilfs_ioctl_clean_segments().

    nilfs_ioctl_move_blocks checks duplication of blocks which will be
    moved in garbage collection. But the check should have been done
    within nilfs_ioctl_move_inode_block() to prevent list corruption among
    buffers storing the target blocks.

    To fix the kernel oops, this moves forward the duplication check
    before the list insertion.

    I also tested this for stable trees [2.6.30, 2.6.31].

    Reported-by: Markus Trippelsdorf
    Signed-off-by: Ryusuke Konishi
    Cc: stable

    Ryusuke Konishi
     

07 Nov, 2009

2 commits

  • Because it's lighter weight, CIFS tries to use CIFSGetSrvInodeNumber to
    verify the accessibility of the root inode and then falls back to doing a
    full QPathInfo if that fails with -EOPNOTSUPP. I have at least one report
    of a server that returns NT_STATUS_INTERNAL_ERROR rather than something
    that translates to EOPNOTSUPP.

    Rather than trying to be clever with that call, just have
    is_path_accessible do a normal QPathInfo. That call is widely
    supported and it shouldn't increase the overhead significantly.

    Cc: Stable
    Signed-off-by: Jeff Layton
    Signed-off-by: Steve French

    Jeff Layton
     
  • It's possible that a server will return a valid FileID when we query the
    FILE_INTERNAL_INFO for the root inode, but then zeroed out inode numbers
    when we do a FindFile with an infolevel of
    SMB_FIND_FILE_ID_FULL_DIR_INFO.

    In this situation turn off querying for server inode numbers, generate a
    warning for the user and just generate an inode number using iunique.
    Once we generate any inode number with iunique we can no longer use any
    server inode numbers or we risk collisions, so ensure that we don't do
    that in cifs_get_inode_info either.

    Cc: Stable
    Reported-by: Timothy Normand Miller
    Signed-off-by: Jeff Layton
    Signed-off-by: Steve French

    Jeff Layton
     

06 Nov, 2009

5 commits


05 Nov, 2009

1 commit


04 Nov, 2009

6 commits

  • This patch fixes two issues in the procfs stack information on
    x86-64 linux.

    The 32 bit loader compat_do_execve did not store the stack
    start (this was figured out by Alexey Dobriyan).

    The stack information on an x86_64 kernel always shows 0 kbyte
    stack usage, because of a missing implementation of the KSTK_ESP
    macro, which always returned -1.

    The new implementation now returns the right value.

    Signed-off-by: Stefani Seibold
    Cc: Americo Wang
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Cc: Andrew Morton
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Stefani Seibold
     
  • Invalidate the target's attributes, which may have changed (such as
    nlink, change time) so that they are refreshed on the next getattr().

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • Looks like another victim of the confusing kmap() vs kmap_atomic() API
    differences.

    Reported-by: Todor Gyumyushev
    Signed-off-by: Jens Axboe
    Signed-off-by: Miklos Szeredi
    Cc: Tejun Heo
    Cc: stable@kernel.org

    Jens Axboe
     
  • fuse_direct_io() has a loop where requests are allocated in each
    iteration. If allocation fails, the loop is exited and falls through
    to an unconditional fuse_put_request() on that invalid pointer.

    Signed-off-by: Anand V. Avati
    Signed-off-by: Miklos Szeredi
    Cc: stable@kernel.org

    Anand V. Avati
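
A userspace sketch of the control flow in question, under loud assumptions: get_req(), put_req() and direct_io_sketch() are hypothetical, and a NULL return stands in for whatever invalid pointer the real allocator hands back on failure. The guard after the loop is the point of the example.

```c
#include <stdlib.h>

/* Hypothetical request object. */
struct req_sketch {
    int id;
};

/* Hypothetical allocator that fails on iteration 'fail_at'. */
struct req_sketch *get_req(int i, int fail_at)
{
    struct req_sketch *r;

    if (i == fail_at)
        return NULL;
    r = malloc(sizeof(*r));
    if (r)
        r->id = i;
    return r;
}

/* Hypothetical release; it dereferences r, so r must be valid. */
void put_req(struct req_sketch *r)
{
    r->id = -1;
    free(r);
}

/* Returns the number of iterations that obtained a request. */
int direct_io_sketch(int iterations, int fail_at)
{
    struct req_sketch *req = NULL;
    int done = 0;

    for (int i = 0; i < iterations; i++) {
        if (req)
            put_req(req);       /* release the previous request */
        req = get_req(i, fail_at);
        if (!req)
            break;              /* allocation failed, leave the loop */
        done++;
    }
    if (req)                    /* only release a valid request */
        put_req(req);
    return done;
}
```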
     
  • * 'for-linus' of git://git.kernel.dk/linux-2.6-block:
    cfq-iosched: limit coop preemption
    cfq-iosched: fix bad return value cfq_should_preempt()
    backing-dev: bdi sb prune should be in the unregister path, not destroy
    Fix bio_alloc() and bio_kmalloc() documentation
    bio_put(): add bio_clone() to the list of functions in the comment

    Linus Torvalds
     
  • The ext4_debug() call in ext4_end_io_dio() should be moved after the
    check to make sure that io_end is non-NULL.

    The comment above ext4_get_block_dio_write() ("Maximum number of
    blocks...") is a duplicate; the original and correct comment is above
    the #define DIO_MAX_BLOCKS up above.

    Based on review comments from Curt Wohlgemuth.

    Signed-off-by: Mingming Cao
    Signed-off-by: "Theodore Ts'o"

    Mingming