22 Sep, 2011

1 commit

  • * 'for-linus' of git://git.kernel.dk/linux-block:
    floppy: use del_timer_sync() in init cleanup
    blk-cgroup: be able to remove the record of unplugged device
    block: Don't check QUEUE_FLAG_SAME_COMP in __blk_complete_request
    mm: Add comment explaining task state setting in bdi_forker_thread()
    mm: Cleanup clearing of BDI_pending bit in bdi_forker_thread()
    block: simplify force plug flush code a little bit
    block: change force plug flush call order
    block: Fix queue_flag update when rq_affinity goes from 2 to 1
    block: separate priority boosting from REQ_META
    block: remove READ_META and WRITE_META
    xen-blkback: fixed indentation and comments
    xen-blkback: Don't disconnect backend until state switched to XenbusStateClosed.

    Linus Torvalds
     

31 Aug, 2011

1 commit

  • The i_mutex lock and flush_completed_IO() added by commit 2581fdc810
    in ext4_evict_inode() cause lockdep to complain about potential
    deadlock in several places. In most/all of these LOCKDEP complaints
    it looks like it's a false positive, since many of the potential
    circular locking cases can't take place by the time the
    ext4_evict_inode() is called; but since at the very least it may mask
    real problems, we need to address this.

    This change removes the flush_completed_IO() and i_mutex lock in
    ext4_evict_inode(). Instead, we take a different approach to resolve
    the software lockup that commit 2581fdc810 intends to fix. Rather
    than having the ext4-dio-unwritten thread wait to grab the i_mutex
    lock of an inode, we use mutex_trylock() instead, and simply requeue
    the work item if we fail to grab the inode's i_mutex lock.

    This should speed up work queue processing in general and also
    prevent the following deadlock scenario: During page fault,
    shrink_icache_memory is called that in turn evicts another inode B.
    Inode B has some pending io_end work so it calls ext4_ioend_wait()
    that waits for inode B's i_ioend_count to become zero. However, inode
    B's ioend work was queued behind some of inode A's ioend work on the
    same cpu's ext4-dio-unwritten workqueue. As the ext4-dio-unwritten
    thread on that cpu is processing inode A's ioend work, it tries to
    grab inode A's i_mutex lock. Since the i_mutex lock of inode A has
    been held since before the page fault happened, we enter a deadlock.

    Signed-off-by: Jiaying Zhang
    Signed-off-by: "Theodore Ts'o"

    Jiaying Zhang
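
The trylock-and-requeue approach described in this commit can be illustrated with a minimal user-space sketch. It is not the ext4 code: `fake_inode`, `fake_trylock()`, `fake_unlock()` and `run_end_io_pass()` are hypothetical names, and a C11 `atomic_flag` stands in for the kernel's i_mutex and mutex_trylock(). The point is only that a worker which cannot take the lock keeps the item for a later pass instead of blocking everything queued behind it.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an inode with pending end_io work. */
struct fake_inode {
    atomic_flag i_mutex;   /* stand-in for the inode's i_mutex */
    int converted;         /* how many times its work item was processed */
};

/* Returns true if the lock was taken, false if someone else holds it. */
static bool fake_trylock(struct fake_inode *inode)
{
    return !atomic_flag_test_and_set(&inode->i_mutex);
}

static void fake_unlock(struct fake_inode *inode)
{
    atomic_flag_clear(&inode->i_mutex);
}

/*
 * One pass over the pending work items. An item whose lock is busy is
 * not waited for; it is kept (compacted to the front of the array) so a
 * later pass can retry it. Returns the number of items still pending.
 */
static size_t run_end_io_pass(struct fake_inode **queue, size_t n)
{
    size_t pending = 0;

    for (size_t i = 0; i < n; i++) {
        struct fake_inode *inode = queue[i];

        if (!fake_trylock(inode)) {
            queue[pending++] = inode;   /* requeue instead of blocking */
            continue;
        }
        inode->converted++;             /* the actual end_io work would go here */
        fake_unlock(inode);
    }
    return pending;
}
```

With inode A locked by another context and inode B queued behind it, one pass still completes B's work, which is the property the commit message relies on to break the deadlock.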
     

23 Aug, 2011

2 commits


21 Aug, 2011

1 commit

  • * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    ext4: flush any pending end_io requests before DIO reads w/dioread_nolock
    ext4: fix nomblk_io_submit option so it correctly converts uninit blocks
    ext4: Resolve the hang of direct i/o read in handling EXT4_IO_END_UNWRITTEN.
    ext4: call ext4_ioend_wait and ext4_flush_completed_IO in ext4_evict_inode
    ext4: Fix ext4_should_writeback_data() for no-journal mode

    Linus Torvalds
     

20 Aug, 2011

1 commit

  • There is a race between ext4 buffer write and direct_IO read with
    dioread_nolock mount option enabled. The problem is that we clear
    PageWriteback flag during end_io time but will do
    uninitialized-to-initialized extent conversion later with dioread_nolock.
    If an O_DIRECT read request comes in during this period, ext4 will return
    zero instead of the recently written data.

    This patch checks whether there are any pending uninitialized-to-initialized
    extent conversion requests before doing an O_DIRECT read to close the race.
    Note that this is just a bandaid fix. The fundamental issue is that we
    clear PageWriteback flag before we really complete an IO, which is
    problem-prone. To fix the fundamental issue, we may need to implement an
    extent tree cache that we can use to look up pending to-be-converted extents.

    Signed-off-by: Jiaying Zhang
    Signed-off-by: "Theodore Ts'o"
    Cc: stable@kernel.org

    Jiaying Zhang
     

14 Aug, 2011

3 commits

  • Bug discovered by Jan Kara:

    Finally, commit 1449032be17abb69116dbc393f67ceb8bd034f92 returned back
    the old IO submission code but apparently it forgot to return the old
    handling of uninitialized buffers so we unconditionally call
    block_write_full_page() without specifying end_io function. So AFAICS
    we never convert unwritten extents to written in some cases. For
    example when I mount the fs as:

    mount -t ext4 -o nomblk_io_submit,dioread_nolock /dev/ubdb /mnt

    and do

    int fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0600);
    char buf[1024];
    memset(buf, 'a', sizeof(buf));
    fallocate(fd, 0, 0, 16384);
    write(fd, buf, sizeof(buf));

    I get a file full of zeros (after remounting the filesystem so that
    pagecache is dropped) instead of seeing the first KB contain 'a's.

    Signed-off-by: "Theodore Ts'o"
    Cc: stable@kernel.org

    Theodore Ts'o
     
  • Setting the EXT4_IO_END_UNWRITTEN flag and increasing i_aiodio_unwritten
    should be done simultaneously, since ext4_end_io_nolock always clears
    the flag and decreases the counter at the same time.

    We don't increase i_aiodio_unwritten when setting
    EXT4_IO_END_UNWRITTEN, so it will go negative and cause some processes
    to wait forever.

    Part of the patch came from Eric in his e-mail, but it doesn't actually
    fix the problem Michael encountered.

    http://marc.info/?l=linux-ext4&m=131316851417460&w=2

    Reported-and-Tested-by: Michael Tokarev
    Signed-off-by: Eric Sandeen
    Signed-off-by: Tao Ma
    Signed-off-by: "Theodore Ts'o"
    Cc: stable@kernel.org

    Tao Ma
     
  • Flush inode's i_completed_io_list before calling ext4_ioend_wait() to
    prevent the following deadlock scenario: A page fault happens while
    some process is writing inode A. During page fault,
    shrink_icache_memory is called that in turn evicts another inode
    B. Inode B has some pending io_end work so it calls ext4_ioend_wait()
    that waits for inode B's i_ioend_count to become zero. However, inode
    B's ioend work was queued behind some of inode A's ioend work on the
    same cpu's ext4-dio-unwritten workqueue. As the ext4-dio-unwritten
    thread on that cpu is processing inode A's ioend work, it tries to
    grab inode A's i_mutex lock. Since the i_mutex lock of inode A has
    been held since before the page fault happened, we enter a deadlock.

    Also moves ext4_flush_completed_IO and ext4_ioend_wait from
    ext4_destroy_inode() to ext4_evict_inode(). During inode deletion,
    ext4_evict_inode() is called before ext4_destroy_inode() and in
    ext4_evict_inode(), we may call ext4_truncate() without holding
    i_mutex lock. As a result, there is a race between flush_completed_IO
    that is called from ext4_ext_truncate() and ext4_end_io_work, which
    may cause corruption on an io_end structure. This change moves
    ext4_flush_completed_IO and ext4_ioend_wait from ext4_destroy_inode()
    to ext4_evict_inode() to resolve the race between ext4_truncate() and
    ext4_end_io_work during inode deletion.

    Signed-off-by: Jiaying Zhang
    Signed-off-by: "Theodore Ts'o"
    Cc: stable@kernel.org

    Jiaying Zhang
     

13 Aug, 2011

1 commit

  • ext4_should_writeback_data() had an incorrect sequence of
    tests to determine if it should return 0 or 1: in
    particular, even in no-journal mode, 0 was being returned
    for a non-regular-file inode.

    This meant that, in non-journal mode, we would use
    ext4_journalled_aops for directories, symlinks, and other
    non-regular files. However, calling journalled aop
    callbacks when there is no valid handle can cause problems.

    This would cause a kernel crash with Jan Kara's commit
    2d859db3e4 ("ext4: fix data corruption in inodes with
    journalled data"), because we now dereference 'handle' in
    ext4_journalled_write_end().

    I also added BUG_ONs to check for a valid handle in the
    obviously journal-only aops callbacks.

    I tested this running xfstests with a scratch device in
    these modes:

    - no-journal
    - data=ordered
    - data=writeback
    - data=journal

    All work fine; the data=journal run has many failures and a
    crash in xfstests 074, but this is no different from a
    vanilla kernel.

    Signed-off-by: Curt Wohlgemuth
    Signed-off-by: "Theodore Ts'o"
    Cc: stable@kernel.org

    Curt Wohlgemuth
     

12 Aug, 2011

1 commit

  • Commit df5e6223407e ("ext4: fix deadlock in ext4_symlink() in ENOSPC
    conditions") recalculated the number of credits needed for a long
    symlink, in the process of splitting it into two transactions. However,
    the first credit calculation under-counted because if selinux is
    enabled, credits are needed to create the selinux xattr as well.

    Overrunning the reservation will result in an OOPS in
    jbd2_journal_dirty_metadata() due to this assert:

    J_ASSERT_JH(jh, handle->h_buffer_credits > 0);

    Fix this by increasing the reservation size.

    Signed-off-by: Eric Sandeen
    Reviewed-by: Jan Kara
    Acked-by: "Theodore Ts'o"
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Eric Sandeen
     

04 Aug, 2011

1 commit

  • Commit 9933fc0i (ext4: introduce ext4_kvmalloc(), ext4_kzalloc(), and
    ext4_kvfree()) introduced wrappers around k*alloc/vmalloc but introduced
    a typo for ext4_kzalloc() by not using kzalloc() but kmalloc().

    Signed-off-by: Mathias Krause
    Signed-off-by: "Theodore Ts'o"

    Mathias Krause
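
To make the difference concrete: a wrapper whose name promises zeroed memory must actually zero it. The following is a user-space sketch only; `fake_kzalloc()` is a hypothetical name, and malloc() plus memset() stand in for the kernel allocators, which take additional flags not modelled here.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical user-space analogue of a zeroing allocation wrapper.
 * Returning malloc(size) directly would reproduce the shape of the typo:
 * callers would receive uninitialized memory while assuming it is zeroed.
 */
static void *fake_kzalloc(size_t size)
{
    void *p = malloc(size);

    if (p)
        memset(p, 0, size);   /* the step the typo effectively dropped */
    return p;
}
```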
     

02 Aug, 2011

4 commits

  • * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (60 commits)
    ext4: prevent memory leaks from ext4_mb_init_backend() on error path
    ext4: use EXT4_BAD_INO for buddy cache to avoid colliding with valid inode #
    ext4: use ext4_msg() instead of printk in mballoc
    ext4: use ext4_kvzalloc()/ext4_kvmalloc() for s_group_desc and s_group_info
    ext4: introduce ext4_kvmalloc(), ext4_kzalloc(), and ext4_kvfree()
    ext4: use the correct error exit path in ext4_init_inode_table()
    ext4: add missing kfree() on error return path in add_new_gdb()
    ext4: change umode_t in tracepoint headers to be an explicit __u16
    ext4: fix races in ext4_sync_parent()
    ext4: Fix overflow caused by missing cast in ext4_fallocate()
    ext4: add action of moving index in ext4_ext_rm_idx for Punch Hole
    ext4: simplify parameters of reserve_backup_gdb()
    ext4: simplify parameters of add_new_gdb()
    ext4: remove lock_buffer in bclean() and setup_new_group_blocks()
    ext4: simplify journal handling in setup_new_group_blocks()
    ext4: let setup_new_group_blocks() set multiple bits at a time
    ext4: fix a typo in ext4_group_extend()
    ext4: let ext4_group_add_blocks() handle 0 blocks quickly
    ext4: let ext4_group_add_blocks() return an error code
    ext4: rename ext4_add_groupblocks() to ext4_group_add_blocks()
    ...

    Fix up conflict in fs/ext4/inode.c: commit aacfc19c626e ("fs: simplify
    the blockdev_direct_IO prototype") had changed the ext4_ind_direct_IO()
    function for the new simplified calling convention, while commit
    dae1e52cb126 ("ext4: move ext4_ind_* functions from inode.c to
    indirect.c") moved the function to another file.

    Linus Torvalds
     
  • In ext4_mb_init(), if the s_locality_group allocation fails it will
    currently cause the allocations made in ext4_mb_init_backend() to
    be leaked. Moving the ext4_mb_init_backend() allocation after the
    s_locality_group allocation avoids that problem.

    Signed-off-by: Yu Jian
    Signed-off-by: Andreas Dilger
    Signed-off-by: "Theodore Ts'o"

    Yu Jian
     
  • Signed-off-by: Yu Jian
    Signed-off-by: Andreas Dilger
    Signed-off-by: "Theodore Ts'o"

    Yu Jian
     
  • Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     

01 Aug, 2011

5 commits


31 Jul, 2011

2 commits

  • We added some more error handling in b40971426a "ext4: add error
    checking to calls to ext4_handle_dirty_metadata()". But we need to
    call kfree() as well to avoid a memory leak.

    Signed-off-by: Dan Carpenter
    Signed-off-by: "Theodore Ts'o"

    Dan Carpenter
     
  • Fix problems if fsync() races against a rename of a parent directory
    as pointed out by Al Viro in his own inimitable way:

    >While we are at it, could somebody please explain what the hell is ext4
    >doing in
    >static int ext4_sync_parent(struct inode *inode)
    >{
    > struct writeback_control wbc;
    > struct dentry *dentry = NULL;
    > int ret = 0;
    >
    > while (inode && ext4_test_inode_state(inode, EXT4_STATE_NEWENTRY)) {
    > ext4_clear_inode_state(inode, EXT4_STATE_NEWENTRY);
    > dentry = list_entry(inode->i_dentry.next,
    > struct dentry, d_alias);
    > if (!dentry || !dentry->d_parent || !dentry->d_parent->d_inode)
    > break;
    > inode = dentry->d_parent->d_inode;
    > ret = sync_mapping_buffers(inode->i_mapping);
    > ...
    >Note that dentry obviously can't be NULL there. dentry->d_parent is never
    >NULL. And dentry->d_parent would better not be negative, for crying out
    >loud! What's worse, there's no guarantees that dentry->d_parent will
    >remain our parent over that sync_mapping_buffers() *and* that inode won't
    >just be freed under us (after rename() and memory pressure leading to
    >eviction of what used to be our dentry->d_parent)......

    Reported-by: Al Viro
    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     

28 Jul, 2011

5 commits

  • The logical block number in map.l_blk is a __u32, and so before we
    shift it left by the block size, we need to cast it to a 64-bit type.

    Otherwise i_size can be corrupted on an ENOSPC.

    # df -T /mnt/mp1
    Filesystem Type 1K-blocks Used Available Use% Mounted on
    /dev/sda6 ext4 9843276 153056 9190200 2% /mnt/mp1
    # fallocate -o 0 -l 2199023251456 /mnt/mp1/testfile
    fallocate: /mnt/mp1/testfile: fallocate failed: No space left on device
    # stat /mnt/mp1/testfile
    File: `/mnt/mp1/testfile'
    Size: 4293656576 Blocks: 19380440 IO Block: 4096 regular file
    Device: 806h/2054d Inode: 12 Links: 1
    Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
    Access: 2011-07-25 13:01:31.414490496 +0900
    Modify: 2011-07-25 13:01:31.414490496 +0900
    Change: 2011-07-25 13:01:31.454490495 +0900

    Signed-off-by: Utako Kusaka
    Signed-off-by: "Theodore Ts'o"
    --
    fs/ext4/extents.c | 2 +-
    1 files changed, 1 insertions(+), 1 deletions(-)

    Utako Kusaka
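
The overflow can be reproduced in plain C. In this sketch the function names and the block number 3145408 are hypothetical, chosen for illustration; a 4 KiB block size (shift of 12) is assumed, and `unsigned int` is assumed to be 32 bits wide, as on common Linux targets. Without the cast the shift is evaluated in 32 bits and the high bits are discarded before the result is widened.

```c
#include <stdint.h>

/* Shift evaluated in 32 bits: the result wraps modulo 2^32 before widening. */
static uint64_t end_offset_no_cast(uint32_t lblk, unsigned int blkbits)
{
    return lblk << blkbits;
}

/* Widen first, then shift: the full 64-bit byte offset is preserved. */
static uint64_t end_offset_with_cast(uint32_t lblk, unsigned int blkbits)
{
    return (uint64_t)lblk << blkbits;
}
```

For the hypothetical block 3145408 and a shift of 12, the uncast version yields 4293656576 while the cast version yields 12883591168.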
     
  • The old function ext4_ext_rm_idx is used only for the truncate case
    because it just removes the last index in the extent index block. When
    punching a hole, we usually need to remove a "middle" index, so we must
    move the indexes that follow it forward.

    (I created a file with a depth-1 extent tree and punched a hole in the
    middle of it; the last index in the index block was strangely gone,
    which is how I found this bug.)

    Signed-off-by: Robin Dong
    Signed-off-by: "Theodore Ts'o"

    Robin Dong
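
Removing a middle entry from a packed array means sliding the later entries forward by one slot. A minimal sketch of that step follows; `fake_idx` and `remove_index()` are hypothetical names and the struct layout is simplified, so this is not the on-disk ext4 index format.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified, hypothetical stand-in for an extent index entry. */
struct fake_idx {
    unsigned int ei_block;        /* first logical block covered */
    unsigned long long ei_leaf;   /* block number of the next level */
};

/*
 * Remove the entry at position pos from an array of count entries.
 * If pos is not the last entry, the entries after it are moved forward
 * so the array stays packed. Returns the new entry count.
 */
static size_t remove_index(struct fake_idx *idx, size_t count, size_t pos)
{
    assert(pos < count);

    if (pos + 1 < count)
        memmove(&idx[pos], &idx[pos + 1],
                (count - pos - 1) * sizeof(idx[0]));
    return count - 1;
}
```

Simply decrementing the count, which is enough when the last entry is removed, would drop the wrong entry in the middle-removal case described above.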
     
  • The reserve_backup_gdb() function only needs the block group number;
    there's no need to pass a pointer to struct ext4_new_group_data to it.

    Signed-off-by: Yongqiang Yang

    Yongqiang Yang
     
  • add_new_gdb() only needs the block group number; there is no need to
    pass a pointer to struct ext4_new_group_data to add_new_gdb().
    Instead of filling in a pointer to the struct buffer_head in
    add_new_gdb(), it's simpler to have the caller fetch it from the
    s_group_desc[] array.

    [Fixed error path to handle the case where struct buffer_head *primary
    hasn't been set yet. -- Ted]

    Signed-off-by: Yongqiang Yang
    Signed-off-by: "Theodore Ts'o"

    Yongqiang Yang
     
  • There is no need to lock the buffers since no one else should be
    touching these buffers besides the file system.

    Signed-off-by: Yongqiang Yang
    Signed-off-by: "Theodore Ts'o"

    Yongqiang Yang
     

27 Jul, 2011

9 commits

  • This patch simplifies journal handling in setup_new_group_blocks().

    In the previous code, the block bitmap is modified everywhere in
    setup_new_group_blocks(), and ext4_get_write_access() in
    extend_or_restart_transaction() is used to guarantee that the block
    bitmap stays in the new handle. This makes things complicated.

    The previous commit changed things so that the modifications on the
    block bitmap are batched and done by ext4_set_bits() at the end of the
    for loop. This allows us to simplify things.

    Signed-off-by: Yongqiang Yang
    Signed-off-by: "Theodore Ts'o"

    Yongqiang Yang
     
  • Rename mb_set_bits() to ext4_set_bits() and make it a global function
    so that setup_new_group_blocks() can use it.

    Signed-off-by: Yongqiang Yang
    Signed-off-by: "Theodore Ts'o"

    Yongqiang Yang
     
  • Signed-off-by: Yongqiang Yang
    Signed-off-by: "Theodore Ts'o"

    Yongqiang Yang
     
  • If ext4_group_add_blocks() is called with 0 blocks, make it return 0
    without doing any extra work.

    Signed-off-by: Yongqiang Yang
    Signed-off-by: "Theodore Ts'o"

    Yongqiang Yang
     
  • This patch lets ext4_group_add_blocks() return an error code if it
    fails, so that upper functions can handle the error correctly.

    Signed-off-by: Yongqiang Yang
    Signed-off-by: "Theodore Ts'o"

    Yongqiang Yang
     
  • Signed-off-by: Yongqiang Yang
    Signed-off-by: "Theodore Ts'o"

    Yongqiang Yang
     
  • A filesystem with errors must not be resized; otherwise,
    it is easy to destroy the filesystem.

    Signed-off-by: Yongqiang Yang
    Signed-off-by: "Theodore Ts'o"

    Yongqiang Yang
     
  • Before this patch, parallel resizers were allowed and protected by a
    mutex lock. There is actually no need to support parallel resizers, so
    this patch prevents them with atomic bit ops, like
    lock_page() and unlock_page() do.

    To do this, the patch removes the mutex lock s_resize_lock from struct
    ext4_sb_info and adds an unsigned long field named s_resize_flags,
    which indicates whether there is a resizer.

    Signed-off-by: Yongqiang Yang
    Signed-off-by: "Theodore Ts'o"

    Yongqiang Yang
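
The "one resizer at a time" rule can be sketched with an atomic test-and-set on a flag word. This is a user-space illustration using C11 atomics, not the kernel bit ops; `fake_sb_info`, `FAKE_RESIZING_BIT`, `fake_resize_begin()` and `fake_resize_end()` are hypothetical names, and returning -EBUSY to the second caller is an assumption about the desired behaviour.

```c
#include <errno.h>
#include <stdatomic.h>

#define FAKE_RESIZING_BIT 0   /* hypothetical bit number in s_resize_flags */

/* Hypothetical stand-in for the superblock info holding the flag word. */
struct fake_sb_info {
    atomic_ulong s_resize_flags;
};

/*
 * Atomically set the "resizing" bit. If it was already set, another
 * resizer is active and the caller is refused rather than made to wait.
 */
static int fake_resize_begin(struct fake_sb_info *sbi)
{
    unsigned long mask = 1UL << FAKE_RESIZING_BIT;
    unsigned long old = atomic_fetch_or(&sbi->s_resize_flags, mask);

    return (old & mask) ? -EBUSY : 0;
}

/* Clear the bit so that a later resize can start. */
static void fake_resize_end(struct fake_sb_info *sbi)
{
    atomic_fetch_and(&sbi->s_resize_flags, ~(1UL << FAKE_RESIZING_BIT));
}
```

Compared with a mutex, a second resizer fails immediately instead of sleeping until the first one finishes.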
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback: (27 commits)
    mm: properly reflect task dirty limits in dirty_exceeded logic
    writeback: don't busy retry writeback on new/freeing inodes
    writeback: scale IO chunk size up to half device bandwidth
    writeback: trace global_dirty_state
    writeback: introduce max-pause and pass-good dirty limits
    writeback: introduce smoothed global dirty limit
    writeback: consolidate variable names in balance_dirty_pages()
    writeback: show bdi write bandwidth in debugfs
    writeback: bdi write bandwidth estimation
    writeback: account per-bdi accumulated written pages
    writeback: make writeback_control.nr_to_write straight
    writeback: skip tmpfs early in balance_dirty_pages_ratelimited_nr()
    writeback: trace event writeback_queue_io
    writeback: trace event writeback_single_inode
    writeback: remove .nonblocking and .encountered_congestion
    writeback: remove writeback_control.more_io
    writeback: skip balance_dirty_pages() for in-memory fs
    writeback: add bdi_dirty_limit() kernel-doc
    writeback: avoid extra sync work at enqueue time
    writeback: elevate queue_io() into wb_writeback()
    ...

    Fix up trivial conflicts in fs/fs-writeback.c and mm/filemap.c

    Linus Torvalds
     

26 Jul, 2011

3 commits

  • When journalling data for an inode (either because it is a symlink or
    because the filesystem is mounted in data=journal mode), ext4_evict_inode()
    can discard unwritten data by calling truncate_inode_pages(). This is
    because we don't mark the buffer / page dirty when journalling data but only
    add the buffer to the running transaction and thus mm does not know there
    is still unwritten data.

    Fix the problem by carefully tracking transaction containing inode's data,
    committing this transaction, and writing uncheckpointed buffers when inode
    should be reaped.

    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • Replace the ->check_acl method with a ->get_acl method that simply reads an
    ACL from disk after having a cache miss. This means we can replace the ACL
    checking boilerplate code with a single implementation in namei.c.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • new helper: posix_acl_create(&acl, gfp, mode_p). Replaces acl with
    modified clone, on failure releases acl and replaces with NULL.
    Returns 0 or -ve on error. All callers of posix_acl_create_masq()
    switched.

    Signed-off-by: Al Viro

    Al Viro