04 Jul, 2013

1 commit

  • Pull second set of VFS changes from Al Viro:
    "Assorted f_pos race fixes, making do_splice_direct() safe to call with
    i_mutex on parent, O_TMPFILE support, Jeff's locks.c series,
    ->d_hash/->d_compare calling conventions changes from Linus, misc
    stuff all over the place."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
    Document ->tmpfile()
    ext4: ->tmpfile() support
    vfs: export lseek_execute() to modules
    lseek_execute() doesn't need an inode passed to it
    block_dev: switch to fixed_size_llseek()
    cpqphp_sysfs: switch to fixed_size_llseek()
    tile-srom: switch to fixed_size_llseek()
    proc_powerpc: switch to fixed_size_llseek()
    ubi/cdev: switch to fixed_size_llseek()
    pci/proc: switch to fixed_size_llseek()
    isapnp: switch to fixed_size_llseek()
    lpfc: switch to fixed_size_llseek()
    locks: give the blocked_hash its own spinlock
    locks: add a new "lm_owner_key" lock operation
    locks: turn the blocked_list into a hashtable
    locks: convert fl_link to a hlist_node
    locks: avoid taking global lock if possible when waking up blocked waiters
    locks: protect most of the file_lock handling with i_lock
    locks: encapsulate the fl_link list handling
    locks: make "added" in __posix_lock_file a bool
    ...

    Linus Torvalds
     

03 Jul, 2013

3 commits

  • very similar to ext3 counterpart...

    Signed-off-by: Al Viro

    Al Viro
     
  • For those file systems (btrfs/ext4/ocfs2/tmpfs) that support
    SEEK_DATA/SEEK_HOLE functions, we end up handling a similar
    matter in lseek_execute() to update the current file offset
    to the desired offset if it is valid; ceph also does a
    similar thing in ceph_llseek().

    To reduce the duplication, this patch makes lseek_execute()
    publicly accessible so that we can call it directly from the
    underlying file systems.

    Thanks to Dave Chinner for this suggestion.

    [AV: call it vfs_setpos(), don't bring the removed 'inode' argument back]

    v2->v1:
    - Add kernel-doc comments for lseek_execute()
    - Call lseek_execute() in ceph->llseek()

    Signed-off-by: Jie Liu
    Cc: Dave Chinner
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrew Morton
    Cc: Christoph Hellwig
    Cc: Chris Mason
    Cc: Josef Bacik
    Cc: Ben Myers
    Cc: Ted Tso
    Cc: Hugh Dickins
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Sage Weil
    Signed-off-by: Al Viro

    Jie Liu
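
    The check-then-update step described above can be sketched as a small
    user-space model. The struct and function names here (file_model,
    setpos_model) are hypothetical stand-ins, not the kernel's definitions;
    only the shape of the logic (reject an invalid offset, otherwise store
    it as the current position) follows the commit text.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the few struct file fields the helper touches. */
struct file_model {
    int64_t  pos;              /* models file->f_pos */
    uint64_t version;          /* models file->f_version */
    int      unsigned_offsets; /* nonzero if negative offsets are legal */
};

/*
 * Validate the offset and, if it is acceptable, make it the current
 * position. Returns the new position or a negative errno value.
 */
static int64_t setpos_model(struct file_model *f, int64_t offset,
                            int64_t maxsize)
{
    assert(f != NULL);
    if (offset < 0 && !f->unsigned_offsets)
        return -EINVAL;
    if (offset > maxsize)
        return -EINVAL;
    if (offset != f->pos) {
        f->pos = offset;
        f->version = 0; /* position changed, so cached state is stale */
    }
    return offset;
}
```

    A file system's llseek method would compute the target offset for
    SEEK_DATA or SEEK_HOLE itself and then hand it to one shared helper of
    this shape instead of open-coding the validation.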
     
  • Pull ext4 update from Ted Ts'o:
    "Lots of bug fixes, cleanups and optimizations. In the bug fixes
    category, of note is a fix for on-line resizing file systems where the
    block size is smaller than the page size (i.e., file systems with 1k blocks
    on x86, or more interestingly file systems with 4k blocks on Power or
    ia64 systems.)

    In the cleanup category, the ext4's punch hole implementation was
    significantly improved by Lukas Czerner, and now supports bigalloc
    file systems. In addition, Jan Kara significantly cleaned up the
    write submission code path. We also improved error checking and added
    a few sanity checks.

    In the optimizations category, two major optimizations deserve
    mention. The first is that ext4_writepages() is now used for
    nodelalloc and ext3 compatibility mode. This allows writes to be
    submitted much more efficiently as a single bio request, instead of
    being sent as individual 4k writes into the block layer (which then
    relied on the elevator code to coalesce the requests in the block
    queue). Secondly, the extent cache shrink mechanism, which was
    introduced in 3.9, no longer has a scalability bottleneck caused by the
    i_es_lru spinlock. Other optimizations include some changes to reduce
    CPU usage and to avoid issuing empty commits unnecessarily."

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (86 commits)
    ext4: optimize starting extent in ext4_ext_rm_leaf()
    jbd2: invalidate handle if jbd2_journal_restart() fails
    ext4: translate flag bits to strings in tracepoints
    ext4: fix up error handling for mpage_map_and_submit_extent()
    jbd2: fix theoretical race in jbd2__journal_restart
    ext4: only zero partial blocks in ext4_zero_partial_blocks()
    ext4: check error return from ext4_write_inline_data_end()
    ext4: delete unnecessary C statements
    ext3,ext4: don't mess with dir_file->f_pos in htree_dirblock_to_tree()
    jbd2: move superblock checksum calculation to jbd2_write_superblock()
    ext4: pass inode pointer instead of file pointer to punch hole
    ext4: improve free space calculation for inline_data
    ext4: reduce object size when !CONFIG_PRINTK
    ext4: improve extent cache shrink mechanism to avoid to burn CPU time
    ext4: implement error handling of ext4_mb_new_preallocation()
    ext4: fix corruption when online resizing a fs with 1K block size
    ext4: delete unused variables
    ext4: return FIEMAP_EXTENT_UNKNOWN for delalloc extents
    jbd2: remove debug dependency on debug_fs and update Kconfig help text
    jbd2: use a single printk for jbd_debug()
    ...

    Linus Torvalds
     

01 Jul, 2013

13 commits

  • Both hole punch and truncate use ext4_ext_rm_leaf() for removing
    blocks. Currently we choose the last extent as the starting
    point for removing blocks:

    ex = EXT_LAST_EXTENT(eh);

    This is OK for truncate, but for hole punch we can optimize the extent
    selection as the path is already initialized. We can use this
    information to select the proper starting extent. The code change in
    this patch will not affect truncate, as for truncate path[depth].p_ext
    will always be NULL.

    Signed-off-by: Ashish Sangwan
    Signed-off-by: Namjae Jeon
    Signed-off-by: "Theodore Ts'o"

    Ashish Sangwan
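
    A minimal sketch of the selection rule, assuming a simplified leaf
    made of an array of extents. The names extent_model and
    pick_start_extent are hypothetical; the hint argument plays the role
    of path[depth].p_ext and the fallback plays the role of
    EXT_LAST_EXTENT(eh).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified extent: a run of len blocks at first_block. */
struct extent_model {
    unsigned first_block;
    unsigned len;
};

/*
 * Choose where block removal starts. A non-NULL hint (hole punch) is used
 * directly; a NULL hint (truncate) falls back to the last extent.
 */
static const struct extent_model *
pick_start_extent(const struct extent_model *leaf, size_t count,
                  const struct extent_model *hint)
{
    assert(leaf != NULL && count > 0);
    if (hint != NULL)
        return hint;
    return &leaf[count - 1];
}
```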
     
  • Translate the bitfields used in various flags argument to strings to
    make the tracepoint output more human-readable.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
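
    The idea of turning a flags word into readable text can be sketched in
    user space as below. The kernel tracepoint machinery has its own
    helpers for this; the table, flag values, and function name here are
    hypothetical and only illustrate the bit-to-name mapping.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct flag_name {
    unsigned    bit;
    const char *name;
};

/* Hypothetical flag set for illustration, not ext4's real values. */
static const struct flag_name demo_flags[] = {
    { 0x1, "CREATE" },
    { 0x2, "UNINIT" },
    { 0x4, "DELALLOC" },
};

/* Write the names of all set bits into out, separated by '|'. */
static size_t flags_to_string(unsigned flags, char *out, size_t outlen)
{
    size_t i, used = 0;

    assert(out != NULL && outlen > 0);
    out[0] = '\0';
    for (i = 0; i < sizeof(demo_flags) / sizeof(demo_flags[0]); i++) {
        int n;

        if (!(flags & demo_flags[i].bit))
            continue;
        n = snprintf(out + used, outlen - used, "%s%s",
                     used ? "|" : "", demo_flags[i].name);
        if (n < 0 || (size_t)n >= outlen - used) {
            out[used] = '\0'; /* drop the partially written name */
            break;
        }
        used += (size_t)n;
    }
    return used;
}
```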
     
  • The function mpage_released_unused_page() must only be called once;
    otherwise the kernel will BUG() when the second call to
    mpage_released_unused_page() tries to unlock the pages which had been
    unlocked by the first call.

    Also restructure the error handling so that we only give up on writing
    the dirty pages in the case of ENOSPC where retrying the allocation
    won't help. Otherwise, a transient failure, such as a kmalloc()
    failure in calling ext4_map_blocks() might cause us to give up on
    those pages, leading to a scary message in /var/log/messages plus data
    loss.

    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Jan Kara

    Theodore Ts'o
     
  • Currently, if we pass a range into ext4_zero_partial_blocks() which
    covers an entire block, we would attempt to zero it even though we
    should only zero the unaligned part of a block.

    Fix this by checking whether the range covers the whole block and
    skipping the zeroing if so.

    Signed-off-by: Lukas Czerner
    Signed-off-by: "Theodore Ts'o"

    Lukas Czerner
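
    The arithmetic behind "only zero the unaligned part" can be sketched
    as follows. This is a stand-alone model, not the ext4 function: the
    names zero_plan and plan_partial_zero are hypothetical, and the byte
    range is taken as [offset, offset + length).

```c
#include <assert.h>
#include <stdint.h>

/* Byte ranges that still need zeroing; a length of 0 means "nothing". */
struct zero_plan {
    uint64_t head_off, head_len; /* partial block at the start */
    uint64_t tail_off, tail_len; /* partial block at the end */
};

static struct zero_plan plan_partial_zero(uint64_t offset, uint64_t length,
                                          uint64_t blocksize)
{
    struct zero_plan p = { 0, 0, 0, 0 };
    uint64_t end, head_part, tail_part;

    assert(blocksize > 0);
    if (length == 0)
        return p;

    end = offset + length;          /* exclusive end of the range */
    head_part = offset % blocksize; /* bytes before offset in its block */
    tail_part = end % blocksize;    /* bytes of the last block covered */

    if (offset / blocksize == (end - 1) / blocksize) {
        /* Range lives in one block: skip it if it covers the whole block. */
        if (length != blocksize) {
            p.head_off = offset;
            p.head_len = length;
        }
        return p;
    }
    if (head_part) {
        p.head_off = offset;
        p.head_len = blocksize - head_part;
    }
    if (tail_part) {
        p.tail_off = end - tail_part;
        p.tail_len = tail_part;
    }
    return p;
}
```

    Fully covered blocks in the middle of the range never appear in the
    plan; they are expected to be freed rather than zeroed.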
     
  • The function ext4_write_inline_data_end() can return an error. So we
    need to assign it to a signed integer variable to check for an error
    return (since copied is an unsigned int).

    Signed-off-by: "Theodore Ts'o"
    Cc: Zheng Liu
    Cc: stable@vger.kernel.org

    Theodore Ts'o
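
    Why the signed variable matters can be shown with a small model. The
    function write_end_model is a hypothetical stand-in for a helper that
    returns either a byte count or a negative errno; nothing here is the
    actual ext4 code.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: a byte count on success, negative errno on error. */
static int write_end_model(int fail)
{
    return fail ? -ENOSPC : 128;
}

/*
 * Problem pattern: the result lands in an unsigned variable, so a
 * negative errno turns into a huge "byte count" and can never test < 0.
 */
static unsigned store_unsigned(int fail)
{
    unsigned copied = (unsigned)write_end_model(fail);
    return copied;
}

/* Fixed pattern: check a signed result first, then keep the count. */
static int store_signed(int fail, unsigned *copied)
{
    int ret = write_end_model(fail);

    assert(copied != NULL);
    if (ret < 0)
        return ret;
    *copied = (unsigned)ret;
    return 0;
}
```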
     
  • Comparing unsigned variable with 0 always returns false.
    err = 0 is duplicated and unnecessary.

    [ tytso: Also cleaned up error handling in ext4_block_zero_page_range() ]

    Signed-off-by: "Jon Ernst"
    Signed-off-by: "Theodore Ts'o"

    jon ernst
     
  • In both ext3 and ext4, htree_dirblock_to_tree() is just filling the
    in-core rbtree for use by call_filldir(). All updates of ->f_pos are
    done by the latter; bumping it here (on error) is obviously wrong - we
    might very well have it nowhere near the block we'd found an error in.

    Signed-off-by: Al Viro
    Signed-off-by: "Theodore Ts'o"
    Cc: stable@vger.kernel.org

    Al Viro
     
  • No need to pass file pointer when we can directly pass inode pointer.

    Signed-off-by: Ashish Sangwan
    Signed-off-by: Namjae Jeon
    Signed-off-by: "Theodore Ts'o"

    Ashish Sangwan
     
  • The ext4 inline_data feature uses the xattr space to store the
    inline data in the inode. When we account for the inline data as an
    xattr, we add the padding. But in the
    get_max_inline_xattr_value_size() function we count the free space
    without the padding. This causes some contents to be moved to a block
    even if they could be stored in the inode.

    Signed-off-by: liulei
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Tao Ma

    boxi liu
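
    The padding mismatch can be illustrated with the rounding arithmetic
    alone. This sketch assumes xattr values are padded to a 4-byte
    boundary; the macro and function names are hypothetical, and the
    second helper only models "reclaim the padded footprint of the value
    being replaced", not the full kernel calculation.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: values are rounded up to a multiple of 4 bytes. */
#define XATTR_ROUND_MODEL 3u

/* Size a value of value_size bytes occupies once padding is included. */
static size_t xattr_padded_size(size_t value_size)
{
    return (value_size + XATTR_ROUND_MODEL) & ~(size_t)XATTR_ROUND_MODEL;
}

/*
 * Space usable for inline data when an existing value of old_value_size
 * bytes is replaced: the free bytes plus the old value's padded size.
 */
static size_t max_inline_value_size(size_t free_bytes, size_t old_value_size)
{
    return free_bytes + xattr_padded_size(old_value_size);
}
```

    With a 10-byte existing value, counting only the unpadded 10 bytes
    instead of the padded 12 under-reports the available space by 2 bytes,
    which is the kind of gap that can push data out of the inode.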
     
  • Reducing the object size by ~10% could be useful for embedded systems.

    Add #ifdef CONFIG_PRINTK #else #endif blocks to hold formats and
    arguments, passing " " to functions when !CONFIG_PRINTK and still
    verifying format and arguments with no_printk.

    $ size fs/ext4/built-in.o*
       text    data     bss     dec     hex filename
     239375     610     888  240873   3ace9 fs/ext4/built-in.o.new
     264167     738     888  265793   40e41 fs/ext4/built-in.o.old

    $ grep -E "CONFIG_EXT4|CONFIG_PRINTK" .config
    # CONFIG_PRINTK is not set
    CONFIG_EXT4_FS=y
    CONFIG_EXT4_USE_FOR_EXT23=y
    CONFIG_EXT4_FS_POSIX_ACL=y
    # CONFIG_EXT4_FS_SECURITY is not set
    # CONFIG_EXT4_DEBUG is not set

    Signed-off-by: Joe Perches
    Signed-off-by: "Theodore Ts'o"

    Joe Perches
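
    The general technique (compile the message out while still letting the
    compiler check format and arguments) can be sketched in user space.
    MODEL_PRINTK and model_msg are hypothetical names chosen for this
    example; the kernel's own macros differ.

```c
#include <assert.h>
#include <stdio.h>

#ifdef MODEL_PRINTK
/* Messages enabled: format into the caller's buffer. */
#define model_msg(out, n, ...) snprintf((out), (n), __VA_ARGS__)
#else
/*
 * Messages disabled: the snprintf() call sits in a branch that is never
 * taken, so the format string and arguments are still type-checked but
 * nothing is written. The expression evaluates to 0.
 */
#define model_msg(out, n, ...) \
    ((void)(0 ? snprintf((out), (n), __VA_ARGS__) : 0), 0)
#endif
```

    Built without MODEL_PRINTK, a call such as
    model_msg(buf, sizeof buf, "%d", 5) leaves buf untouched and yields 0,
    and the compiler can drop the unreachable call.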
     
  • Now we maintain a proper in-order LRU list in ext4 to reclaim entries
    from the extent status tree when we are under heavy memory pressure.
    To keep this order, a spin lock is used to protect this list. But this
    lock burns a lot of CPU time. We can use the following steps to
    trigger it.

    % cd /dev/shm
    % dd if=/dev/zero of=ext4-img bs=1M count=2k
    % mkfs.ext4 ext4-img
    % mount -t ext4 -o loop ext4-img /mnt
    % cd /mnt
    % for ((i=0;i

    Zheng Liu
     
  • If memory allocation in ext4_mb_new_group_pa() fails,
    it returns an error code, ext4_mb_new_preallocation() propagates it,
    but ext4_mb_new_blocks() ignores it.

    An observed result was:

    - allocation fail means ext4_mb_new_group_pa() does not update
    ext4_allocation_context;

    - ext4_mb_new_blocks() sets ext4_allocation_request->len (ar->len =
    ac->ac_b_ex.fe_len;) to number of blocks preallocated (512) instead
    of number of blocks requested (1);

    - that activates update cycle in ext4_splice_branch():
    for (i = 1; i < blks; i++) *(where->p + i) = cpu_to_le32(current_block++);

    - it iterates 511 times and corrupts a chunk of memory including inode
    structure;

    - page fault happens at EXT4_SB(inode->i_sb) in ext4_mark_inode_dirty();

    - system hangs with 'scheduling while atomic' BUG.

    The patch implements a check for ext4_mb_new_preallocation() error
    code and handles its failure as if ext4_mb_regular_allocator() fails.

    Found by Linux File System Verification project (linuxtesting.org).

    [ Patch restructured by tytso to make the flow of control easier to follow. ]

    Signed-off-by: Alexey Khoroshilov
    Signed-off-by: "Theodore Ts'o"

    Alexey Khoroshilov
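
    The failure mode and the fix can be modelled in a few lines. This is a
    heavily simplified sketch with hypothetical names: best_len stands for
    the length recorded in the allocation context (512 here, as in the
    report) and the preallocation step is what would normally trim it to
    the requested size.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, reduced allocation context. */
struct alloc_ctx_model {
    unsigned requested; /* blocks the caller asked for */
    unsigned best_len;  /* length currently recorded in the context */
};

/* On failure the context is left exactly as it was. */
static int new_preallocation_model(struct alloc_ctx_model *ac, int fail)
{
    if (fail)
        return -ENOMEM;
    ac->best_len = ac->requested;
    return 0;
}

/* Caller that checks the error instead of trusting a stale context. */
static int new_blocks_model(unsigned requested, int fail, unsigned *len_out)
{
    struct alloc_ctx_model ac = { requested, 512 };
    int err;

    assert(len_out != NULL);
    err = new_preallocation_model(&ac, fail);
    if (err) {
        *len_out = 0; /* report nothing allocated */
        return err;
    }
    *len_out = ac.best_len;
    return 0;
}
```

    Without the error check, the caller would report 512 blocks for a
    one-block request, which matches the over-long update loop described
    in the report.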
     
  • Subtracting the number of the first data block places the superblock
    backups one block too early, corrupting the file system. When the block
    size is larger than 1K, the first data block is 0, so the subtraction
    has no effect and no corruption occurs.

    Signed-off-by: Maarten ter Huurne
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Jan Kara
    CC: stable@vger.kernel.org

    Maarten ter Huurne
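
    A worked example of the off-by-one, assuming the usual layout where a
    group's first block is first_data_block + group * blocks_per_group.
    The function names are hypothetical and the figures (8192 blocks per
    group for 1K blocks, 32768 for 4K blocks) are typical defaults used
    only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* First block of a block group; a backup superblock would live here. */
static uint64_t group_first_block(uint64_t group, uint64_t blocks_per_group,
                                  uint64_t first_data_block)
{
    return first_data_block + group * blocks_per_group;
}

/* The faulty variant: subtracting the first data block again. */
static uint64_t group_first_block_buggy(uint64_t group,
                                        uint64_t blocks_per_group,
                                        uint64_t first_data_block)
{
    return group_first_block(group, blocks_per_group, first_data_block)
           - first_data_block;
}
```

    With 1K blocks the first data block is 1, so group 1 starts at block
    8193 and the faulty variant yields 8192. With 4K blocks the first data
    block is 0 and both variants agree.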
     

29 Jun, 2013

1 commit


17 Jun, 2013

1 commit


13 Jun, 2013

3 commits


12 Jun, 2013

2 commits

  • Commit 18888cf0883c: "ext4: speed up truncate/unlink by not using
    bforget() unless needed" removed the use of EXT4_FREE_BLOCKS_FORGET in
    the most important codepath for file systems using extents, but a
    similar optimization also can be done for file systems using indirect
    blocks, and for the two special cases in the ext4 extents code.

    Cc: Andrey Sidorov
    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     
  • For file systems with a very large number of block groups, if all of
    the block group bitmaps are in memory and the file system is
    relatively badly fragmented, it's possible for
    ext4_mb_regular_allocator() to take a long time trying to find a good
    match. This is especially true if the tuning parameter mb_max_to_scan
    has been set to a very large number. So add a cond_resched() to avoid
    soft lockup warnings and to provide better system responsiveness.

    For ext4_free_blocks(), if we are deleting a large range of blocks,
    and data=journal is enabled so that EXT4_FREE_BLOCKS_FORGET is passed,
    the loop to call sb_find_get_block() and to call ext4_forget() can
    take 10-15 milliseconds or more. So it's better to add a
    cond_resched() here as well.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     

07 Jun, 2013

1 commit

  • Rename ext4_da_writepages() to ext4_writepages() and use it for all
    modes. We still need to iterate over all the pages in the case of
    data=journalling, but in the case of nodelalloc/data=ordered (which is
    what file systems mounted using ext3 backwards compatibility will use)
    this will allow us to use a much more efficient I/O submission path.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
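
    The benefit of submitting contiguous blocks together can be
    illustrated with a simple count. This is not the kernel's bio
    interface; count_requests is a hypothetical helper that assumes the
    dirty pages map to the listed physical block numbers in order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Count the I/O requests needed if every run of physically contiguous
 * blocks is sent as a single request.
 */
static size_t count_requests(const uint64_t *blocks, size_t n)
{
    size_t i, requests = 0;

    assert(blocks != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (i == 0 || blocks[i] != blocks[i - 1] + 1)
            requests++; /* a gap starts a new request */
    }
    return requests;
}
```

    For blocks 100-103 followed by 200-201, the coalescing path issues two
    requests, where a page-at-a-time path would issue six and rely on
    later merging.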
     

06 Jun, 2013

4 commits


05 Jun, 2013

11 commits

  • Now that we clear PageWriteback after extent conversion, there's no
    need to wait for io_end processing in ext4_evict_inode(). Running
    AIO/DIO keeps a file reference until aio_complete() is called, so
    ext4_evict_inode() cannot be called. For io_end structures resulting
    from buffered IO, the waiting already happens because we wait for
    PageWriteback in truncate_inode_pages().

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • We don't have to wait for extent conversion in ext4_punch_hole() as
    buffered IO for the punched range has been flushed and waited upon
    (thus all extent conversions for that range have completed). Also we
    wait for all DIO to finish using inode_dio_wait() so there cannot be
    any extent conversions pending due to direct IO.

    Also remove ext4_flush_unwritten_io() since it's unused now.

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • We don't have to wait for unwritten extent conversion in
    ext4_ind_direct_IO() as all writes that happened before DIO are
    flushed by the generic code and extent conversion has happened before
    we cleared PageWriteback bit.

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • After removal of ext4_flush_unwritten_io() call, ext4_file_sync()
    doesn't need i_mutex anymore. Forcing of transaction commits doesn't
    need i_mutex as there's nothing inode specific in that code apart from
    grabbing transaction ids from the inode. So remove the lock.

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • Just use the generic function instead of duplicating it. We only need
    to reshuffle the read-only check a bit (which is there to prevent
    writing to a filesystem which has been remounted read-only after error
    I assume).

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • Since PageWriteback bit is now cleared after extents are converted
    from unwritten to written ones, we have full exclusion of writeback
    path from truncate (truncate_inode_pages() waits for PageWriteback
    bits to get cleared on all invalidated pages). Exclusion from DIO
    path is achieved by inode_dio_wait() call in ext4_setattr(). So
    there's no need to wait for extent conversion in ext4_truncate()
    anymore.

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • Make sure extent conversion after DIO happens while i_dio_count is
    still elevated so that inode_dio_wait() waits until extent conversion
    is done. This removes the need for explicit waiting for extent
    conversion in some cases.

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
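
    The ordering requirement can be sketched with a single-threaded model:
    conversion must finish before the in-flight counter drops, so that a
    waiter who sees the counter at zero also sees converted extents. The
    struct and function names are hypothetical; dio_count stands for
    i_dio_count.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced inode state. */
struct inode_model {
    int dio_count; /* direct I/Os in flight */
    int unwritten; /* nonzero while extents await conversion */
};

static void dio_begin(struct inode_model *inode)
{
    assert(inode != NULL);
    inode->dio_count++;
    inode->unwritten = 1;
}

/* Convert first, then drop the count: the order is the point. */
static void dio_end(struct inode_model *inode)
{
    assert(inode != NULL && inode->dio_count > 0);
    inode->unwritten = 0;
    inode->dio_count--;
}

/* What a waiter relies on: count == 0 implies nothing is unwritten. */
static int wait_invariant_holds(const struct inode_model *inode)
{
    return inode->dio_count != 0 || inode->unwritten == 0;
}
```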
     
  • Currently PageWriteback bit gets cleared from put_io_page() called
    from ext4_end_bio(). This is somewhat inconvenient as extent tree is
    not fully updated at that time (unwritten extents are not marked as
    written) so we cannot read the data back yet. This design was
    dictated by lock ordering as we cannot start a transaction while
    PageWriteback bit is set (we could easily deadlock with
    ext4_da_writepages()). But now that we use transaction reservation
    for extent conversion, locking issues are solved and we can move
    PageWriteback bit clearing after extent conversion is done. As a
    result we can remove wait for unwritten extent conversion from
    ext4_sync_file() because it already implicitly happens through
    wait_on_page_writeback().

    We implement deferring of PageWriteback clearing by queueing completed
    bios to appropriate io_end and processing all the pages when io_end is
    going to be freed instead of at the moment ext4_io_end() is called.

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • Now that we have extent conversions with reserved transaction, we have
    to prevent extent conversions without reserved transaction (from DIO
    code) from blocking these (as that would effectively void any
    transaction reservation we did). So split lists, work items, and work
    queues into reserved and unreserved parts.

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • Later we would like to clear PageWriteback bit only after extent
    conversion from unwritten to written extents is performed. However it
    is not possible to start a transaction after PageWriteback is set
    because that violates lock ordering (and is easy to deadlock). So we
    have to reserve a transaction before locking pages and sending them
    for IO and later we use the transaction for extent conversion from
    ext4_end_io().

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • There isn't any need for setting BH_Uninit on buffers anymore. It was
    only used to signal we need to mark io_end as needing extent
    conversion in add_bh_to_extent() but now we can mark the io_end
    directly when mapping extent.

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara