09 Jun, 2014

1 commit

  • Pull ext4 updates from Ted Ts'o:
    "Clean ups and miscellaneous bug fixes, in particular for the new
    collapse_range and zero_range fallocate functions. In addition,
    improve the scalability of adding and removing inodes from the orphan
    list"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (25 commits)
    ext4: handle symlink properly with inline_data
    ext4: fix wrong assert in ext4_mb_normalize_request()
    ext4: fix zeroing of page during writeback
    ext4: remove unused local variable "stored" from ext4_readdir(...)
    ext4: fix ZERO_RANGE test failure in data journalling
    ext4: reduce contention on s_orphan_lock
    ext4: use sbi in ext4_orphan_{add|del}()
    ext4: use EXT_MAX_BLOCKS in ext4_es_can_be_merged()
    ext4: add missing BUFFER_TRACE before ext4_journal_get_write_access
    ext4: remove unnecessary double parentheses
    ext4: do not destroy ext4_groupinfo_caches if ext4_mb_init() fails
    ext4: make local functions static
    ext4: fix block bitmap validation when bigalloc, ^flex_bg
    ext4: fix block bitmap initialization under sparse_super2
    ext4: find the group descriptors on a 1k-block bigalloc,meta_bg filesystem
    ext4: avoid unneeded lookup when xattr name is invalid
    ext4: fix data integrity sync in ordered mode
    ext4: remove obsoleted check
    ext4: add a new spinlock i_raw_lock to protect the ext4's raw inode
    ext4: fix locking for O_APPEND writes
    ...

    Linus Torvalds
     

05 Jun, 2014

1 commit

  • The last in-tree caller of block_write_full_page_endio() was removed in
    January 2013. It's time to remove the EXPORT_SYMBOL, which leaves
    block_write_full_page() as the only caller of
    block_write_full_page_endio(), so inline block_write_full_page_endio()
    into block_write_full_page().

    Signed-off-by: Matthew Wilcox
    Cc: Hugh Dickins
    Cc: Dave Chinner
    Cc: Dheeraj Reddy
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

28 May, 2014

1 commit

  • Tail of a page straddling inode size must be zeroed when being written
    out due to POSIX requirement that modifications of mmaped page beyond
    inode size must not be written to the file. ext4_bio_write_page() did
    this only for blocks fully beyond inode size but didn't properly zero
    blocks partially beyond inode size. Fix this.

    The problem has been uncovered by mmap_11-4 test in openposix test suite
    (part of LTP).

    Reported-by: Xiaoguang Wang
    Fixes: 5a0dc7365c240
    Fixes: bd2d0210cf22f
    CC: stable@vger.kernel.org
    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
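
    A minimal user-space sketch of the zeroing rule described above, assuming
    a page held in a plain byte array. The function name zero_page_tail() and
    its parameters are hypothetical and are not part of ext4; the real fix
    works on buffer heads inside ext4_bio_write_page().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical user-space model, not kernel code: zero the part of an
 * in-memory page that lies at or beyond the file size before the page
 * is written out.  page_start is the file offset of the page's first
 * byte.  Returns the number of bytes zeroed.
 */
static size_t zero_page_tail(unsigned char *page, size_t page_size,
                             unsigned long long page_start,
                             unsigned long long file_size)
{
    size_t valid;

    if (file_size >= page_start + page_size)
        return 0;                       /* page lies entirely inside the file */
    if (file_size <= page_start)
        valid = 0;                      /* page lies entirely beyond the file */
    else
        valid = (size_t)(file_size - page_start);

    /* Covers a block that is only partially beyond the file size too. */
    memset(page + valid, 0, page_size - valid);
    return page_size - valid;
}
```

    With a 4096-byte page starting at offset 4096 and a file size of 5596,
    bytes 1500..4095 of the page are cleared, including the part of the
    1 KiB block that is only partially beyond the file size.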
     

12 May, 2014

1 commit

  • When we perform a data integrity sync we tag all the dirty pages with
    PAGECACHE_TAG_TOWRITE at the start of ext4_da_writepages. Later we check
    for this tag in write_cache_pages_da and create a struct
    mpage_da_data containing contiguously indexed pages tagged with this
    tag and sync these pages with a call to mpage_da_map_and_submit. This
    process is done in a while loop until all the PAGECACHE_TAG_TOWRITE
    pages are synced. We also do journal start and stop in each iteration.
    journal_stop could initiate a journal commit which would call
    ext4_writepage which in turn will call ext4_bio_write_page even for
    delayed OR unwritten buffers. When ext4_bio_write_page is called for
    such buffers, even though it does not sync them, it clears the
    PAGECACHE_TAG_TOWRITE of the corresponding page and hence these pages
    are also not synced by the currently running data integrity sync. We
    will end up with dirty pages although the sync is completed.

    This could cause a potential data loss when the sync call is followed
    by a truncate_pagecache call, which is exactly the case in
    collapse_range. (It will cause generic/127 failure in xfstests)

    To avoid this issue, we can use set_page_writeback_keepwrite instead of
    set_page_writeback, which doesn't clear TOWRITE tag.

    Cc: stable@vger.kernel.org
    Signed-off-by: Namjae Jeon
    Signed-off-by: Ashish Sangwan
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Jan Kara

    Namjae Jeon
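
    A small user-space model of the tag handling described above. The sim_*
    names and the struct are hypothetical stand-ins; the kernel keeps these
    tags in the page cache tree, not in per-page booleans.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for per-page state; not the kernel's layout. */
struct sim_page {
    bool dirty;
    bool towrite;    /* models PAGECACHE_TAG_TOWRITE */
    bool writeback;
};

/* Tag every dirty page at the start of a data integrity sync. */
static void sim_tag_pages_for_writeback(struct sim_page *pages, int n)
{
    for (int i = 0; i < n; i++)
        if (pages[i].dirty)
            pages[i].towrite = true;
}

/* Models set_page_writeback(): starting writeback drops the TOWRITE tag. */
static void sim_set_writeback(struct sim_page *p)
{
    p->writeback = true;
    p->towrite = false;
}

/* Models set_page_writeback_keepwrite(): the TOWRITE tag survives. */
static void sim_set_writeback_keepwrite(struct sim_page *p)
{
    p->writeback = true;
}

/* Count the pages a running sync would still find by looking up TOWRITE. */
static int sim_count_towrite(const struct sim_page *pages, int n)
{
    int count = 0;

    for (int i = 0; i < n; i++)
        if (pages[i].towrite)
            count++;
    return count;
}
```

    In this model a page passed through sim_set_writeback() becomes invisible
    to the sync loop, while sim_set_writeback_keepwrite() leaves it findable.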
     

07 Apr, 2014

1 commit

  • ext4_end_bio() currently throws away the error that it receives. Chances
    are this is part of a spate of errors, one of which will end up getting
    the error returned to userspace somehow, but we shouldn't take that risk.
    Also print out the errno to aid in debugging.

    Signed-off-by: Matthew Wilcox
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Jan Kara
    Cc: stable@vger.kernel.org

    Matthew Wilcox
     

24 Nov, 2013

2 commits

  • Immutable biovecs are going to require an explicit iterator. To
    implement immutable bvecs, a later patch is going to add a bi_bvec_done
    member to this struct; for now, this patch effectively just renames
    things.

    Signed-off-by: Kent Overstreet
    Cc: Jens Axboe
    Cc: Geert Uytterhoeven
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: "Ed L. Cashin"
    Cc: Nick Piggin
    Cc: Lars Ellenberg
    Cc: Jiri Kosina
    Cc: Matthew Wilcox
    Cc: Geoff Levand
    Cc: Yehuda Sadeh
    Cc: Sage Weil
    Cc: Alex Elder
    Cc: ceph-devel@vger.kernel.org
    Cc: Joshua Morris
    Cc: Philip Kelleher
    Cc: Rusty Russell
    Cc: "Michael S. Tsirkin"
    Cc: Konrad Rzeszutek Wilk
    Cc: Jeremy Fitzhardinge
    Cc: Neil Brown
    Cc: Alasdair Kergon
    Cc: Mike Snitzer
    Cc: dm-devel@redhat.com
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: linux390@de.ibm.com
    Cc: Boaz Harrosh
    Cc: Benny Halevy
    Cc: "James E.J. Bottomley"
    Cc: Greg Kroah-Hartman
    Cc: "Nicholas A. Bellinger"
    Cc: Alexander Viro
    Cc: Chris Mason
    Cc: "Theodore Ts'o"
    Cc: Andreas Dilger
    Cc: Jaegeuk Kim
    Cc: Steven Whitehouse
    Cc: Dave Kleikamp
    Cc: Joern Engel
    Cc: Prasad Joshi
    Cc: Trond Myklebust
    Cc: KONISHI Ryusuke
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Ben Myers
    Cc: xfs@oss.sgi.com
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Len Brown
    Cc: Pavel Machek
    Cc: "Rafael J. Wysocki"
    Cc: Herton Ronaldo Krzesinski
    Cc: Ben Hutchings
    Cc: Andrew Morton
    Cc: Guo Chao
    Cc: Tejun Heo
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Wei Yongjun
    Cc: "Roger Pau Monné"
    Cc: Jan Beulich
    Cc: Stefano Stabellini
    Cc: Ian Campbell
    Cc: Sebastian Ott
    Cc: Christian Borntraeger
    Cc: Minchan Kim
    Cc: Jiang Liu
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Joe Perches
    Cc: Peng Tao
    Cc: Andy Adamson
    Cc: fanchaoting
    Cc: Jie Liu
    Cc: Sunil Mushran
    Cc: "Martin K. Petersen"
    Cc: Namjae Jeon
    Cc: Pankaj Kumar
    Cc: Dan Magenheimer
    Cc: Mel Gorman

    Kent Overstreet
     
  • With immutable biovecs we don't want code accessing bi_io_vec directly -
    the uses this patch changes weren't incorrect since they all own the
    bio, but it makes the code harder to audit for no good reason - also,
    this will help with multipage bvecs later.

    Signed-off-by: Kent Overstreet
    Cc: Jens Axboe
    Cc: Alexander Viro
    Cc: Chris Mason
    Cc: Jaegeuk Kim
    Cc: Joern Engel
    Cc: Prasad Joshi
    Cc: Trond Myklebust

    Kent Overstreet
     

16 Oct, 2013

1 commit

  • It doesn't make sense to require io_end->handle when we are in
    nojournal mode. So update the assertion accordingly to avoid false
    warnings from ext4_add_complete_io().

    Reported-by: Eric Whitney
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     

04 Sep, 2013

1 commit

  • Add support to the core direct-io code to defer AIO completions to user
    context using a workqueue. This replaces opencoded and less efficient
    code in XFS and ext4 (we save a memory allocation for each direct IO)
    and will be needed to properly support O_(D)SYNC for AIO.

    The communication between the filesystem and the direct I/O code requires
    a new buffer head flag, which is a bit ugly but not avoidable until the
    direct I/O code stops abusing the buffer_head structure for communicating
    with the filesystems.

    Currently this creates a per-superblock unbound workqueue for these
    completions, which is taken from an earlier patch by Jan Kara. I'm
    not really convinced about this use and would prefer a "normal" global
    workqueue with a high concurrency limit, but this needs further discussion.

    JK: Fixed ext4 part, dynamic allocation of the workqueue.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Christoph Hellwig
     

12 Jul, 2013

1 commit

  • If there are a lot of outstanding buffered IOs when a device is
    taken offline (due to hardware errors etc), ext4_end_bio prints
    out a message for each failed logical block. While this is desirable,
    we see thousands of such lines being printed out before the
    serial console gets overwhelmed, causing ext4_end_bio() to wait for
    the printk to complete.

    This in itself isn't a disaster, except for the detail that this
    function is being called with the queue lock held.
    This causes any other function in the block layer
    to spin on its spin_lock_irqsave while the serial console is
    draining. If NMI watchdog is enabled on this machine then it
    eventually comes along and shoots the machine in the head.

    The end result is that losing any one disk causes the machine to
    go down. This patch rate limits the printk to bandaid around the
    problem.

    Tested: xfstests
    Change-Id: I8ab5690dcf4f3a67e78be147d45e489fdf4a88d8
    Signed-off-by: Anatol Pomozov
    Signed-off-by: "Theodore Ts'o"

    Anatol Pomozov
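
    A sketch of the rate-limiting idea in user space, assuming a simple
    "at most burst messages per interval" policy. The struct and function
    names are hypothetical and are not the kernel's printk ratelimit API;
    the time source is passed in by the caller to keep the example
    deterministic.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical user-space rate limiter: allow at most `burst` messages
 * per `interval` time units and count the ones that were suppressed.
 */
struct sim_ratelimit {
    unsigned long interval;  /* window length */
    unsigned int burst;      /* messages allowed per window */
    unsigned long begin;     /* start of the current window */
    unsigned int printed;    /* messages emitted in this window */
    unsigned int missed;     /* messages suppressed in this window */
};

/* Returns true if the caller may print a message at time `now`. */
static bool sim_ratelimit_ok(struct sim_ratelimit *rs, unsigned long now)
{
    if (now - rs->begin >= rs->interval) {
        /* A new window starts: reset the counters. */
        rs->begin = now;
        rs->printed = 0;
        rs->missed = 0;
    }
    if (rs->printed < rs->burst) {
        rs->printed++;
        return true;
    }
    rs->missed++;
    return false;
}
```

    With a burst of 2, a storm of 1000 failed-block messages inside one
    window yields 2 printed lines and 998 suppressed ones, so the completion
    path no longer waits on the console for each block.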
     

11 Jul, 2013

1 commit

  • The following race can lead to ext4_evict_inode() seeing i_ioend_count
    > 0 and thus triggering a sanity check warning:

    CPU1                                CPU2
    ext4_end_bio()                      ext4_evict_inode()
      ext4_finish_bio()
        end_page_writeback();
                                          truncate_inode_pages()
                                            evict page
                                          WARN_ON(i_ioend_count > 0);
      ext4_put_io_end_defer()
        ext4_release_io_end()
          dec i_ioend_count

    This is a possible use-after-free bug since we decrement i_ioend_count in
    a possibly released inode.

    Since i_ioend_count is used only for sanity checks one possible solution
    would be to just remove it but for now I'd like to keep those sanity
    checks to help debugging the new ext4 writeback code.

    This patch changes ext4_end_bio() to call ext4_put_io_end_defer() before
    ext4_finish_bio() in the shortcut case when unwritten extent conversion
    isn't needed. In that case we don't need the io_end so we are safe to
    drop it early.

    Reported-by: Guenter Roeck
    Tested-by: Guenter Roeck
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     

06 Jun, 2013

1 commit


05 Jun, 2013

8 commits

  • Now that we clear PageWriteback after extent conversion, there's no
    need to wait for io_end processing in ext4_evict_inode(). Running
    AIO/DIO keeps file reference until aio_complete() is called so
    ext4_evict_inode() cannot be called. For io_end structures resulting
    from buffered IO, waiting happens because we wait for
    PageWriteback in truncate_inode_pages().

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • We don't have to wait for extent conversion in ext4_punch_hole() as
    buffered IO for the punched range has been flushed and waited upon
    (thus all extent conversions for that range have completed). Also we
    wait for all DIO to finish using inode_dio_wait() so there cannot be
    any extent conversions pending due to direct IO.

    Also remove ext4_flush_unwritten_io() since it's unused now.

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • Since PageWriteback bit is now cleared after extents are converted
    from unwritten to written ones, we have full exclusion of writeback
    path from truncate (truncate_inode_pages() waits for PageWriteback
    bits to get cleared on all invalidated pages). Exclusion from DIO
    path is achieved by inode_dio_wait() call in ext4_setattr(). So
    there's no need to wait for extent conversion in ext4_truncate()
    anymore.

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • Currently PageWriteback bit gets cleared from put_io_page() called
    from ext4_end_bio(). This is somewhat inconvenient as extent tree is
    not fully updated at that time (unwritten extents are not marked as
    written) so we cannot read the data back yet. This design was
    dictated by lock ordering as we cannot start a transaction while
    PageWriteback bit is set (we could easily deadlock with
    ext4_da_writepages()). But now that we use transaction reservation
    for extent conversion, locking issues are solved and we can move
    PageWriteback bit clearing after extent conversion is done. As a
    result we can remove wait for unwritten extent conversion from
    ext4_sync_file() because it already implicitly happens through
    wait_on_page_writeback().

    We implement deferring of PageWriteback clearing by queueing completed
    bios to appropriate io_end and processing all the pages when io_end is
    going to be freed instead of at the moment ext4_io_end() is called.

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • Now that we have extent conversions with reserved transaction, we have
    to prevent extent conversions without reserved transaction (from DIO
    code) from blocking these (as that would effectively void any transaction
    reservation we did). So split lists, work items, and work queues to
    reserved and unreserved parts.

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • Later we would like to clear PageWriteback bit only after extent
    conversion from unwritten to written extents is performed. However it
    is not possible to start a transaction after PageWriteback is set
    because that violates lock ordering (and is easy to deadlock). So we
    have to reserve a transaction before locking pages and sending them
    for IO and later we use the transaction for extent conversion from
    ext4_end_io().

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • There isn't any need for setting BH_Uninit on buffers anymore. It was
    only used to signal we need to mark io_end as needing extent
    conversion in add_bh_to_extent() but now we can mark the io_end
    directly when mapping extent.

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • There are two issues with current writeback path in ext4. For one we
    don't necessarily map complete pages when blocksize < pagesize and
    thus needn't do any writeback in one iteration. We always map some
    blocks though so we will eventually finish mapping the page. Just if
    writeback races with other operations on the file, forward progress is
    not really guaranteed. The second problem is that current code
    structure makes it hard to associate all the bios to some range of
    pages with one io_end structure so that unwritten extents can be
    converted after all the bios are finished. This will be especially
    difficult later when io_end will be associated with reserved
    transaction handle.

    We restructure the writeback path to a relatively simple loop which
    first prepares extent of pages, then maps one or more extents so that
    no page is partially mapped, and once page is fully mapped it is
    submitted for IO. We keep all the mapping and IO submission
    information in mpage_da_data structure to somewhat reduce stack usage.
    Resulting code is somewhat shorter than the old one and hopefully also
    easier to read.

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     

04 Jun, 2013

1 commit

  • Change writeback path to create just one io_end structure for the
    extent to which we submit IO and share it among bios writing that
    extent. This prevents needless splitting and joining of unwritten
    extents when they cannot be submitted as a single bio.

    Bugs in ENOMEM handling found by Linux File System Verification project
    (linuxtesting.org) and fixed by Alexey Khoroshilov.

    CC: Alexey Khoroshilov
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
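
    One way to picture a single io_end shared by several bios is a reference
    count, sketched below in user space. The sim_* names are hypothetical and
    the use of a plain integer counter is an assumption of this sketch; it is
    not ext4's ext4_io_end_t.

```c
#include <assert.h>

/*
 * Hypothetical model of one completion context shared by several bios.
 */
struct sim_io_end {
    int refcount;        /* one reference per in-flight bio plus the submitter */
    int conversions;     /* how many times the extent was converted */
};

static void sim_io_end_init(struct sim_io_end *io)
{
    io->refcount = 1;    /* reference held by the submitter */
    io->conversions = 0;
}

/* Each bio carved out of the extent takes its own reference. */
static void sim_io_end_get(struct sim_io_end *io)
{
    io->refcount++;
}

/*
 * Drop a reference.  Only the final put converts the extent, so the
 * extent is converted once no matter how many bios covered it.
 * Returns 1 if this call was the final put, 0 otherwise.
 */
static int sim_io_end_put(struct sim_io_end *io)
{
    assert(io->refcount > 0);
    if (--io->refcount > 0)
        return 0;
    io->conversions++;
    return 1;
}
```

    Three bios writing one extent then lead to exactly one conversion, which
    is the point of not splitting the extent per bio.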
     

15 May, 2013

1 commit

  • Pull ext4 update from Ted Ts'o:
    "Fixed regressions (two stability regressions and a performance
    regression) introduced during the 3.10-rc1 merge window.

    Also included is a bug fix relating to allocating blocks after
    resizing an ext3 file system when using the ext4 file system driver"

    * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    jbd,jbd2: fix oops in jbd2_journal_put_journal_head()
    ext4: revert "ext4: use io_end for multiple bios"
    ext4: limit group search loop for non-extent files
    ext4: fix fio regression

    Linus Torvalds
     

12 May, 2013

1 commit


08 May, 2013

1 commit

  • Faster kernel compiles by way of fewer unnecessary includes.

    [akpm@linux-foundation.org: fix fallout]
    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Kent Overstreet
    Cc: Zach Brown
    Cc: Felipe Balbi
    Cc: Greg Kroah-Hartman
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Rusty Russell
    Cc: Jens Axboe
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Jeff Moyer
    Cc: Al Viro
    Cc: Benjamin LaHaise
    Reviewed-by: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kent Overstreet
     

12 Apr, 2013

3 commits

  • Currently no one clears the buffer_uninit flag. This results in writeback
    needlessly marking io_end as needing extent conversion and scanning the
    extent tree for extents to convert. So clear the buffer_uninit flag once the
    buffer is submitted for IO and the flag is transformed into
    EXT4_IO_END_UNWRITTEN flag.

    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Zheng Liu

    Jan Kara
     
  • Change writeback path to create just one io_end structure for the
    extent to which we submit IO and share it among bios writing that
    extent. This prevents needless splitting and joining of unwritten
    extents when they cannot be submitted as a single bio.

    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Dmitry Monakhov
    Reviewed-by: Zheng Liu

    Jan Kara
     
  • So far ext4_bio_write_page() attached all the pages to ext4_io_end
    structure. This makes that structure pretty heavy (1 KB for pointers
    + 16 bytes per page attached to the bio). Also later we would like to
    share ext4_io_end structure among several bios in case IO to a single
    extent needs to be split among several bios and pointing to pages from
    ext4_io_end makes this complex.

    We remove page pointers from ext4_io_end and use pointers from bio
    itself instead. This isn't as easy when blocksize < pagesize because
    then we can have several bios in flight for a single page and we have
    to be careful when to call end_page_writeback(). However this is a
    known problem already solved by block_write_full_page() /
    end_buffer_async_write() so we mimic its behavior here. We mark
    buffers going to disk with BH_Async_Write flag and in
    ext4_bio_end_io() we check whether there are any buffers with
    BH_Async_Write flag left. If there are not, we can call
    end_page_writeback().

    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Dmitry Monakhov
    Reviewed-by: Zheng Liu

    Jan Kara
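
    The "last buffer ends page writeback" rule described above can be
    modelled in user space as follows. The struct, the fixed buffer limit,
    and the sim_* function names are hypothetical; real code walks the
    page's buffer_head ring under appropriate locking.

```c
#include <assert.h>
#include <stdbool.h>

#define SIM_MAX_BUFFERS 8

/* Hypothetical model of a page split into buffers; not struct buffer_head. */
struct sim_wb_page {
    int nr_buffers;
    bool async_write[SIM_MAX_BUFFERS]; /* models BH_Async_Write per buffer */
    bool writeback;                    /* models PageWriteback */
};

/* Mark a buffer as going to disk. */
static void sim_submit_buffer(struct sim_wb_page *p, int idx)
{
    assert(idx >= 0 && idx < p->nr_buffers);
    p->async_write[idx] = true;
    p->writeback = true;
}

/*
 * Completion of one buffer: clear its flag, then end page writeback
 * only if no other buffer of the page is still in flight.
 * Returns true if this completion ended writeback for the page.
 */
static bool sim_end_buffer(struct sim_wb_page *p, int idx)
{
    assert(idx >= 0 && idx < p->nr_buffers);
    p->async_write[idx] = false;
    for (int i = 0; i < p->nr_buffers; i++)
        if (p->async_write[i])
            return false;
    p->writeback = false;
    return true;
}
```

    If two buffers of a four-buffer page are in flight in separate bios, the
    first completion leaves the page under writeback and only the second one
    ends it.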
     

20 Mar, 2013

1 commit

  • Commit 84c17543ab56 (ext4: move work from io_end to inode) triggered a
    regression when running xfstest #270 when the file system is mounted
    with dioread_nolock.

    The problem is that after ext4_evict_inode() calls ext4_ioend_wait(),
    this guarantees that last io_end structure has been freed, but it does
    not guarantee that the workqueue structure, which was moved into the
    inode by commit 84c17543ab56, is actually finished. Once
    ext4_flush_completed_IO() calls ext4_free_io_end() on CPU #1, this
    will allow ext4_ioend_wait() to return on CPU #2, at which point the
    evict_inode() codepath can race against the workqueue code on CPU #1
    accessing EXT4_I(inode)->i_unwritten_work to find the next item of
    work to do.

    Fix this by calling cancel_work_sync() in ext4_ioend_wait(), which
    will be renamed ext4_ioend_shutdown(), since it is only used by
    ext4_evict_inode(). Also, move the call to ext4_ioend_shutdown()
    until after truncate_inode_pages() and filemap_write_and_wait() are
    called, to make sure all dirty pages have been written back and
    flushed from the page cache first.

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [] cwq_activate_delayed_work+0x3b/0x7e
    *pdpt = 0000000030bc3001 *pde = 0000000000000000
    Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
    Modules linked in:
    Pid: 6, comm: kworker/u:0 Not tainted 3.8.0-rc3-00013-g84c1754-dirty #91 Bochs Bochs
    EIP: 0060:[] EFLAGS: 00010046 CPU: 0
    EIP is at cwq_activate_delayed_work+0x3b/0x7e
    EAX: 00000000 EBX: 00000000 ECX: f505fe54 EDX: 00000000
    ESI: ed5b697c EDI: 00000006 EBP: f64b7e8c ESP: f64b7e84
    DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
    CR0: 8005003b CR2: 00000000 CR3: 30bc2000 CR4: 000006f0
    DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
    DR6: ffff0ff0 DR7: 00000400
    Process kworker/u:0 (pid: 6, ti=f64b6000 task=f64b4160 task.ti=f64b6000)
    Stack:
    f505fe00 00000006 f64b7e9c c01de3d7 f6435540 00000003 f64b7efc c01def1d
    f6435540 00000002 00000000 0000008a c16d0808 c040a10b c16d07d8 c16d08b0
    f505fe00 c16d0780 00000000 00000000 ee153df4 c1ce4a30 c17d0e30 00000000
    Call Trace:
    [] cwq_dec_nr_in_flight+0x71/0xfb
    [] process_one_work+0x5d8/0x637
    [] ? ext4_end_bio+0x300/0x300
    [] worker_thread+0x249/0x3ef
    [] kthread+0xd8/0xeb
    [] ? manage_workers+0x4bb/0x4bb
    [] ? trace_hardirqs_on+0x27/0x37
    [] ret_from_kernel_thread+0x1b/0x28
    [] ? __init_kthread_worker+0x71/0x71
    Code: 01 83 15 ac ff 6c c1 00 31 db 89 c6 8b 00 a8 04 74 12 89 c3 30 db 83 05 b0 ff 6c c1 01 83 15 b4 ff 6c c1 00 89 f0 e8 42 ff ff ff 13 89 f0 83 05 b8 ff 6c c1
    6c c1 00 31 c9 83
    EIP: [] cwq_activate_delayed_work+0x3b/0x7e SS:ESP 0068:f64b7e84
    CR2: 0000000000000000
    ---[ end trace a1923229da53d8a4 ]---

    Signed-off-by: "Theodore Ts'o"
    Cc: Jan Kara

    Theodore Ts'o
     

30 Jan, 2013

1 commit

  • Running AIO is pinning inode in memory using file reference. Once AIO
    is completed using aio_complete(), file reference is put and inode can
    be freed from memory. So we have to be sure that calling aio_complete()
    is the last thing we do with the inode.

    CC: stable@vger.kernel.org
    Reviewed-by: Carlos Maiolino
    Acked-by: Jeff Moyer
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     

29 Jan, 2013

2 commits

  • Remove unused variable flags from dump_completed_IO(). The code is
    only exercised when EXT4FS_DEBUG is defined.

    Signed-off-by: Lukas Czerner
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Zheng Liu

    Lukas Czerner
     
  • So far ext4_bio_writepage() unconditionally cleared dirty bit on all
    buffers underlying the page. That implicitly assumes we can write all
    buffers. So far that is true because callers of
    ext4_bio_writepage() make sure all buffers in the page are mapped but:

    a) it's a data corruption bug waiting to happen
    b) in data=ordered mode when blocksize < pagesize we do need to write
    pages that may have only some of dirty buffers mapped.

    So change ext4_bio_writepage() to skip buffers that cannot be written without
    clearing their dirty bit.

    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
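
    A simplified user-space sketch of the skipping rule. It assumes "cannot
    be written" means "not mapped to a disk block", which is narrower than
    the conditions the real function checks; the struct and function names
    are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-buffer state; not struct buffer_head. */
struct sim_buffer {
    bool dirty;
    bool mapped;     /* has an on-disk block assigned */
    bool submitted;
};

/*
 * Submit only buffers that can actually be written.  Unmapped buffers
 * keep their dirty bit so that a later pass can still write them.
 * Returns the number of buffers submitted.
 */
static int sim_write_page_buffers(struct sim_buffer *bufs, int n)
{
    int submitted = 0;

    for (int i = 0; i < n; i++) {
        if (!bufs[i].dirty || !bufs[i].mapped)
            continue;            /* skip, leaving the dirty bit alone */
        bufs[i].submitted = true;
        bufs[i].dirty = false;   /* cleared only once the buffer is queued */
        submitted++;
    }
    return submitted;
}
```

    A dirty but unmapped buffer stays dirty after the call, so its data is
    not silently dropped.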
     

28 Jan, 2013

4 commits

  • The function splices i_completed_io_list to its private list
    first. From that moment on we don't need any lock for working with
    io_end structures because all io_end structures on the list are only
    our own. So we can remove the other two lists in the function and free
    io_end immediately after we are done with it.

    CC: Dmitry Monakhov
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
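
    The splice-then-process pattern can be sketched in user space like this.
    The list layout, the sim_* names, and the counter standing in for a lock
    are all hypothetical; the kernel uses list_head and a spinlock.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical completed-IO item on a singly linked list. */
struct sim_io_item {
    struct sim_io_item *next;
    int processed;
};

struct sim_inode {
    struct sim_io_item *completed;  /* shared list, lock-protected in real code */
    int lock_acquisitions;          /* counts how often the "lock" was taken */
};

/* Detach the whole shared list in one locked step. */
static struct sim_io_item *sim_splice_completed(struct sim_inode *inode)
{
    struct sim_io_item *priv;

    inode->lock_acquisitions++;      /* stands in for taking the lock */
    priv = inode->completed;
    inode->completed = NULL;
    /* the lock would be released here */
    return priv;
}

/* Process the private list without locking; returns the number handled. */
static int sim_flush_completed(struct sim_inode *inode)
{
    struct sim_io_item *item = sim_splice_completed(inode);
    int handled = 0;

    while (item) {
        struct sim_io_item *next = item->next;

        item->processed = 1;  /* stands in for conversion and freeing */
        item->next = NULL;
        handled++;
        item = next;
    }
    return handled;
}
```

    Because the items are private after the splice, each one can be finished
    right away and no secondary lists are needed.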
     
  • It does not make much sense to have struct work in ext4_io_end_t
    because we always use it for only one ext4_io_end_t per inode (the
    first one in the i_completed_io list). So just move the structure to
    inode itself. This also allows for a small simplification in
    processing io_end structures.

    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • When we cannot write a page we should use redirty_page_for_writepage()
    instead of plain set_page_dirty(). That tells writeback code we have
    problems, redirties only the page (redirtying buffers is not needed),
    and updates mm accounting of failed page writes.

    Also move clearing of buffer dirty flag after io_submit_add_bh(). At that
    moment we are sure buffer will be going to disk.

    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • Currently we sometimes use block_write_full_page() and sometimes
    ext4_bio_write_page() for writeback (depending on mount options and call
    path). Let's always use ext4_bio_write_page() to simplify things a bit.

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     

29 Nov, 2012

1 commit

  • Previously, ext4_extents.h was being included at the end of ext4.h,
    which was bad for a number of reasons: (a) it was not being included
    in the expected place, and (b) it caused the header to be included
    multiple times. There were #ifdef's to prevent this from causing any
    problems, but it still was unnecessary.

    By moving the function declarations that were in ext4_extents.h to
    ext4.h, which is standard practice for where the function declarations
    for the rest of ext4.h can be found, we can remove ext4_extents.h from
    being included in ext4.h at all, and then we can only include
    ext4_extents.h where it is needed in ext4's source files.

    It should be possible to move a few more things into ext4.h, and
    further reduce the number of source files that need to #include
    ext4_extents.h, but that's a cleanup for another day.

    Reported-by: Sachin Kamat
    Reported-by: Wei Yongjun
    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     

09 Nov, 2012

1 commit


05 Oct, 2012

1 commit

  • BUG #1) All places where we call ext4_flush_completed_IO are broken
    because buffered io and DIO/AIO go through three stages:
    1) submitted io,
    2) completed io (in i_completed_io_list), conversion pending
    3) finished io (conversion done)
    By calling ext4_flush_completed_IO we flush only requests
    which are in stage (2), which is wrong because:
    1) punch_hole and truncate _must_ wait for all outstanding unwritten io
    regardless of its state.
    2) fsync and nolock_dio_read should also wait because there is
    a time window between end_page_writeback() and ext4_add_complete_io()
    As a result, integrity fsync is broken in the case of a buffered write
    to a fallocated region:
    fsync                                blkdev_completion
    ->filemap_write_and_wait_range
                                         ->ext4_end_bio
                                         ->end_page_writeback
    ext4_flush_completed_IO
      sees empty i_completed_io_list but pending
      conversion still exists
                                         ->ext4_add_complete_io

    BUG #2) Race window becomes wider due to the 'ext4: completed_io
    locking cleanup V4' patch series

    This patch makes the following changes:
    1) ext4_flush_completed_io() now first tries to flush completed io and then
    waits for any outstanding unwritten io via ext4_unwritten_wait()
    2) Rename function to more appropriate name.
    3) Assert that all callers of ext4_flush_unwritten_io should hold i_mutex to
    prevent endless wait

    Signed-off-by: Dmitry Monakhov
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Jan Kara

    Dmitry Monakhov
     

29 Sep, 2012

2 commits

  • Current unwritten extent conversion state-machine is very fuzzy.
    - For unknown reason it performs conversion under i_mutex. What for?
    My diagnosis:
    We already protect extent tree with i_data_sem, truncate and punch_hole
    should wait for DIO, so the only data we have to protect is end_io->flags
    modification, but only flush_completed_IO and end_io_work modify these
    flags and we can serialize them via i_completed_io_lock.

    Currently all these games with mutex_trylock result in the following deadlock
    truncate:                        kworker:
    ext4_setattr                     ext4_end_io_work
    mutex_lock(i_mutex)
    inode_dio_wait(inode) ->BLOCK
    DEADLOCK

    flags modification
    is protected by ei->ext4_complete_io_lock

    Full list of changes:
    - Move all completion end_io related routines to page-io.c in order to improve
    logic locality
    - Move open coded logic from various xx_end_xx routines to ext4_add_complete_io()
    - remove EXT4_IO_END_FSYNC
    - Improve SMP scalability by removing useless i_mutex which does not
    protect io->flags anymore.
    - Reduce lock contention on i_completed_io_lock by optimizing list walk.
    - Rename ext4_end_io_nolock to ext4_end_io and make it static
    - Check flush completion status in ext4_ext_punch_hole(), because it is
    not a good idea to punch blocks from a corrupted inode.

    Changes since V3 (in response to Jan's comments):
    Fall back to active flush_completed_IO() approach in order to prevent
    performance issues with nolocked DIO reads.
    Changes since V2:
    Fix use-after-free caused by race truncate vs end_io_work

    Signed-off-by: Dmitry Monakhov
    Signed-off-by: "Theodore Ts'o"

    Dmitry Monakhov
     
  • ext4_set_io_unwritten_flag() will increment i_unwritten counter, so
    once we mark end_io with EXT4_END_IO_UNWRITTEN we have to revert it back
    on error path.

    - add missed error checks to prevent counter leakage
    - ext4_end_io_nolock() will clear EXT4_END_IO_UNWRITTEN flag to signal
    that conversion finished.
    - add BUG_ON to ext4_free_end_io() to prevent similar leakage in future.

    Visible effect of this bug is that unaligned aio_stress may deadlock

    Reviewed-by: Jan Kara
    Signed-off-by: Dmitry Monakhov
    Signed-off-by: "Theodore Ts'o"

    Dmitry Monakhov