26 Mar, 2016

2 commits

  • In the current implementation of unaligned aio+dio, the lock
    ordering behaves as follows:

    in user process context:
    -> call io_submit()
       -> get i_mutex
             <== window1
          -> get ip_unaligned_aio
             -> submit direct io to block device
       -> release i_mutex
    -> io_submit() return

    in dio work queue context (the work queue is created in __blockdev_direct_IO):
    -> release ip_unaligned_aio
          <== window2
       -> get i_mutex
          -> clear unwritten flag & change i_size
       -> release i_mutex

    There is a limit on the number of dio work queue threads, 256 by
    default. If all 256 threads are in the 'window2' stage above while a
    user process is in the 'window1' stage, the system deadlocks: the
    user process holds i_mutex while waiting for the ip_unaligned_aio
    lock, a direct bio holds the ip_unaligned_aio mutex while waiting
    for a dio work queue thread to be scheduled, and all the dio work
    queue threads are waiting for the i_mutex lock in 'window2'.

    This case only happens in tests that submit a large number (more
    than 256) of aios in a single io_submit() call.

    My design is to remove the ip_unaligned_aio lock and make the
    unaligned aio a sync io instead. Like the ip_unaligned_aio lock did,
    this serializes the unaligned aio+dio.

    [akpm@linux-foundation.org: remove OCFS2_IOCB_UNALIGNED_IO, per Junxiao Bi]
    Signed-off-by: Ryan Ding
    Reviewed-by: Junxiao Bi
    Cc: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ryan Ding
     
  • Patchset: fix ocfs2 direct io code patch to support sparse file and data
    ordering semantics

    The idea is to use buffered io (more precisely, the interfaces
    ocfs2_write_begin_nolock & ocfs2_write_end_nolock) to do the zeroing
    work beyond block size, and to clear the UNWRITTEN flag only after
    the direct io data has been written to disk, which prevents data
    corruption if the system crashes during a direct write.

    We also achieve much better performance, e.g. dd direct write of a
    new file with 4KB block size: before this patchset:
    2.5 MB/s
    after this patchset:
    66.4 MB/s

    This patch (of 8):

    This is to support direct io in ocfs2_write_begin_nolock &
    ocfs2_write_end_nolock.

    Remove the unused args filp & flags, and add a new arg 'type'. The
    type is one of buffer/direct/mmap, indicating the 3 ways a write can
    be performed. The buffer and mmap types are implemented here; the
    direct type will be implemented later.

    Signed-off-by: Ryan Ding
    Reviewed-by: Junxiao Bi
    Cc: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ryan Ding
     

25 Jun, 2015

1 commit

  • In ocfs2 direct read/write, the OCFS2_IOCB_SEM lock type was used to
    protect the inode->i_alloc_sem rw semaphore in earlier kernel
    versions. In the latest kernel, however, inode->i_alloc_sem is not
    used at all, so the OCFS2_IOCB_SEM lock type can be removed.

    Signed-off-by: Weiwei Wang
    Cc: Mark Fasheh
    Cc: Joel Becker
    Reviewed-by: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    WeiWei Wang
     

26 Mar, 2015

1 commit


04 Apr, 2014

1 commit

  • There is a problem that waitqueue_active() may check stale data and
    thus miss a wakeup of threads waiting on ip_unaligned_aio.

    The only valid values of ip_unaligned_aio are 0 and 1, so we can
    change it to a mutex, avoiding the above problem. Another benefit is
    that a mutex, which works as a FIFO, is fairer than wake_up_all().

    Signed-off-by: Wengang Wang
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wengang Wang
     

08 May, 2013

1 commit

  • Faster kernel compiles by way of fewer unnecessary includes.

    [akpm@linux-foundation.org: fix fallout]
    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Kent Overstreet
    Cc: Zach Brown
    Cc: Felipe Balbi
    Cc: Greg Kroah-Hartman
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Rusty Russell
    Cc: Jens Axboe
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Jeff Moyer
    Cc: Al Viro
    Cc: Benjamin LaHaise
    Reviewed-by: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kent Overstreet
     

28 Jul, 2011

1 commit

  • Fix a corruption that can happen when we have (two or more)
    outstanding aios to an overlapping unaligned region. Ext4
    (e9e3bcecf44c04b9e6b505fd8e2eb9cea58fb94d) and xfs recently had to
    fix similar issues.

    In our case what happens is that we can have an outstanding aio on a
    region, and if a write comes in with some bytes overlapping the
    original aio, we may decide to read that region into a page before
    continuing (typically because of a buffered-io fallback). Since we
    have no ordering guarantees with the aio, we can read stale or bad
    data into the page and then write it back out.

    If the i/o is page and block aligned, then we avoid this issue as there
    won't be any need to read data from disk.

    I took the same approach as Eric in the ext4 patch and introduced some
    serialization of unaligned async direct i/o. I don't expect this to have an
    effect on the most common cases of AIO. Unaligned aio will be slower
    though, but that's far more acceptable than data corruption.

    Signed-off-by: Mark Fasheh
    Signed-off-by: Joel Becker

    Mark Fasheh
     

31 Mar, 2011

1 commit


10 Dec, 2010

1 commit

  • Because the newly-introduced 'coherency=full' makes O_DIRECT writes
    also take the EX rw_lock the way buffered writes do (rw_level == 1),
    the use of 'level' in ocfs2_dio_end_io() got mixed up, which caused
    i_alloc_sem to not be up_read'd correctly.

    This patch teaches ocfs2_dio_end_io() to handle all the locking
    correctly by explicitly introducing a new bit for i_alloc_sem in the
    iocb's private data, just as we did for rw_lock.

    Signed-off-by: Tristan Ye
    Signed-off-by: Joel Becker

    Tristan Ye
     

26 Oct, 2010

1 commit

  • __block_write_begin and block_prepare_write are identical except for slightly
    different calling conventions. Convert all callers to the __block_write_begin
    calling conventions and drop block_prepare_write.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

12 Aug, 2010

1 commit


23 Sep, 2009

1 commit

  • This patch adds CoW support for a refcounted record.

    The whole process is:
    1. Calculate how many clusters we need to CoW and where to start.
    Extents that are not completely encompassed by the write will
    be broken on 1MB boundaries.
    2. Do CoW for the clusters with the help of page cache.
    3. Change the b-tree structure with the new allocated clusters.

    Signed-off-by: Tao Ma

    Tao Ma
     

17 Oct, 2007

1 commit

  • Plug ocfs2 into the ->write_begin and ->write_end aops.

    A bunch of custom code is now gone - the iovec iteration stuff during write
    and the ocfs2 splice write actor.

    Signed-off-by: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

13 Oct, 2007

2 commits

  • This fixes up write, truncate, mmap, and RESVSP/UNRESVP to understand inline
    inode data.

    For the most part, the changes to the core write code can be relied on to do
    the heavy lifting. Any code calling ocfs2_write_begin (including shared
    writeable mmap) can count on it doing the right thing with respect to
    growing inline data to an extent tree.

    Size-reducing truncates, including UNRESVP, can simply zero the
    portion of the inode block being removed. Size-increasing truncates,
    including RESVP, have to be a little bit smarter and grow the inode
    to an extent tree if necessary.

    Signed-off-by: Mark Fasheh
    Reviewed-by: Joel Becker

    Mark Fasheh
     
  • We'll want to reuse most of this when pushing inline data back out
    to an extent. Keeping this part as a separate patch helps keep the
    upcoming changes for write support uncluttered.

    The core portion of ocfs2_zero_cluster_pages() responsible for
    making sure a page is mapped and properly dirtied is abstracted out
    into its own function, ocfs2_map_and_dirty_page(). Actual
    functionality doesn't change, though zeroing becomes optional.

    We also turn part of ocfs2_free_write_ctxt() into a common function for
    unlocking and freeing a page array. This operation is very common (and
    uniform) for Ocfs2 cluster sizes greater than page size, so it makes sense
    to keep the code in one place.

    Signed-off-by: Mark Fasheh
    Reviewed-by: Joel Becker

    Mark Fasheh
     

11 Jul, 2007

2 commits

  • Implement cluster consistent shared writeable mappings using the
    ->page_mkwrite() callback.

    Signed-off-by: Mark Fasheh

    Mark Fasheh
     
  • Use some ideas from the new-aops patch series and turn
    ocfs2_buffered_write_cluster() into a 2 stage operation with the caller
    copying data in between. The code now understands multiple cluster writes as
    a result of having to deal with a full page write for greater than 4k pages.

    This sets us up to easily call into the write path during ->page_mkwrite().

    Signed-off-by: Mark Fasheh

    Mark Fasheh
     

27 Apr, 2007

4 commits


02 Dec, 2006

1 commit


18 May, 2006

1 commit

  • We need to take a data lock around extends to protect the pages that
    ocfs2_zero_extend is going to be pulling into the page cache. Otherwise an
    extend on one node might populate the page cache with data pages that have
    no lock coverage.

    Signed-off-by: Mark Fasheh

    Mark Fasheh
     

04 Jan, 2006

1 commit