22 Aug, 2011

1 commit


28 Jul, 2011

2 commits

  • When someone writes to an inode, readers accessing the same inode via
    ocfs2_readpage() just busyloop trying to get ip_alloc_sem because
    do_generic_file_read() looks up the page again and retries ->readpage()
    when the previous attempt failed with AOP_TRUNCATED_PAGE. When there are
    enough readers, they can occupy all CPUs, and on a non-preempt kernel the
    system is deadlocked because the writer holding ip_alloc_sem never runs to
    release the semaphore. Fix the problem by making readers block on
    ip_alloc_sem to break the busy loop.

    Signed-off-by: Jan Kara
    Signed-off-by: Joel Becker

    Jan Kara
     
  • Fix a corruption that can happen when we have (two or more) outstanding
    aio's to an overlapping unaligned region. Ext4
    (e9e3bcecf44c04b9e6b505fd8e2eb9cea58fb94d) and xfs recently had to fix
    similar issues.

    In our case what happens is that we can have an outstanding aio on a region
    and if a write comes in with some bytes overlapping the original aio we may
    decide to read that region into a page before continuing (typically because
    of buffered-io fallback). Since we have no ordering guarantees with the
    aio, we can read stale or bad data into the page and then write it back out.

    If the i/o is page and block aligned, then we avoid this issue as there
    won't be any need to read data from disk.

    I took the same approach as Eric in the ext4 patch and introduced some
    serialization of unaligned async direct i/o. I don't expect this to have an
    effect on the most common cases of AIO. Unaligned aio will be slower
    though, but that's far more acceptable than data corruption.
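
    A minimal sketch of the alignment test that decides which requests need
    the extra serialization. The helper name and parameters are hypothetical
    and not taken from the patch; the check only illustrates "block aligned"
    as used above (both the start and the end of the request fall on a block
    boundary).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true when a request of `count` bytes at
 * file offset `pos` does not start and end on a block boundary.
 * `blocksize_bits` is log2 of the filesystem block size.
 */
static bool model_io_is_unaligned(uint64_t pos, size_t count,
                                  unsigned int blocksize_bits)
{
    uint64_t blockmask = ((uint64_t)1 << blocksize_bits) - 1;

    /* Unaligned if either the first or the last byte offset is mid-block. */
    return (pos & blockmask) != 0 || ((pos + count) & blockmask) != 0;
}
```

    Only requests for which such a check is true would take the slower,
    serialized path; aligned requests keep their existing behaviour.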

    Signed-off-by: Mark Fasheh
    Signed-off-by: Joel Becker

    Mark Fasheh
     

25 Jul, 2011

1 commit

  • This patch addresses two shortcomings in ocfs2_page_mkwrite():
    1. It makes the function return better VM_FAULT_* errors.
    2. It handles an error that is triggered when a page is dropped from the mapping
    due to memory pressure. This patch locks the page to prevent that.

    [Patch was cleaned up by Sunil Mushran.]

    Signed-off-by: Wengang Wang
    Signed-off-by: Sunil Mushran

    Wengang Wang
     

21 Jul, 2011

3 commits

  • For filesystems that delay their end_io processing we should keep our
    i_dio_count until the processing is done. Enable this by moving
    the inode_dio_done call to the end_io handler if one exists. Note that
    the actual move to the workqueue for ext4 and XFS is not done in
    this patch yet, but left to the filesystem maintainers. At least
    for XFS it's not needed yet either as XFS has an internal equivalent
    to i_dio_count.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Maintain i_dio_count for all filesystems, not just those using DIO_LOCKING.
    This allows these filesystems to also protect truncate against direct I/O requests
    by using common code. Right now the only non-DIO_LOCKING filesystem that
    appears to do so is XFS, which uses an opencoded variant of the i_dio_count
    scheme.

    Behaviour doesn't change for filesystems never calling inode_dio_wait.
    For ext4 behaviour changes when using the dioread_nonlock option, which
    previously was missing any protection between truncate and direct I/O reads.
    For ocfs2 the handcrafted i_dio_count manipulations are replaced with
    the common code now enabled.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • i_alloc_sem is a rather special rw_semaphore. It's the last one that may
    be released by a non-owner, and its write side is always mirrored by
    real exclusion. Its intended use is to wait for all pending direct I/O
    requests to finish before starting a truncate.

    Replace it with a hand-grown construct:

    - exclusion for truncates is already guaranteed by i_mutex, so it can
    simply fall away
    - the reader side is replaced by an i_dio_count member in struct inode
    that counts the number of pending direct I/O requests. Truncate can't
    proceed as long as it's non-zero
    - when i_dio_count reaches zero we wake up a pending truncate using
    wake_up_bit on a new bit in i_flags
    - new references to i_dio_count can't appear while we are waiting for
    it to read zero because the direct I/O count always needs i_mutex
    (or an equivalent like XFS's i_iolock) for starting a new operation.

    This scheme is much simpler, and saves the space of a spinlock_t and a
    struct list_head in struct inode (typically 160 bits on a non-debug 64-bit
    system).
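
    The counting scheme above can be sketched as a small userspace model.
    This is an illustration under stated assumptions, not the kernel code:
    all names are hypothetical, the wait bit is modelled as a plain flag, and
    the model is single-threaded (the kernel relies on i_mutex and
    wake_up_bit for the real ordering).

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relevant fields of struct inode. */
struct model_inode {
    atomic_int dio_count;   /* number of pending direct I/O requests */
    bool truncate_waiting;  /* models the new wait bit in i_flags */
    int wakeups;            /* how often a waiting truncate was woken */
};

static void model_inode_init(struct model_inode *inode)
{
    atomic_init(&inode->dio_count, 0);
    inode->truncate_waiting = false;
    inode->wakeups = 0;
}

/* Start a direct I/O request; the caller is assumed to hold the
 * i_mutex equivalent, so this cannot race with a waiting truncate. */
static void model_dio_begin(struct model_inode *inode)
{
    atomic_fetch_add(&inode->dio_count, 1);
}

/* Finish a request; wake a waiting truncate when the count hits zero. */
static void model_dio_done(struct model_inode *inode)
{
    if (atomic_fetch_sub(&inode->dio_count, 1) == 1 &&
        inode->truncate_waiting) {
        inode->truncate_waiting = false;
        inode->wakeups++;
    }
}

/* Truncate side: proceed only when no direct I/O is pending,
 * otherwise register as a waiter and report that we must wait. */
static bool model_truncate_try(struct model_inode *inode)
{
    if (atomic_load(&inode->dio_count) == 0)
        return true;
    inode->truncate_waiting = true;
    return false;
}
```

    With two requests in flight, model_truncate_try() reports that truncate
    must wait, and only the second model_dio_done() call records a wakeup.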

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

29 Mar, 2011

2 commits

  • * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: (39 commits)
    Treat writes as new when holes span across page boundaries
    fs,ocfs2: Move o2net_get_func_run_time under CONFIG_OCFS2_FS_STATS.
    ocfs2/dlm: Move kmalloc() outside the spinlock
    ocfs2: Make the left masklogs compat.
    ocfs2: Remove masklog ML_AIO.
    ocfs2: Remove masklog ML_UPTODATE.
    ocfs2: Remove masklog ML_BH_IO.
    ocfs2: Remove masklog ML_JOURNAL.
    ocfs2: Remove masklog ML_EXPORT.
    ocfs2: Remove masklog ML_DCACHE.
    ocfs2: Remove masklog ML_NAMEI.
    ocfs2: Remove mlog(0) from fs/ocfs2/dir.c
    ocfs2: remove NAMEI from symlink.c
    ocfs2: Remove masklog ML_QUOTA.
    ocfs2: Remove mlog(0) from quota_local.c.
    ocfs2: Remove masklog ML_RESERVATIONS.
    ocfs2: Remove masklog ML_XATTR.
    ocfs2: Remove masklog ML_SUPER.
    ocfs2: Remove mlog(0) from fs/ocfs2/heartbeat.c
    ocfs2: Remove mlog(0) from fs/ocfs2/slot_map.c
    ...

    Fix up trivial conflict in fs/ocfs2/super.c

    Linus Torvalds
     
  • When a hole spans across page boundaries, the next write forces
    a read of the block. This could end up reading existing garbage
    data from the disk in ocfs2_map_page_blocks. This leads to
    non-zero holes. In order to avoid this, mark the writes as new
    when the holes span across page boundaries.

    Signed-off-by: Goldwyn Rodrigues
    Signed-off-by: jlbec

    Goldwyn Rodrigues
     

10 Mar, 2011

1 commit

  • Code has been converted over to the new explicit on-stack plugging,
    and delay users have been converted to use the new API for that.
    So let's kill off the old plugging along with aops->sync_page().

    Signed-off-by: Jens Axboe

    Jens Axboe
     

07 Mar, 2011

1 commit

  • mlog_exit is used to record the exit status of a function.
    But because it is added in so many functions, if we enable it,
    the system logs fill up quickly and cause too much I/O.
    So in practice no one can enable it on a production system or even
    for a test.

    This patch tries to remove it or change it. So:
    1. If all the error paths already use mlog_errno, it is just removed.
    Otherwise, it is replaced by mlog_errno.
    2. If it is used to print some return value, it is replaced with
    mlog(0,...).
    mlog_exit_ptr is changed to mlog(0,...).
    All those mlog(0,...) will be replaced with trace events later.

    Signed-off-by: Tao Ma

    Tao Ma
     

22 Feb, 2011

1 commit


21 Feb, 2011

1 commit

  • ENTRY is used to record the entry of a function.
    But because it is added in so many functions, if we enable it,
    the system logs fill up quickly and cause too much I/O.
    So in practice no one can enable it on a production system or even
    for a test.

    So for mlog_entry_void, we just remove it.
    For mlog_entry(...), we replace it with mlog(0,...), and these
    will be replaced by trace events later.

    Signed-off-by: Tao Ma

    Tao Ma
     

12 Jan, 2011

1 commit

  • * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: (22 commits)
    MAINTAINERS: Update Joel Becker's email address
    ocfs2: Remove unused truncate function from alloc.c
    ocfs2/cluster: dereferencing before checking in nst_seq_show()
    ocfs2: fix build for OCFS2_FS_STATS not enabled
    ocfs2/cluster: Show o2net timing statistics
    ocfs2/cluster: Track process message timing stats for each socket
    ocfs2/cluster: Track send message timing stats for each socket
    ocfs2/cluster: Use ktime instead of timeval in struct o2net_sock_container
    ocfs2/cluster: Replace timeval with ktime in struct o2net_send_tracking
    ocfs2: Add DEBUG_FS dependency
    ocfs2/dlm: Hard code the values for enums
    ocfs2/dlm: Minor cleanup
    ocfs2/dlm: Cleanup dlmdebug.c
    ocfs2: Release buffer_head in case of error in ocfs2_double_lock.
    ocfs2/cluster: Pin the local node when o2hb thread starts
    ocfs2/cluster: Show pin state for each o2hb region
    ocfs2/cluster: Pin/unpin o2hb regions
    ocfs2/cluster: Remove dropped region from o2hb quorum region bitmap
    ocfs2/cluster: Pin the remote node item in configfs
    ocfs2/dlm: make existing convertion precedent over new lock
    ...

    Linus Torvalds
     

16 Dec, 2010

1 commit

  • Recently, one of our colleagues ran into a problem: if we repeatedly
    write and delete a 32MB file, we eventually get ENOSPC. The
    corresponding bug is 1288.
    http://oss.oracle.com/bugzilla/show_bug.cgi?id=1288

    The real problem is that although we have freed the clusters,
    they are still in the truncate log, where they are accumulated so
    that we can free them all at once.

    This patch tries to resolve that. If we see -ENOSPC in
    ocfs2_write_begin_no_lock, we check whether the truncate log holds
    enough clusters for our needs. If so, we flush the truncate log at
    that point and try again. This method is inspired by Mark Fasheh.
    Thanks.
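
    The retry logic described above can be sketched with a toy allocator.
    Everything here is a hypothetical model (struct, field and function
    names are invented for illustration); it only shows the
    "flush on -ENOSPC, then retry once" control flow, not ocfs2's actual
    allocation code.

```c
#include <errno.h>

/* Hypothetical model of the allocator state. */
struct model_fs {
    unsigned int free_clusters; /* clusters the allocator can hand out now */
    unsigned int truncate_log;  /* freed clusters still parked in the log */
};

static int model_try_alloc(struct model_fs *fs, unsigned int want)
{
    if (fs->free_clusters < want)
        return -ENOSPC;
    fs->free_clusters -= want;
    return 0;
}

/* Return all parked clusters to the allocator in one go. */
static void model_flush_truncate_log(struct model_fs *fs)
{
    fs->free_clusters += fs->truncate_log;
    fs->truncate_log = 0;
}

static int model_write_begin(struct model_fs *fs, unsigned int want)
{
    int ret = model_try_alloc(fs, want);

    if (ret != -ENOSPC)
        return ret;

    /* Flushing only helps if the log holds enough clusters for us. */
    if (fs->truncate_log < want)
        return ret;

    model_flush_truncate_log(fs);
    return model_try_alloc(fs, want); /* retry once after the flush */
}
```

    When the log cannot cover the request, the model leaves it untouched
    and returns -ENOSPC, so the flush cost is only paid when it can help.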

    Cc: Mark Fasheh
    Signed-off-by: Tao Ma
    Signed-off-by: Joel Becker

    Tao Ma
     

10 Dec, 2010

1 commit

  • With the newly introduced 'coherency=full', O_DIRECT writes also take the EX
    rw_lock like buffered writes do (rw_level == 1). This broke the use of
    'level' in ocfs2_dio_end_io(), so i_alloc_sem was not up_read'd
    correctly.

    This patch teaches ocfs2_dio_end_io() to handle all the locking state
    by explicitly introducing a new bit for i_alloc_sem in the iocb's private
    data, just like what we did for rw_lock.

    Signed-off-by: Tristan Ye
    Signed-off-by: Joel Becker

    Tristan Ye
     

26 Oct, 2010

1 commit

  • __block_write_begin and block_prepare_write are identical except for slightly
    different calling conventions. Convert all callers to the __block_write_begin
    calling conventions and drop block_prepare_write.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

10 Sep, 2010

2 commits


12 Aug, 2010

2 commits


10 Aug, 2010

1 commit

  • Move the call to vmtruncate to get rid of excessive blocks to the callers
    in preparation for the new truncate calling sequence. This was only done
    for DIO_LOCKING filesystems, so the __blockdev_direct_IO_newtrunc variant
    was not needed anyway. Get rid of blockdev_direct_IO_no_locking and
    its _newtrunc variant while at it as just opencoding the two additional
    parameters is shorter than the name suffix.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

27 Jul, 2010

1 commit

  • Filesystems with unwritten extent support must not complete an AIO request
    until the transaction to convert the extent has been committed. That means
    the aio_complete call needs to be moved into the ->end_io callback so
    that the filesystem can control when to call it exactly.

    This makes a bit of a mess out of dio_complete and makes the ->end_io
    callback prototype even more complicated.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Signed-off-by: Alex Elder

    Christoph Hellwig
     

13 Jul, 2010

1 commit

  • When ocfs2 fills a hole, it does so by allocating clusters. When a
    cluster is larger than the write, ocfs2 must zero the portions of the
    cluster outside of the write. If the clustersize is smaller than a
    pagecache page, this is handled by the normal pagecache mechanisms, but
    when the clustersize is larger than a page, ocfs2's write code will zero
    the pages adjacent to the write. This makes sure the entire cluster is
    zeroed correctly.

    Currently ocfs2 behaves exactly the same when writing past i_size.
    However, this means ocfs2 is writing zeroed pages for portions of a new
    cluster that are beyond i_size. The page writeback code isn't expecting
    this. It treats all pages past the one containing i_size as left behind
    due to a previous truncate operation.

    Thankfully, ocfs2 calculates the number of pages it will be working on
    up front. The rest of the write code merely honors the original
    calculation. We can simply trim the number of pages to only cover the
    actual file data.
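
    A sketch of the trimming calculation, assuming 4 KiB pages and a
    cluster size of at least one page. The function and parameter names
    are hypothetical, and the arithmetic is an illustration of "only cover
    the actual file data", not a copy of the ocfs2 code. It also assumes
    the write lies inside cluster cpos and that at least one byte is
    covered.

```c
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12u /* assumption: 4 KiB pagecache pages */

/*
 * Number of pages to work on for a newly allocated cluster `cpos`
 * (cluster index), given a write of `len` bytes at byte offset `pos`
 * and the current file size `i_size`. `cluster_bits` is log2 of the
 * cluster size and must be >= MODEL_PAGE_SHIFT.
 */
static uint64_t model_pages_for_new_cluster(uint64_t cpos,
                                            unsigned int cluster_bits,
                                            uint64_t pos, uint64_t len,
                                            uint64_t i_size)
{
    unsigned int shift = cluster_bits - MODEL_PAGE_SHIFT;
    uint64_t num_pages = (uint64_t)1 << shift; /* pages per cluster */
    uint64_t start = cpos << shift;            /* first page of the cluster */

    /* Last byte of real file data after this write completes. */
    uint64_t last_byte = (pos + len > i_size) ? pos + len : i_size;

    /* Index one past the last page that holds file data. */
    uint64_t end_index = ((last_byte - 1) >> MODEL_PAGE_SHIFT) + 1;

    /* Trim pages of the cluster that lie wholly beyond the file data. */
    if (start + num_pages > end_index)
        num_pages = end_index - start;
    return num_pages;
}
```

    For a 32 KiB cluster and a 100-byte write at offset 0 of an empty
    file, the model returns 1 page instead of the 8 pages in the cluster.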

    Signed-off-by: Joel Becker
    Cc: stable@kernel.org

    Joel Becker
     

09 Jul, 2010

2 commits

  • ocfs2's allocation unit is the cluster. This can be larger than a block
    or even a memory page. This means that a file may have many blocks in
    its last extent that are beyond the block containing i_size. There also
    may be more unwritten extents after that.

    When ocfs2 grows a file, it zeros the entire cluster in order to ensure
    future i_size growth will see cleared blocks. Unfortunately,
    block_write_full_page() drops the pages past i_size. This means that
    ocfs2 is actually leaking garbage data into the tail end of that last
    cluster. This is a bug.

    We adjust ocfs2_write_begin_nolock() and ocfs2_extend_file() to detect
    when a write or truncate is past i_size. They will use
    ocfs2_zero_extend() to ensure the data is properly zeroed.

    Older versions of ocfs2_zero_extend() simply zeroed every block between
    i_size and the zeroing position. This presumes three things:

    1) There is allocation for all of these blocks.
    2) The extents are not unwritten.
    3) The extents are not refcounted.

    (1) and (2) hold true for non-sparse filesystems, which used to be the
    only users of ocfs2_zero_extend(). (3) is another bug.

    Since we're now using ocfs2_zero_extend() for sparse filesystems as
    well, we teach ocfs2_zero_extend() to check every extent between
    i_size and the zeroing position. If the extent is unwritten, it is
    ignored. If it is refcounted, it is CoWed. Then it is zeroed.
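
    The per-extent decision in the last paragraph can be written down as a
    small sketch. The flag and action names are hypothetical and do not
    correspond to ocfs2's on-disk extent flags; holes (ranges with no
    allocation) are not modelled.

```c
/* Hypothetical extent flags for illustration only. */
enum model_ext_flags {
    MODEL_EXT_UNWRITTEN  = 1 << 0,
    MODEL_EXT_REFCOUNTED = 1 << 1,
};

/* Actions the zeroing walk would take for one extent. */
enum model_zero_action {
    MODEL_ACT_NONE = 0,
    MODEL_ACT_COW  = 1 << 0, /* break sharing before touching the data */
    MODEL_ACT_ZERO = 1 << 1, /* write zeros over the range */
};

static unsigned int model_zero_plan(unsigned int ext_flags)
{
    unsigned int plan;

    /* Unwritten extents already read back as zeros: ignore them. */
    if (ext_flags & MODEL_EXT_UNWRITTEN)
        return MODEL_ACT_NONE;

    plan = MODEL_ACT_ZERO;

    /* Refcounted extents are CoWed first, then zeroed. */
    if (ext_flags & MODEL_EXT_REFCOUNTED)
        plan |= MODEL_ACT_COW;
    return plan;
}
```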

    Signed-off-by: Joel Becker
    Cc: stable@kernel.org

    Joel Becker
     
  • ocfs2_zero_extend() does its zeroing block by block, but it calls a
    function named ocfs2_write_zero_page(). Let's have
    ocfs2_write_zero_page() handle the page level. From
    ocfs2_zero_extend()'s perspective, it is now page-at-a-time.

    Signed-off-by: Joel Becker
    Cc: stable@kernel.org

    Joel Becker
     

06 May, 2010

1 commit


06 Mar, 2010

1 commit

  • * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: (33 commits)
    quota: stop using QUOTA_OK / NO_QUOTA
    dquot: cleanup dquot initialize routine
    dquot: move dquot initialization responsibility into the filesystem
    dquot: cleanup dquot drop routine
    dquot: move dquot drop responsibility into the filesystem
    dquot: cleanup dquot transfer routine
    dquot: move dquot transfer responsibility into the filesystem
    dquot: cleanup inode allocation / freeing routines
    dquot: cleanup space allocation / freeing routines
    ext3: add writepage sanity checks
    ext3: Truncate allocated blocks if direct IO write fails to update i_size
    quota: Properly invalidate caches even for filesystems with blocksize < pagesize
    quota: generalize quota transfer interface
    quota: sb_quota state flags cleanup
    jbd: Delay discarding buffers in journal_unmap_buffer
    ext3: quota_write cross block boundary behaviour
    quota: drop permission checks from xfs_fs_set_xstate/xfs_fs_set_xquota
    quota: split out compat_sys_quotactl support from quota.c
    quota: split out netlink notification support from quota.c
    quota: remove invalid optimization from quota_sync_all
    ...

    Fixed trivial conflicts in fs/namei.c and fs/ufs/inode.c

    Linus Torvalds
     

05 Mar, 2010

1 commit

  • Get rid of the alloc_space, free_space, reserve_space, claim_space and
    release_rsv dquot operations - they are always called from the filesystem
    and if a filesystem really needs its own (which none currently does)
    it can just call into its own routine directly.

    Move shared logic into the common __dquot_alloc_space,
    dquot_claim_space_nodirty and __dquot_free_space low-level methods,
    and rationalize the wrappers around them to move as much code as
    possible into the common block for CONFIG_QUOTA vs not. Also rename
    all these helpers to be named dquot_* instead of vfs_dq_*.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Christoph Hellwig
     

27 Feb, 2010

1 commit


26 Jan, 2010

1 commit


16 Dec, 2009

1 commit

  • Currently the locking in blockdev_direct_IO is a mess, we have three
    different locking types and very confusing checks for some of them. The
    most complicated one is DIO_OWN_LOCKING for reads, which happens to not
    actually be used.

    This patch gets rid of the DIO_OWN_LOCKING - as mentioned above the read
    case is unused anyway, and the write side is almost identical to
    DIO_NO_LOCKING. The difference is that DIO_NO_LOCKING always sets the
    create argument for the get_blocks callback to zero, but we can easily
    move that to the actual get_blocks callbacks. There are four users of the
    DIO_NO_LOCKING mode: gfs already ignores the create argument and thus is
    fine with the new version, ocfs2 only errors out if create were ever set,
    and we can remove this dead code now, the block device code only ever uses
    create for an error message if we are fully beyond the device which can
    never happen, and last but not least XFS will need the new behaviour for
    writes.

    Now we can replace the lock_type variable with a flags one, where no flag
    means the DIO_NO_LOCKING behaviour and DIO_LOCKING is kept as the first
    flag. Separate out the check for not allowing to fill holes into a
    separate flag, although for now both flags always get set at the same
    time.
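
    A sketch of what replacing lock_type with a flags word could look
    like. The MODEL_ flag names, their values, and the rule used for the
    hole-filling check (never create blocks inside i_size when the flag is
    set) are assumptions made for illustration; the message above only
    states that the check becomes a separate flag.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flags; no flag set models the DIO_NO_LOCKING behaviour. */
enum {
    MODEL_DIO_LOCKING    = 0x01, /* generic code does the locking */
    MODEL_DIO_SKIP_HOLES = 0x02, /* do not fill holes inside i_size */
};

/*
 * Decide the `create` argument handed to a get_blocks-style callback.
 * `block_in_file` is the block index being mapped, `blkbits` is log2
 * of the block size.
 */
static bool model_dio_create(bool is_write, unsigned int flags,
                             uint64_t block_in_file, uint64_t i_size,
                             unsigned int blkbits)
{
    bool create = is_write; /* reads never allocate */

    /* With the hole flag set, blocks below i_size are mapped, not created. */
    if ((flags & MODEL_DIO_SKIP_HOLES) &&
        block_in_file < (i_size >> blkbits))
        create = false;
    return create;
}
```

    Keeping the hole check on its own bit lets a caller combine or drop it
    independently of the locking behaviour, even though both are set
    together for now.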

    Also revamp the documentation of the locking scheme to actually make
    sense.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Badari Pulavarty
    Cc: Jeff Moyer
    Cc: Jens Axboe
    Cc: Zach Brown
    Cc: Al Viro
    Cc: Alex Elder
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

24 Sep, 2009

1 commit

  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
    HWPOISON: Enable error_remove_page on btrfs
    HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
    HWPOISON: Add madvise() based injector for hardware poisoned pages v4
    HWPOISON: Enable error_remove_page for NFS
    HWPOISON: Enable .remove_error_page for migration aware file systems
    HWPOISON: The high level memory error handler in the VM v7
    HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
    HWPOISON: shmem: call set_page_dirty() with locked page
    HWPOISON: Define a new error_remove_page address space op for async truncation
    HWPOISON: Add invalidate_inode_page
    HWPOISON: Refactor truncate to allow direct truncating of page v2
    HWPOISON: check and isolate corrupted free pages v2
    HWPOISON: Handle hardware poisoned pages in try_to_unmap
    HWPOISON: Use bitmask/action code for try_to_unmap behaviour
    HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
    HWPOISON: Add poison check to page fault handling
    HWPOISON: Add basic support for poisoned pages in fault handler v3
    HWPOISON: Add new SIGBUS error codes for hardware poison signals
    HWPOISON: Add support for poison swap entries v2
    HWPOISON: Export some rmap vma locking to outside world
    ...

    Linus Torvalds
     

23 Sep, 2009

4 commits

  • In ocfs2_file_aio_write, we prevent direct I/O if we find that we
    are appending (changing i_size) and call
    generic_file_aio_write_nolock instead. But the O_DIRECT flag is still
    set, so this function eventually calls generic_file_direct_write,
    which updates i_size and leaves di->i_size alone. The bug is
    http://oss.oracle.com/bugzilla/show_bug.cgi?id=1173.

    So this patch lets ocfs2_direct_IO return 0 directly if we
    are appending, so that a buffered write is performed and
    di->i_size gets updated successfully. This is also
    what we want in ocfs2_file_aio_write.

    Signed-off-by: Tao Ma
    Signed-off-by: Joel Becker

    Tao Ma
     
  • When we truncate a file to a specific size which resides in a reflinked
    cluster, we need to CoW it since ocfs2_zero_range_for_truncate will
    zero the space after the size (just another type of write).

    So we add a "max_cpos" in ocfs2_refcount_cow so that it will stop when
    it hits the max cluster offset.

    Signed-off-by: Tao Ma

    Tao Ma
     
  • When we use mmap, we CoW the refcounted clusters in
    ocfs2_write_begin_nolock. For normal file
    I/O (including direct I/O), we do CoW in
    ocfs2_prepare_inode_for_write.

    Signed-off-by: Tao Ma

    Tao Ma
     
  • This patch adds CoW support for a refcounted record.

    The whole process is:
    1. Calculate how many clusters we need to CoW and where we start.
    Extents that are not completely encompassed by the write will
    be broken on 1MB boundaries.
    2. Do CoW for the clusters with the help of page cache.
    3. Change the b-tree structure with the newly allocated clusters.

    Signed-off-by: Tao Ma

    Tao Ma
     

16 Sep, 2009

1 commit

  • Enable removing of corrupted pages through truncation
    for a bunch of file systems: ext*, xfs, gfs2, ocfs2, ntfs
    These should cover most server needs.

    I chose the set of migration aware file systems for this
    for now, assuming they have been especially audited.
    But in general it should be safe for all file systems
    on the data area that support read/write and truncate.

    Caveat: the hardware error handler does not take i_mutex
    for now before calling the truncate function. Is that ok?

    Cc: tytso@mit.edu
    Cc: hch@infradead.org
    Cc: mfasheh@suse.com
    Cc: aia21@cantab.net
    Cc: hugh.dickins@tiscali.co.uk
    Cc: swhiteho@redhat.com
    Signed-off-by: Andi Kleen

    Andi Kleen
     

05 Sep, 2009

2 commits