13 Apr, 2016

1 commit


11 Apr, 2016

1 commit

  • This reverts commit 1028b55bafb7611dda1d8fed2aeca16a436b7dff.

    It's broken: it makes ext4 return an error at an invalid point, causing
    the readdir wrappers to write the position of the last successful
    directory entry into the position field, which means that the next
    readdir will now return that last successful entry _again_.

    You can only return fatal errors (that terminate the readdir directory
    walk) from within the filesystem readdir functions, the "normal" errors
    (that happen when the readdir buffer fills up, for example) happen in
    the iterator where we know the position of the actual failing entry.

    I do have a very different patch that does the "signal_pending()"
    handling inside the iterator function where it is allowable, but while
    that one passes all the sanity checks, I screwed up something like four
    times while emailing it out, so I'm not going to commit it today.

    So my track record is not good enough, and the stars will have to align
    better before that one gets committed. And it would be good to get some
    review too, of course, since celestial alignments are always an iffy
    debugging model.

    IOW, let's just revert the commit that caused the problem for now.

    Reported-by: Greg Thelen
    Cc: Theodore Ts'o
    Signed-off-by: Linus Torvalds
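The convention described above (a "buffer full" stop is signalled by the iterator callback, and the position only moves past entries that were accepted) can be sketched in userspace. This is a toy model under stated assumptions: `toy_ctx`, `toy_emit` and `toy_readdir` are hypothetical names, not the kernel's readdir API, and the sketch shows only the well-behaved pattern, not the reverted code.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: none of these names are kernel APIs. */
struct toy_ctx {
    size_t pos;    /* index of the next entry to hand out */
    int   *buf;    /* caller's buffer of entry ids */
    size_t cap;    /* buffer capacity */
    size_t used;   /* entries stored so far */
};

/* Iterator callback: the only place a "buffer full" stop is signalled. */
static int toy_emit(struct toy_ctx *ctx, int entry)
{
    if (ctx->used == ctx->cap)
        return 0;              /* full: the entry at ctx->pos was not consumed */
    ctx->buf[ctx->used++] = entry;
    return 1;
}

/* "Filesystem" side: advance pos only after the entry was accepted. */
static int toy_readdir(struct toy_ctx *ctx, const int *dir, size_t n)
{
    assert(ctx != NULL && dir != NULL);
    while (ctx->pos < n) {
        if (!toy_emit(ctx, dir[ctx->pos]))
            break;             /* normal stop, not an error */
        ctx->pos++;
    }
    return 0;                  /* only a fatal error would be negative */
}
```

Because `pos` always names the first entry that was not delivered, a second call with an emptied buffer resumes there and no entry is returned twice.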

    Linus Torvalds
     

08 Apr, 2016

1 commit

  • Pull ext4 bugfixes from Ted Ts'o:
    "These changes contain a fix for overlayfs interacting with some
    (badly behaved) dentry code in various file systems. These have been
    reviewed by Al and the respective file system maintainers and are going
    through the ext4 tree for convenience.

    This also has a few ext4 encryption bug fixes that were discovered in
    Android testing (yes, we will need to get these sync'ed up with the
    fs/crypto code; I'll take care of that). It also has some bug fixes
    and a change to ignore the legacy quota options to allow for xfstests
    regression testing of ext4's internal quota feature and to be more
    consistent with how xfs handles this case"

    * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    ext4: ignore quota mount options if the quota feature is enabled
    ext4 crypto: fix some error handling
    ext4: avoid calling dquot_get_next_id() if quota is not enabled
    ext4: retry block allocation for failed DIO and DAX writes
    ext4: add lockdep annotations for i_data_sem
    ext4: allow readdir()'s of large empty directories to be interrupted
    btrfs: fix crash/invalid memory access on fsync when using overlayfs
    ext4 crypto: use dget_parent() in ext4_d_revalidate()
    ext4: use file_dentry()
    ext4: use dget_parent() in ext4_file_open()
    nfs: use file_dentry()
    fs: add file_dentry()
    ext4 crypto: don't let data integrity writebacks fail with ENOMEM
    ext4: check if in-inode xattr is corrupted in ext4_expand_extra_isize_ea()

    Linus Torvalds
     

05 Apr, 2016

2 commits

  • Mostly direct substitution with occasional adjustment or removing
    outdated comments.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
    ago with the promise that one day it would be possible to implement the
    page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it is unlikely that it ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE. And it's a constant source of confusion whether a
    PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
    especially on the border between fs and mm.

    Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

    - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    the script below. For some reason, coccinelle doesn't patch header
    files. I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds
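To see why the first two rules of the semantic patch are safe, here is a small userspace sketch. It assumes, as the message implies, that PAGE_CACHE_SHIFT always equalled PAGE_SHIFT; the value 12 (4 KiB pages) and the helper names `old_index` and `new_index` are illustrative assumptions, not kernel code.

```c
#include <assert.h>

/* Stand-ins for the kernel macros; 12 (4 KiB pages) is an assumption. */
#define PAGE_SHIFT        12
#define PAGE_SIZE         (1UL << PAGE_SHIFT)
#define PAGE_CACHE_SHIFT  PAGE_SHIFT   /* the old macro was just an alias */
#define PAGE_CACHE_SIZE   PAGE_SIZE

/* Old style: "convert" a page-cache index to a page index. */
static unsigned long old_index(unsigned long idx)
{
    return idx << (PAGE_CACHE_SHIFT - PAGE_SHIFT);   /* a shift by 0 */
}

/* New style, after the substitution drops the shift. */
static unsigned long new_index(unsigned long idx)
{
    return idx;
}
```

With both shifts equal, the shift amount is zero and the two helpers agree for every input, which is why the script can replace the whole shifted expression with the bare expression `E`.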

    Kirill A. Shutemov
     

04 Apr, 2016

1 commit

  • Previously, ext4 would fail the mount if the file system had the quota
    feature enabled and quota mount options (used for the older quota
    setups) were present. This broke xfstests, since xfs silently ignores
    the usrquota and grpquota mount options if they are specified. This
    commit changes things so that we are consistent with xfs; having the
    mount options specified is harmless, so there is no sense in breaking
    users by forbidding them.

    Cc: stable@vger.kernel.org
    Signed-off-by: Theodore Ts'o
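The behaviour change can be pictured with a small option check. This is a sketch under assumptions: the helper names, the `strict` switch that selects the old behaviour, and the use of -EINVAL as the failure code are all hypothetical and do not come from the ext4 source.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: is this one of the legacy quota mount options? */
static bool is_legacy_quota_opt(const char *opt)
{
    return strcmp(opt, "usrquota") == 0 || strcmp(opt, "grpquota") == 0;
}

/*
 * Returns 0 to accept the option, or a negative errno to fail the mount.
 * `strict` selects the old behaviour described above; -EINVAL is an
 * assumed error code for illustration.
 */
static int check_quota_opt(const char *opt, bool quota_feature, bool strict)
{
    if (!is_legacy_quota_opt(opt))
        return 0;
    if (quota_feature && strict)
        return -EINVAL;   /* old: refuse to mount */
    return 0;             /* new: the option is harmless and is ignored */
}
```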

    Theodore Ts'o
     

03 Apr, 2016

1 commit


02 Apr, 2016

1 commit


01 Apr, 2016

2 commits

  • Currently, if block allocation for a DIO or DAX write fails due to
    ENOSPC, we just return the error to userspace. However, these ENOSPC
    errors can be transient because the transaction freeing blocks has not
    yet committed. This manifests as failures of the generic/102 test when
    the filesystem is mounted with the 'dax' mount option.

    Fix the problem by properly retrying the allocation in case of an
    ENOSPC error in the get blocks functions used for direct IO.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o
    Tested-by: Ross Zwisler

    Jan Kara
     
  • With the internal Quota feature, mke2fs creates empty quota inodes and
    quota usage tracking is enabled as soon as the file system is mounted.
    Since quotacheck is no longer preallocating all of the blocks in the
    quota inode that are likely needed to be written to, we are now seeing
    a lockdep false positive caused by needing to allocate a quota block
    from inside ext4_map_blocks(), while holding i_data_sem for a data
    inode. This results in this complaint:

    Possible unsafe locking scenario:

           CPU0                    CPU1
           ----                    ----
      lock(&ei->i_data_sem);
                                   lock(&s->s_dquot.dqio_mutex);
                                   lock(&ei->i_data_sem);
      lock(&s->s_dquot.dqio_mutex);

    Google-Bug-Id: 27907753

    Signed-off-by: Theodore Ts'o
    Cc: stable@vger.kernel.org

    Theodore Ts'o
     

31 Mar, 2016

1 commit

  • If a directory has a large number of empty blocks, iterating over all
    of them can take a long time, leading to scheduler warnings and users
    getting irritated when they can't kill a process in the middle of one
    of these long-running readdir operations. Fix this by adding checks to
    ext4_readdir() and ext4_htree_fill_tree().

    Reported-by: Benjamin LaHaise
    Google-Bug-Id: 27880676
    Signed-off-by: Theodore Ts'o

    Theodore Ts'o
     

27 Mar, 2016

4 commits

  • This avoids potential problems caused by a race where the inode gets
    renamed out from its parent directory and the parent directory is
    deleted while ext4_d_revalidate() is running.

    Fixes: 28b4c263961c
    Reported-by: Al Viro
    Signed-off-by: Theodore Ts'o
    Cc: stable@vger.kernel.org

    Theodore Ts'o
     
  • EXT4 may be used as lower layer of overlayfs and accessing f_path.dentry
    can lead to a crash.

    Fix by replacing direct access of file->f_path.dentry with the
    file_dentry() accessor, which will always return a native object.

    Reported-by: Daniel Axtens
    Fixes: 4bacc9c9234c ("overlayfs: Make f_path always point to the overlay and f_inode to the underlay")
    Fixes: ff978b09f973 ("ext4 crypto: move context consistency check to ext4_file_open()")
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Theodore Ts'o
    Cc: David Howells
    Cc: Al Viro
    Cc: # v4.5

    Miklos Szeredi
     
  • In f_op->open() the lock on the parent is not held, so there's no
    guarantee that the parent dentry won't go away at any time.

    Even after this patch there's no guarantee that 'dir' will stay the parent
    of 'inode', but at least it won't be freed while being used.

    Fixes: ff978b09f973 ("ext4 crypto: move context consistency check to ext4_file_open()")
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Theodore Ts'o
    Cc: # v4.5

    Miklos Szeredi
     
  • We don't want the writeback triggered from the journal commit (in
    data=writeback mode) to cause the journal to abort due to
    generic_writepages() returning an ENOMEM error. In addition, if
    fsync() fails with ENOMEM, most applications will probably not do the
    right thing.

    So if we are doing a data integrity sync, and ext4_encrypt() returns
    ENOMEM, we will submit any queued I/O to date, and then retry the
    allocation using GFP_NOFAIL.

    Google-Bug-Id: 27641567

    Signed-off-by: Theodore Ts'o
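The fallback described above (on ENOMEM during a data integrity sync, submit what is queued, then retry with an allocation that cannot fail) can be modelled in userspace. This is only a sketch: `toy_writer`, `try_alloc`, `alloc_nofail` and `get_bounce_page` are hypothetical names, and the busy retry loop is a stand-in for GFP_NOFAIL semantics rather than how the kernel allocator works.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

struct toy_writer {
    int fail_budget;    /* how many allocations fail before one succeeds */
    int queued_io;      /* pages queued but not yet submitted */
    int submitted_io;   /* pages handed to the "device" */
};

/* Fallible allocation, standing in for a normal GFP allocation. */
static void *try_alloc(struct toy_writer *w, size_t n)
{
    if (w->fail_budget > 0) {
        w->fail_budget--;
        return NULL;
    }
    return malloc(n);
}

/* Stand-in for GFP_NOFAIL: keep trying until the allocation succeeds. */
static void *alloc_nofail(struct toy_writer *w, size_t n)
{
    void *p;

    while ((p = try_alloc(w, n)) == NULL)
        ;
    return p;
}

static void submit_queued(struct toy_writer *w)
{
    w->submitted_io += w->queued_io;
    w->queued_io = 0;
}

/* Returns 0 and sets *out, or -ENOMEM when failing is acceptable. */
static int get_bounce_page(struct toy_writer *w, bool integrity_sync, void **out)
{
    *out = try_alloc(w, 64);
    if (*out)
        return 0;
    if (!integrity_sync)
        return -ENOMEM;        /* background writeback may simply fail */
    submit_queued(w);          /* flush what we have so far */
    *out = alloc_nofail(w, 64);
    return 0;
}
```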

    Theodore Ts'o
     

23 Mar, 2016

2 commits


22 Mar, 2016

2 commits

  • Pull UDF and quota updates from Jan Kara:
    "This contains a rewrite of UDF handling of filename encoding to fix
    remaining overflow issues from Andrew Gabbasov and quota changes to
    support new Q_[X]GETNEXTQUOTA quotactl for VFS quota formats"

    * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
    quota: Fix possible GPF due to uninitialised pointers
    ext4: Make Q_GETNEXTQUOTA work for quota in hidden inodes
    quota: Forbid Q_GETQUOTA and Q_GETNEXTQUOTA for frozen filesystem
    quota: Fix possible races during quota loading
    ocfs2: Implement get_next_id()
    quota_v2: Implement get_next_id() for V2 quota format
    quota: Add support for ->get_nextdqblk() for VFS quota
    udf: Merge linux specific translation into CS0 conversion function
    udf: Remove struct ustr as non-needed intermediate storage
    udf: Use separate buffer for copying split names
    udf: Adjust UDF_NAME_LEN to better reflect actual restrictions
    udf: Join functions for UTF8 and NLS conversions
    udf: Parameterize output length in udf_put_filename
    quota: Allow Q_GETQUOTA for frozen filesystem
    quota: Fixup comments about return value of Q_[X]GETNEXTQUOTA

    Linus Torvalds
     
  • Pull xfs updates from Dave Chinner:
    "There's quite a lot in this request, and there's some cross-over with
    ext4, dax and quota code due to the nature of the changes being made.

    As for the rest of the XFS changes, there are lots of little things
    all over the place, which add up to a lot of changes in the end.

    The major changes are that we've reduced the size of the struct
    xfs_inode by ~100 bytes (gives an inode cache footprint reduction of
    >10%), the writepage code now only does a single set of mapping tree
    lookups so uses less CPU, delayed allocation reservations won't
    overrun under random write loads anymore, and we added compile time
    verification for on-disk structure sizes so we find out when a commit
    or platform/compiler change breaks the on disk structure as early as
    possible.

    Change summary:

    - error propagation for direct IO failures fixes for both XFS and
    ext4
    - new quota interfaces and XFS implementation for iterating all the
    quota IDs in the filesystem
    - locking fixes for real-time device extent allocation
    - reduction of duplicate information in the xfs and vfs inode, saving
    roughly 100 bytes of memory per cached inode.
    - buffer flag cleanup
    - rework of the writepage code to use the generic write clustering
    mechanisms
    - several fixes for inode flag based DAX enablement
    - rework of remount option parsing
    - compile time verification of on-disk format structure sizes
    - delayed allocation reservation overrun fixes
    - lots of little error handling fixes
    - small memory leak fixes
    - enable xfsaild freezing again"

    * tag 'xfs-for-linus-4.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (66 commits)
    xfs: always set rvalp in xfs_dir2_node_trim_free
    xfs: ensure committed is initialized in xfs_trans_roll
    xfs: borrow indirect blocks from freed extent when available
    xfs: refactor delalloc indlen reservation split into helper
    xfs: update freeblocks counter after extent deletion
    xfs: debug mode forced buffered write failure
    xfs: remove impossible condition
    xfs: check sizes of XFS on-disk structures at compile time
    xfs: ioends require logically contiguous file offsets
    xfs: use named array initializers for log item dumping
    xfs: fix computation of inode btree maxlevels
    xfs: reinitialise per-AG structures if geometry changes during recovery
    xfs: remove xfs_trans_get_block_res
    xfs: fix up inode32/64 (re)mount handling
    xfs: fix format specifier , should be %llx and not %llu
    xfs: sanitize remount options
    xfs: convert mount option parsing to tokens
    xfs: fix two memory leaks in xfs_attr_list.c error paths
    xfs: XFS_DIFLAG2_DAX limited by PAGE_SIZE
    xfs: dynamically switch modes when XFS_DIFLAG2_DAX is set/cleared
    ...
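The "compile time verification of on-disk format structure sizes" item above can be illustrated with a C11 static assertion. This is a generic sketch, not the XFS mechanism: the structure `toy_ondisk_hdr` and its expected size of 16 bytes are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical on-disk header: 4 + 4 + 8 bytes, no padding expected. */
struct toy_ondisk_hdr {
    uint32_t magic;
    uint32_t crc;
    uint64_t blkno;
};

/* The build fails if a compiler or platform change alters the layout. */
_Static_assert(sizeof(struct toy_ondisk_hdr) == 16,
               "toy_ondisk_hdr must be exactly 16 bytes on disk");
```

The check costs nothing at run time; a layout mismatch is reported when the file is compiled rather than when a filesystem is mounted.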

    Linus Torvalds
     

18 Mar, 2016

2 commits

  • Pull ext4 updates from Ted Ts'o:
    "Performance improvements in SEEK_DATA and xattr scalability
    improvements, plus a lot of clean ups and bug fixes"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (38 commits)
    ext4: clean up error handling in the MMP support
    jbd2: do not fail journal because of frozen_buffer allocation failure
    ext4: use __GFP_NOFAIL in ext4_free_blocks()
    ext4: fix compile error while opening the macro DOUBLE_CHECK
    ext4: print ext4 mount option data_err=abort correctly
    ext4: fix NULL pointer dereference in ext4_mark_inode_dirty()
    ext4: drop unneeded BUFFER_TRACE in ext4_delete_inline_entry()
    ext4: fix misspellings in comments.
    jbd2: fix FS corruption possibility in jbd2_journal_destroy() on umount path
    ext4: more efficient SEEK_DATA implementation
    ext4: cleanup handling of bh->b_state in DAX mmap
    ext4: return hole from ext4_map_blocks()
    ext4: factor out determining of hole size
    ext4: fix setting of referenced bit in ext4_es_lookup_extent()
    ext4: remove i_ioend_count
    ext4: simplify io_end handling for AIO DIO
    ext4: move trans handling and completion deferal out of _ext4_get_block
    ext4: rename and split get blocks functions
    ext4: use i_mutex to serialize unaligned AIO DIO
    ext4: pack ioend structure better
    ...

    Linus Torvalds
     
  • Pull crypto update from Herbert Xu:
    "Here is the crypto update for 4.6:

    API:
    - Convert remaining crypto_hash users to shash or ahash, also convert
    blkcipher/ablkcipher users to skcipher.
    - Remove crypto_hash interface.
    - Remove crypto_pcomp interface.
    - Add crypto engine for async cipher drivers.
    - Add akcipher documentation.
    - Add skcipher documentation.

    Algorithms:
    - Rename crypto/crc32 to avoid name clash with lib/crc32.
    - Fix bug in keywrap where we zero the wrong pointer.

    Drivers:
    - Support T5/M5, T7/M7 SPARC CPUs in n2 hwrng driver.
    - Add PIC32 hwrng driver.
    - Support BCM6368 in bcm63xx hwrng driver.
    - Pack structs for 32-bit compat users in qat.
    - Use crypto engine in omap-aes.
    - Add support for sama5d2x SoCs in atmel-sha.
    - Make atmel-sha available again.
    - Make sahara hashing available again.
    - Make ccp hashing available again.
    - Make sha1-mb available again.
    - Add support for multiple devices in ccp.
    - Improve DMA performance in caam.
    - Add hashing support to rockchip"

    * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (116 commits)
    crypto: qat - remove redundant arbiter configuration
    crypto: ux500 - fix checks of error code returned by devm_ioremap_resource()
    crypto: atmel - fix checks of error code returned by devm_ioremap_resource()
    crypto: qat - Change the definition of icp_qat_uof_regtype
    hwrng: exynos - use __maybe_unused to hide pm functions
    crypto: ccp - Add abstraction for device-specific calls
    crypto: ccp - CCP versioning support
    crypto: ccp - Support for multiple CCPs
    crypto: ccp - Remove check for x86 family and model
    crypto: ccp - memset request context to zero during import
    lib/mpi: use "static inline" instead of "extern inline"
    lib/mpi: avoid assembler warning
    hwrng: bcm63xx - fix non device tree compatibility
    crypto: testmgr - allow rfc3686 aes-ctr variants in fips mode.
    crypto: qat - The AE id should be less than the maximal AE number
    lib/mpi: Endianness fix
    crypto: rockchip - add hash support for crypto engine in rk3288
    crypto: xts - fix compile errors
    crypto: doc - add skcipher API documentation
    crypto: doc - update AEAD AD handling
    ...

    Linus Torvalds
     

14 Mar, 2016

3 commits

  • There is a memory leak because neither the caller, kmmpd(), nor the
    callee, read_mmp_block(), releases bh_check (i.e. the buffer_head).
    This patch fixes the problem.

    [ Additional changes suggested by Andreas Dilger -- TYT ]

    Signed-off-by: Jadhav Vikram
    Signed-off-by: Theodore Ts'o

    vikram.jadhav07
     
  • This might be unexpected, but pages allocated for sbi->s_buddy_cache are
    charged to the current memory cgroup. So a GFP_NOFS allocation could
    fail if the current task has been killed by OOM or if the current memory
    cgroup has no free memory left. The block allocator cannot handle such
    failures here yet.

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Theodore Ts'o

    Konstantin Khlebnikov
     
  • The error is:
    fs/ext4/mballoc.c:475:43: error: 'struct ext4_group_info' has
    no member named 'bb_bitmap'.
    So the definition of the macro DOUBLE_CHECK should come before
    'struct ext4_group_info'. I fixed this and moved the macro
    AGGRESSIVE_CHECK along with it, because I think they should be together.

    Signed-off-by: Aihua Zhang
    Signed-off-by: Theodore Ts'o

    Aihua Zhang
     

13 Mar, 2016

2 commits

  • If the data_err=abort option is specified for an ext3/ext4 mount,
    /proc/mounts shows it as "(null)". This is caused by token2str()
    returning NULL for Opt_data_err_abort (due to its pattern containing
    '=').

    Signed-off-by: Ales Novak
    Signed-off-by: Theodore Ts'o
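How a token lookup can end up returning NULL for a pattern containing '=' is easy to reproduce in a toy table. The sketch below is a guess at the general shape of such a lookup, not the ext4 source: the table, the `OPT_*` names and `toy_token2str` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { OPT_BARRIER, OPT_DATA_ERR_ABORT, OPT_ERR };

struct toy_token {
    int token;
    const char *pattern;
};

static const struct toy_token tokens[] = {
    { OPT_BARRIER,        "barrier" },
    { OPT_DATA_ERR_ABORT, "data_err=abort" },
    { OPT_ERR,            NULL },            /* terminator */
};

/*
 * Patterns containing '=' are skipped, so a token whose only pattern has
 * an '=' falls through to the terminator and yields NULL.
 */
static const char *toy_token2str(int token)
{
    const struct toy_token *t;

    for (t = tokens; t->token != OPT_ERR; t++)
        if (t->token == token && !strchr(t->pattern, '='))
            break;
    return t->pattern;
}
```

Passing that NULL to a `%s` format is what typically shows up as "(null)" in the output.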

    Ales Novak
     
  • ext4_reserve_inode_write() in ext4_mark_inode_dirty() could fail on
    error (e.g. EIO) and iloc.bh can be NULL in this case. But the error is
    ignored in the following "if" condition and ext4_expand_extra_isize()
    might be called with iloc.bh set to NULL, which triggers a NULL pointer
    dereference.

    This was uncovered by commit 8b4953e13f4c ("ext4: reserve code points for
    the project quota feature"), which enlarges the ext4_inode size, and can
    be reproduced by running the following script on a new kernel with an
    old mke2fs:

    #!/bin/bash
    mnt=/mnt/ext4
    devname=ext4-error
    dev=/dev/mapper/$devname
    fsimg=/home/fs.img

    trap cleanup 0 1 2 3 9 15

    cleanup()
    {
        umount $mnt >/dev/null 2>&1
        dmsetup remove $devname
        losetup -d $backend_dev
        rm -f $fsimg
        exit 0
    }

    rm -f $fsimg
    fallocate -l 1g $fsimg
    backend_dev=`losetup -f --show $fsimg`
    devsize=`blockdev --getsz $backend_dev`

    good_tab="0 $devsize linear $backend_dev 0"
    error_tab="0 $devsize error $backend_dev 0"

    dmsetup create $devname --table "$good_tab"

    mkfs -t ext4 $dev
    mount -t ext4 -o errors=continue,strictatime $dev $mnt

    dmsetup load $devname --table "$error_tab" && dmsetup resume $devname
    echo 3 > /proc/sys/vm/drop_caches
    ls -l $mnt
    exit 0

    [ Patch changed to simplify the function a tiny bit. -- Ted ]

    Signed-off-by: Eryu Guan
    Signed-off-by: Theodore Ts'o
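The class of bug fixed here (an error return is ignored and a pointer that was never filled in gets dereferenced) can be shown with a minimal model. All names below (`toy_iloc`, `toy_reserve`, `toy_expand`, `toy_mark_dirty`) are hypothetical stand-ins, not the ext4 functions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct toy_iloc {
    int *bh;   /* stands in for the buffer_head pointer */
};

/* On I/O error the buffer pointer stays NULL and an error is returned. */
static int toy_reserve(struct toy_iloc *iloc, int *backing, int io_error)
{
    if (io_error) {
        iloc->bh = NULL;
        return -EIO;
    }
    iloc->bh = backing;
    return 0;
}

/* Dereferences iloc->bh, so it must never run after a failed reserve. */
static int toy_expand(struct toy_iloc *iloc)
{
    *iloc->bh += 1;
    return 0;
}

static int toy_mark_dirty(int *backing, int io_error)
{
    struct toy_iloc iloc;
    int err = toy_reserve(&iloc, backing, io_error);

    if (err)
        return err;    /* bail out before anything touches iloc.bh */
    return toy_expand(&iloc);
}
```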

    Eryu Guan
     

10 Mar, 2016

9 commits

  • BUFFER_TRACE info "call ext4_handle_dirty_metadata" doesn't match the
    code, so drop it.

    Signed-off-by: Geliang Tang
    Signed-off-by: Theodore Ts'o

    Geliang Tang
     
  • Signed-off-by: Adam Buchbinder
    Signed-off-by: Theodore Ts'o

    Adam Buchbinder
     
  • Using SEEK_DATA in a huge sparse file can easily lead to softlockups as
    ext4_seek_data() iterates over a hole block by block. Fix the problem by
    using the hole size returned from ext4_map_blocks() and thus skipping
    the hole in one go.

    Also update the SEEK_HOLE implementation to follow the same pattern as
    SEEK_DATA to make future maintenance easier.

    Furthermore, we add cond_resched() to both ext4_seek_data() and
    ext4_seek_hole() to avoid softlockups in case an evil user creates a huge
    fragmented file and we have to go through lots of extents.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o
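The difference between walking a hole block by block and jumping over it using the reported hole length can be sketched with a toy extent list. The types and functions below (`toy_extent`, `toy_map`, `toy_map_blocks`, `toy_seek_data`) are hypothetical; they only mimic the idea of a mapping call that reports how long the current run is.

```c
#include <assert.h>
#include <stddef.h>

struct toy_extent { unsigned long start, len; };   /* sorted, allocated runs */
struct toy_map    { unsigned long lblk, len; };

/*
 * Returns 1 if map->lblk is mapped and 0 if it lies in a hole. In both
 * cases map->len is set to the length of the run starting at map->lblk.
 */
static int toy_map_blocks(const struct toy_extent *ext, size_t n,
                          unsigned long end, struct toy_map *map)
{
    for (size_t i = 0; i < n; i++) {
        if (map->lblk < ext[i].start) {
            map->len = ext[i].start - map->lblk;          /* hole before extent */
            return 0;
        }
        if (map->lblk < ext[i].start + ext[i].len) {
            map->len = ext[i].start + ext[i].len - map->lblk;
            return 1;
        }
    }
    map->len = end - map->lblk;                           /* trailing hole */
    return 0;
}

/* First data block at or after `from`, or `end` if there is none. */
static unsigned long toy_seek_data(const struct toy_extent *ext, size_t n,
                                   unsigned long from, unsigned long end,
                                   unsigned long *lookups)
{
    struct toy_map map = { from, 0 };

    assert(lookups != NULL);
    *lookups = 0;
    while (map.lblk < end) {
        (*lookups)++;
        if (toy_map_blocks(ext, n, end, &map))
            return map.lblk;
        map.lblk += map.len;      /* skip the whole hole in one go */
    }
    return end;
}
```

With a single extent a million blocks into the file, the search needs two lookups instead of a million.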

    Jan Kara
     
  • ext4_dax_mmap_get_block() updates bh->b_state directly instead of using
    ext4_update_bh_state(). This is mostly a cosmetic issue since the DAX
    code always passes an on-stack buffer_head, but clean this up to make
    the code more uniform.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • Currently, ext4_map_blocks() just returns 0 when it finds a hole and
    allocation is not requested. However we have all the information
    available to tell how large the hole actually is and there are callers
    of ext4_map_blocks() which would save some block-by-block hole iteration
    if they knew this information. So fill in struct ext4_map_blocks even
    for holes with the information we have. We keep returning 0 for holes to
    maintain backward compatibility of the function.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • ext4_ext_put_gap_in_cache() determines hole size in the extent tree,
    then trims this with possible delayed allocated blocks, and inserts the
    result into the extent status tree. Factor out determination of the size
    of the hole in the extent tree as we will need this information in
    ext4_ext_map_blocks() as well.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • Pull ext4 fix from Ted Ts'o:
    "This fixes a regression which crept in v4.5-rc5"

    * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    ext4: iterate over buffer heads correctly in move_extent_per_page()

    Linus Torvalds
     
  • We were setting the referenced bit on the extent structure we return
    from ext4_es_lookup_extent(), which is just a private structure on the
    stack. Thus setting it had no effect. Set the bit in the structure in
    the status tree instead.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o
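Why setting a flag on a returned copy has no effect on the stored object can be shown in a few lines. This is a toy model: `toy_es`, the one-element `tree` array and both lookup functions are hypothetical and only mirror the shape of the bug.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_es {
    unsigned long lblk;
    bool referenced;
};

/* Stand-in for the status tree: a single stored entry. */
static struct toy_es tree[1] = { { 42, false } };

/* Buggy shape: the flag is set only on the caller's private copy. */
static void lookup_buggy(struct toy_es *out)
{
    *out = tree[0];
    out->referenced = true;     /* the stored entry never sees this */
}

/* Fixed shape: mark the stored entry, then hand out the copy. */
static void lookup_fixed(struct toy_es *out)
{
    tree[0].referenced = true;
    *out = tree[0];
}
```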

    Jan Kara
     
  • In commit bcff24887d00 ("ext4: don't read blocks from disk after extents
    being swapped") bh is not updated correctly in the for loop and wrong
    data has been written to disk. generic/324 catches this on sub-page
    block size ext4.

    Fixes: bcff24887d00 ("ext4: don't read blocks from disk after extents being swapped")
    Signed-off-by: Eryu Guan
    Signed-off-by: Theodore Ts'o

    Eryu Guan
     

09 Mar, 2016

5 commits

  • Remove counter of pending io ends as it is unused.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • When mapping blocks for direct IO, we allocate io_end structure before
    mapping blocks and store pointer to it in the inode. This creates a
    requirement that any AIO DIO using io_end must be protected by i_mutex.
    This created problems in the past with dioread_nolock mode which was
    corrupting io_end pointers. Also io_end is allocated unnecessarily in
    case where we don't need to convert any extents (which is a common case
    for example when overwriting file).

    We fix the problem by allocating io_end only once we return unwritten
    extent from block mapping function for AIO DIO (so we can save some
    pointless io_end allocations) and we pass pointer to it in bh->b_private
    which generic DIO code later passes to our end IO callback. That way we
    remove any need for global pointer to io_end structure and thus fix the
    races.

    The downside of this change is that the checking for unwritten IO in
    flight in ext4_extents_can_be_merged() is more racy since we now
    increment i_unwritten / set EXT4_STATE_DIO_UNWRITTEN only after dropping
    i_data_sem. However the check has been racy already before because
    ext4_writepages() already increments i_unwritten after dropping
    i_data_sem and reserved blocks save us from hitting ENOSPC in the worst
    case.

    Signed-off-by: Jan Kara

    Jan Kara
     
  • There is no need to handle starting of a transaction and deferral of DIO
    completion in _ext4_get_block() function. We can move this out to get
    block functions for direct IO that need it. That way we can add stricter
    checks verifying things work as we expect.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • Rename ext4_get_blocks_write() to ext4_get_blocks_unwritten() to better
    describe what it does. Also split out get blocks functions for direct
    IO. Later we move functionality from _ext4_get_blocks() there. There's no
    functional change in this patch.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • Currently we use a hashed aio_mutex to serialize unaligned AIO DIO.
    However, the code cleanups that have happened since 2011, when the lock
    was introduced, mean aio_mutex is acquired at almost the same places
    where we already have exclusion using i_mutex. So just use i_mutex for
    the exclusion of unaligned AIO DIO.

    The change moves waiting for pending unwritten extent conversion under
    i_mutex. That makes special handling of O_APPEND writes unnecessary and
    also avoids possible livelocking of unaligned AIO DIO with aligned ones
    (nothing prevented a continuous stream of aligned AIO DIOs from making
    an unaligned AIO DIO wait forever).

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara