23 Oct, 2012

2 commits

  • Pull ext4 fixes from Ted Ts'o:
    "Various bug fixes for ext4. The most serious of them fixes a security
    bug (CVE-2012-4508) which leads to stale data exposure when we have
    fallocate racing against writes to files undergoing delayed
    allocation. We also have two fixes for the metadata checksum feature,
    the most serious of which can cause the superblock to have an invalid
    checksum after a power failure."

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    ext4: Avoid underflow in ext4_trim_fs()
    ext4: Checksum the block bitmap properly with bigalloc enabled
    ext4: fix undefined bit shift result in ext4_fill_flex_info
    ext4: fix metadata checksum calculation for the superblock
    ext4: race-condition protection for ext4_convert_unwritten_extents_endio
    ext4: serialize fallocate with ext4_convert_unwritten_extents

    Linus Torvalds
     
  • Currently, if the len argument in ext4_trim_fs() is smaller than one block,
    the 'end' variable underflows. Avoid that by returning EINVAL if len is
    smaller than one file system block.

    Also remove useless unlikely().

    Signed-off-by: Lukas Czerner
    Signed-off-by: "Theodore Ts'o"
    Cc: stable@vger.kernel.org

    Lukas Czerner
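
The underflow described above can be illustrated with a small userspace sketch. This is not the kernel code: `trim_range_end()` and its parameters are hypothetical names chosen for illustration, and the arithmetic is only assumed to mirror how a byte length is converted to a block count.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical userspace helper (not the kernel function): compute the last
 * block of a trim range, rejecting lengths shorter than one block.
 */
static int trim_range_end(uint64_t start_byte, uint64_t len_bytes,
                          unsigned int blkbits, uint64_t *end_blk)
{
    uint64_t blocksize = (uint64_t)1 << blkbits;

    /*
     * Without this check, (len_bytes >> blkbits) is 0 and the "- 1" below
     * wraps around when the range starts at block 0.
     */
    if (len_bytes < blocksize)
        return -EINVAL;

    *end_blk = (start_byte >> blkbits) + (len_bytes >> blkbits) - 1;
    return 0;
}
```

With 4096-byte blocks (`blkbits` of 12), a 100-byte request is rejected, while an 8192-byte request starting at offset 0 ends at block 1.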
     

22 Oct, 2012

1 commit

  • In mke2fs, we checksum only the whole bitmap block, which is correct.
    In the kernel, however, we use EXT4_BLOCKS_PER_GROUP to indicate the
    size of the checksummed bitmap, which is wrong when bigalloc is enabled.
    The right size is EXT4_CLUSTERS_PER_GROUP, and this patch fixes
    it.

    Also, since every caller of ext4_block_bitmap_csum_set and
    ext4_block_bitmap_csum_verify passes in EXT4_BLOCKS_PER_GROUP(sb)/8,
    remove this parameter and compute the size in the function itself.

    Signed-off-by: Tao Ma
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Lukas Czerner
    Cc: stable@vger.kernel.org

    Tao Ma
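
A rough sketch of why the size matters, assuming that with bigalloc each bitmap bit covers one cluster of `1 << cluster_bits` blocks. The function names and the example figures are illustrative assumptions, not values taken from the patch.

```c
#include <stdint.h>

/* Bytes of bitmap covered when (incorrectly) sizing by blocks per group. */
static uint32_t csum_bytes_by_blocks(uint32_t blocks_per_group)
{
    return blocks_per_group / 8;
}

/*
 * Bytes of bitmap covered when sizing by clusters per group. One bit in the
 * bitmap tracks one cluster, so this is the size that actually exists.
 */
static uint32_t csum_bytes_by_clusters(uint32_t blocks_per_group,
                                       unsigned int cluster_bits)
{
    uint32_t clusters_per_group = blocks_per_group >> cluster_bits;

    return clusters_per_group / 8;
}
```

For a hypothetical group of 524288 blocks with 16-block clusters, sizing by blocks asks for 65536 bytes, far more than a 4096-byte bitmap block holds, while sizing by clusters gives 4096 bytes. Without bigalloc (`cluster_bits` of 0) the two agree.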
     

16 Oct, 2012

1 commit

  • The result of the bit shift expression in
    '1 << sbi->s_log_groups_per_flex' can be undefined in the case that
    s_log_groups_per_flex is 31 because the result of the shift is bigger
    than INT_MAX. In reality this probably should not cause many problems
    since we'll end up with INT_MIN which will then be converted into
    'unsigned int' type, but nevertheless according to ISO C99 the
    result is actually undefined.

    Fix this by changing the left operand to 'unsigned int' type.

    Note that the commit d50f2ab6f050311dbf7b8f5501b25f0bf64a439b already
    tried to fix the undefined behaviour, but this was missed.

    Thanks to Laszlo Ersek for pointing this out and suggesting the fix.

    Signed-off-by: Lukas Czerner
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Carlos Maiolino
    Reported-by: Laszlo Ersek

    Lukas Czerner
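
A minimal sketch of the fix described above, assuming a 32-bit unsigned type. `groups_per_flex()` is a hypothetical helper name; the point is only that the left operand of the shift is unsigned.

```c
#include <stdint.h>

/*
 * With a signed left operand, "1 << 31" exceeds INT_MAX on 32-bit int and is
 * undefined in ISO C99. An unsigned left operand makes the shift well defined
 * for shift counts 0..31.
 */
static uint32_t groups_per_flex(unsigned int log_groups_per_flex)
{
    return (uint32_t)1 << log_groups_per_flex;
}
```

Shift counts of 32 or more remain undefined for a 32-bit operand, so callers still need to validate the on-disk value.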
     

10 Oct, 2012

2 commits

  • The function ext4_handle_dirty_super() was calculating the superblock
    checksum on the wrong block data. As a result, when the superblock is
    modified while the file system is mounted (most commonly, when inodes
    are added to or removed from the orphan list), the superblock checksum
    would be wrong. We didn't notice because the checksum *was* being
    correctly calculated
    in ext4_commit_super(), and this would get called when the file system
    was unmounted. So the problem only became obvious if the system
    crashed while the file system was mounted.

    Fix this by changing the poorly designed function signature of
    ext4_superblock_csum_set(); if it had taken only a single argument, the
    pointer to a struct super_block, the ambiguity which caused this
    mistake would have been impossible.

    Reported-by: George Spelvin
    Signed-off-by: "Theodore Ts'o"
    Cc: stable@vger.kernel.org

    Theodore Ts'o
     
  • We assumed that at the time we call ext4_convert_unwritten_extents_endio()
    the extent in question is fully inside [map.m_lblk, map->m_len] because
    it was already split during submission. But this may not be true due to
    a race between writeback and fallocate.

    If the extent in question is larger than requested, we will split it again.
    Special precautions must be taken if zeroout is required, because
    [map.m_lblk, map->m_len] already contains valid data.

    Signed-off-by: Dmitry Monakhov
    Signed-off-by: "Theodore Ts'o"
    Cc: stable@vger.kernel.org

    Dmitry Monakhov
     

09 Oct, 2012

1 commit

  • Move actual pte filling for non-linear file mappings into the new special
    vma operation: ->remap_pages().

    Filesystems must implement this method to get non-linear mapping support;
    if a filesystem uses filemap_fault(), then generic_file_remap_pages() can
    be used.

    Now device drivers can implement this method and obtain nonlinear vma support.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Chris Metcalf #arch/tile
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: H. Peter Anvin
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Matt Helsley
    Cc: Nick Piggin
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Suresh Siddha
    Cc: Tetsuo Handa
    Cc: Venkatesh Pallipadi
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

08 Oct, 2012

1 commit

  • Pull ext4 updates from Ted Ts'o:
    "The big new feature added this time is supporting online resizing
    using the meta_bg feature. This allows us to resize file systems
    which are greater than 16TB. In addition, the speed of online
    resizing has been improved in general.

    We also fix a number of races, some of which could lead to deadlocks,
    in ext4's Asynchronous I/O and online defrag support, thanks to good
    work by Dmitry Monakhov.

    There are also a large number of more minor bug fixes and cleanups
    from a number of other ext4 contributors, quite a few of whom have
    submitted fixes for the first time."

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (69 commits)
    ext4: fix ext4_flush_completed_IO wait semantics
    ext4: fix mtime update in nodelalloc mode
    ext4: fix ext_remove_space for punch_hole case
    ext4: punch_hole should wait for DIO writers
    ext4: serialize truncate with owerwrite DIO workers
    ext4: endless truncate due to nonlocked dio readers
    ext4: serialize unlocked dio reads with truncate
    ext4: serialize dio nonlocked reads with defrag workers
    ext4: completed_io locking cleanup
    ext4: fix unwritten counter leakage
    ext4: give i_aiodio_unwritten a more appropriate name
    ext4: ext4_inode_info diet
    ext4: convert to use leXX_add_cpu()
    ext4: ext4_bread usage audit
    fs: reserve fallocate flag codepoint
    ext4: remove redundant offset check in mext_check_arguments()
    ext4: don't clear orphan list on ro mount with errors
    jbd2: fix assertion failure in commit code due to lacking transaction credits
    ext4: release donor reference when EXT4_IOC_MOVE_EXT ioctl fails
    ext4: enable FITRIM ioctl on bigalloc file system
    ...

    Linus Torvalds
     

05 Oct, 2012

2 commits

  • Fallocate should wait for pending ext4_convert_unwritten_extents(),
    otherwise the following race may happen:

    ftruncate( ,12288);
    fallocate( ,0, 4096);
    io_submit( ,0, 4096); /* Write to fallocated area, split extent if needed */
    fallocate( ,0, 8192); /* Grow extent and break the assumption about extent */

    Later kwork completion will do:
    ->ext4_convert_unwritten_extents (0, 4096)
    ->ext4_map_blocks(handle, inode, &map, EXT4_GET_BLOCKS_IO_CONVERT_EXT);
    ->ext4_ext_map_blocks() /* Will find new extent: ex = [0,2] !!!!!! */
    ->ext4_ext_handle_uninitialized_extents()
    ->ext4_convert_unwritten_extents_endio()
    /* convert [0,2] extent to initialized, but only[0,1] was written */

    Signed-off-by: Dmitry Monakhov
    Signed-off-by: "Theodore Ts'o"

    Dmitry Monakhov
     
  • BUG #1) All places where we call ext4_flush_completed_IO are broken
    because buffered io and DIO/AIO go through three stages:
    1) submitted io,
    2) completed io (in i_completed_io_list), conversion pending
    3) finished io (conversion done)
    By calling ext4_flush_completed_IO we flush only the
    requests which are in stage (2), which is wrong because:
    1) punch_hole and truncate _must_ wait for all outstanding unwritten io
    regardless of its state.
    2) fsync and nolock_dio_read should also wait because there is
    a time window between end_page_writeback() and ext4_add_complete_io()
    As a result, integrity fsync is broken in the case of a buffered write
    to a fallocated region:
    fsync                               blkdev_completion
    ->filemap_write_and_wait_range
                                        ->ext4_end_bio
                                        ->end_page_writeback
    ext4_flush_completed_IO
    sees empty i_completed_io_list but a
    pending conversion still exists
                                        ->ext4_add_complete_io

    BUG #2) The race window becomes wider due to the 'ext4: completed_io
    locking cleanup V4' patch series

    This patch makes the following changes:
    1) ext4_flush_completed_io() now first tries to flush completed io and then
    waits for any outstanding unwritten io via ext4_unwritten_wait()
    2) Rename the function to a more appropriate name.
    3) Assert that all callers of ext4_flush_unwritten_io hold i_mutex to
    prevent an endless wait

    Signed-off-by: Dmitry Monakhov
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Jan Kara

    Dmitry Monakhov
     

03 Oct, 2012

3 commits

  • Pull vfs update from Al Viro:

    "- big one - consolidation of descriptor-related logics; almost all of
    that is moved to fs/file.c

    (BTW, I'm seriously tempted to rename the result to fd.c. As it is,
    we have a situation when file_table.c is about handling of struct
    file and file.c is about handling of descriptor tables; the reasons
    are historical - file_table.c used to be about a static array of
    struct file we used to have way back).

    A lot of stray ends got cleaned up and converted to saner primitives,
    disgusting mess in android/binder.c is still disgusting, but at least
    doesn't poke so much in descriptor table guts anymore. A bunch of
    relatively minor races got fixed in process, plus an ext4 struct file
    leak.

    - related thing - fget_light() partially unuglified; see fdget() in
    there (and yes, it generates the code as good as we used to have).

    - also related - bits of Cyrill's procfs stuff that got entangled into
    that work; _not_ all of it, just the initial move to fs/proc/fd.c and
    switch of fdinfo to seq_file.

    - Alex's fs/coredump.c splitoff - the same story, had been easier to
    take that commit than mess with conflicts. The rest is a separate
    pile, this was just a mechanical code movement.

    - a few misc patches all over the place. Not all for this cycle,
    there'll be more (and quite a few currently sit in akpm's tree)."

    Fix up trivial conflicts in the android binder driver, and some fairly
    simple conflicts due to two different changes to the sock_alloc_file()
    interface ("take descriptor handling from sock_alloc_file() to callers"
    vs "net: Providing protocol type via system.sockprotoname xattr of
    /proc/PID/fd entries" adding a dentry name to the socket)

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits)
    MAX_LFS_FILESIZE should be a loff_t
    compat: fs: Generic compat_sys_sendfile implementation
    fs: push rcu_barrier() from deactivate_locked_super() to filesystems
    btrfs: reada_extent doesn't need kref for refcount
    coredump: move core dump functionality into its own file
    coredump: prevent double-free on an error path in core dumper
    usb/gadget: fix misannotations
    fcntl: fix misannotations
    ceph: don't abuse d_delete() on failure exits
    hypfs: ->d_parent is never NULL or negative
    vfs: delete surplus inode NULL check
    switch simple cases of fget_light to fdget
    new helpers: fdget()/fdput()
    switch o2hb_region_dev_write() to fget_light()
    proc_map_files_readdir(): don't bother with grabbing files
    make get_file() return its argument
    vhost_set_vring(): turn pollstart/pollstop into bool
    switch prctl_set_mm_exe_file() to fget_light()
    switch xfs_find_handle() to fget_light()
    switch xfs_swapext() to fget_light()
    ...

    Linus Torvalds
     
  • There's no reason to call rcu_barrier() on every
    deactivate_locked_super(). We only need to make sure that all delayed rcu
    free inodes are flushed before we destroy related cache.

    Removing rcu_barrier() from deactivate_locked_super() affects some fast
    paths. E.g. on my machine exit_group() of a last process in IPC
    namespace takes 0.07538s. rcu_barrier() takes 0.05188s of that time.

    Signed-off-by: Kirill A. Shutemov
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Kirill A. Shutemov
     
  • Pull user namespace changes from Eric Biederman:
    "This is a mostly modest set of changes to enable basic user namespace
    support. This allows the code to compile with user namespaces
    enabled and removes the assumption there is only the initial user
    namespace. Everything is converted except for the most complex of the
    filesystems: autofs4, 9p, afs, ceph, cifs, coda, fuse, gfs2, ncpfs,
    nfs, ocfs2 and xfs as those patches need a bit more review.

    The strategy is to push kuid_t and kgid_t values as far down into
    subsystems and filesystems as reasonable. Leaving the make_kuid and
    from_kuid operations to happen at the edge of userspace, as the values
    come off the disk, and as the values come in from the network.
    Letting the type-incompatibility compile errors (present when user
    namespaces are enabled) guide me to find the issues.

    The most tricky areas have been the places where we had an implicit
    union of uid and gid values and were storing them in an unsigned int.
    Those places were converted into explicit unions. I made certain to
    handle those places with simple trivial patches.

    Out of that work I discovered we have generic interfaces for storing
    quota by projid. I had never heard of the project identifiers before.
    Adding full user namespace support for project identifiers accounts
    for most of the code size growth in my git tree.

    Ultimately there will be work to relax privilege checks from
    "capable(FOO)" to "ns_capable(user_ns, FOO)" where it is safe allowing
    root in a user namespace to do those things that today we only forbid to
    non-root users because it will confuse suid root applications.

    While I was pushing kuid_t and kgid_t changes deep into the audit code
    I made a few other cleanups. I capitalized on the fact we process
    netlink messages in the context of the message sender. I removed
    usage of NETLINK_CRED, and started directly using current->tty.

    Some of these patches have also made it into maintainer trees, with no
    problems from identical code from different trees showing up in
    linux-next.

    After reading through all of this code I feel like I might be able to
    win a game of kernel trivial pursuit."

    Fix up some fairly trivial conflicts in netfilter uid/gid logging code.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (107 commits)
    userns: Convert the ufs filesystem to use kuid/kgid where appropriate
    userns: Convert the udf filesystem to use kuid/kgid where appropriate
    userns: Convert ubifs to use kuid/kgid
    userns: Convert squashfs to use kuid/kgid where appropriate
    userns: Convert reiserfs to use kuid and kgid where appropriate
    userns: Convert jfs to use kuid/kgid where appropriate
    userns: Convert jffs2 to use kuid and kgid where appropriate
    userns: Convert hpfs to use kuid and kgid where appropriate
    userns: Convert btrfs to use kuid/kgid where appropriate
    userns: Convert bfs to use kuid/kgid where appropriate
    userns: Convert affs to use kuid/kgid wherwe appropriate
    userns: On alpha modify linux_to_osf_stat to use convert from kuids and kgids
    userns: On ia64 deal with current_uid and current_gid being kuid and kgid
    userns: On ppc convert current_uid from a kuid before printing.
    userns: Convert s390 getting uid and gid system calls to use kuid and kgid
    userns: Convert s390 hypfs to use kuid and kgid where appropriate
    userns: Convert binder ipc to use kuids
    userns: Teach security_path_chown to take kuids and kgids
    userns: Add user namespace support to IMA
    userns: Convert EVM to deal with kuids and kgids in it's hmac computation
    ...

    Linus Torvalds
     

02 Oct, 2012

1 commit

  • Pull the trivial tree from Jiri Kosina:
    "Tiny usual fixes all over the place"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits)
    doc: fix old config name of kprobetrace
    fs/fs-writeback.c: cleanup riteback_sb_inodes kerneldoc
    btrfs: fix the commment for the action flags in delayed-ref.h
    btrfs: fix trivial typo for the comment of BTRFS_FREE_INO_OBJECTID
    vfs: fix kerneldoc for generic_fh_to_parent()
    treewide: fix comment/printk/variable typos
    ipr: fix small coding style issues
    doc: fix broken utf8 encoding
    nfs: comment fix
    platform/x86: fix asus_laptop.wled_type module parameter
    mfd: printk/comment fixes
    doc: getdelays.c: remember to close() socket on error in create_nl_socket()
    doc: aliasing-test: close fd on write error
    mmc: fix comment typos
    dma: fix comments
    spi: fix comment/printk typos in spi
    Coccinelle: fix typo in memdup_user.cocci
    tmiofb: missing NULL pointer checks
    tools: perf: Fix typo in tools/perf
    tools/testing: fix comment / output typos
    ...

    Linus Torvalds
     

01 Oct, 2012

3 commits

  • Commits 5e8830dc85d0 and 41c4d25f78c0 introduced a regression into
    v3.6-rc1 for ext4 in nodelalloc mode, such that mtime updates would not
    take place for files modified via mmap if the page was already in the
    page cache. This would also affect ext3 file systems mounted using
    the ext4 file system driver.

    The problem was that ext4_page_mkwrite() had a shortcut which would
    avoid calling __block_page_mkwrite() under some circumstances, and the
    above two commits transferred the responsibility of calling
    file_update_time() to __block_page_mkwrite --- which wouldn't get
    called in some circumstances.

    Since __block_page_mkwrite() only has three callers,
    block_page_mkwrite(), ext4_page_mkwrite, and nilfs_page_mkwrite(), the
    best way to solve this is to move the responsibility for calling
    file_update_time() to its callers.

    This problem was found via xfstests #215 with a file system mounted
    with -o nodelalloc.

    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Jan Kara
    Cc: KONISHI Ryusuke
    Cc: stable@vger.kernel.org

    Theodore Ts'o
     
  • An inode is allowed to have an empty leaf only if it is a blockless inode.

    Signed-off-by: Dmitry Monakhov
    Signed-off-by: "Theodore Ts'o"

    Dmitry Monakhov
     
  • punch_hole is the place where we have to wait for all existing writers
    (writeback, aio, dio), but currently we simply flush pending end_io requests,
    which is not sufficient. Another issue is that punch_hole is performed without
    i_mutex held, which can result in dangerous data corruption due to
    write-after-free.

    This patch makes the following changes:
    - Guard punch_hole with i_mutex
    - Recheck inode flags under i_mutex
    - Block all new dio readers in order to prevent an information leak caused by
    the read-after-free pattern.
    - punch_hole now waits for all writers in flight
    NOTE: XXX a write-after-free race is still possible because new dirty pages
    may appear due to mmap(), and currently there is no easy way to stop
    writeback while punch_hole is in progress.

    [ Fixed error return from ext4_ext_punch_hole() to make sure that we
    release i_mutex before returning EPERM or ETXTBSY -- Ted ]

    Signed-off-by: Dmitry Monakhov
    Signed-off-by: "Theodore Ts'o"

    Dmitry Monakhov
     

29 Sep, 2012

8 commits

  • Jan Kara has spotted an interesting issue:
    There is a potential data corruption issue with direct IO overwrites
    racing with truncate, for example:
    dio write                           truncate_task
    ->ext4_ext_direct_IO
    ->overwrite == 1
    ->down_read(&EXT4_I(inode)->i_data_sem);
    ->mutex_unlock(&inode->i_mutex);
                                        ->ext4_setattr()
                                        ->inode_dio_wait()
                                        ->truncate_setsize()
                                        ->ext4_truncate()
                                        ->down_write(&EXT4_I(inode)->i_data_sem);
    ->__blockdev_direct_IO
    ->ext4_get_block
    ->submit_io()
    ->up_read(&EXT4_I(inode)->i_data_sem);
                                        # truncate data blocks, allocate them to
                                        # other inode - bad stuff happens because
                                        # dio is still in flight.

    In order to serialize with truncate, the dio worker should grab an extra
    i_dio_count reference before dropping i_mutex.

    Reviewed-by: Jan Kara
    Signed-off-by: Dmitry Monakhov
    Signed-off-by: "Theodore Ts'o"

    Dmitry Monakhov
     
  • If there are enough aggressive DIO readers, truncate and other dio
    waiters will wait forever inside inode_dio_wait(). It is reasonable
    to disable the nonlocked DIO read optimization during truncate.

    Reviewed-by: Jan Kara
    Signed-off-by: Dmitry Monakhov
    Signed-off-by: "Theodore Ts'o"

    Dmitry Monakhov
     
  • The current serialization works only for DIO which holds
    i_mutex, but with nonlocked DIO the following race is possible:

    dio_nolock_read_task                truncate_task
                                        ->ext4_setattr()
                                        ->inode_dio_wait()
    ->ext4_ext_direct_IO
    ->ext4_ind_direct_IO
    ->__blockdev_direct_IO
    ->ext4_get_block
                                        ->truncate_setsize()
                                        ->ext4_truncate()
                                        #alloc truncated blocks
                                        #to other inode
    ->submit_io()
    #INFORMATION LEAK

    In order to serialize with unlocked DIO reads we have to
    rearrange the wait sequence:
    1) update i_size first
    2) if i_size is about to be reduced, wait for outstanding DIO requests
    3) and only after that truncate inode blocks

    Reviewed-by: Jan Kara
    Signed-off-by: Dmitry Monakhov
    Signed-off-by: "Theodore Ts'o"

    Dmitry Monakhov
     
  • Inode's block defrag and ext4_change_inode_journal_flag() may
    affect the result of nonlocked DIO reads, so proper synchronization
    is required.

    - Add missing inode_dio_wait() calls where appropriate
    - Check inode state under extra i_dio_count reference.

    Reviewed-by: Jan Kara
    Signed-off-by: Dmitry Monakhov
    Signed-off-by: "Theodore Ts'o"

    Dmitry Monakhov
     
  • The current unwritten extent conversion state machine is very fuzzy.
    - For an unknown reason it performs conversion under i_mutex. What for?
    My diagnosis:
    We already protect the extent tree with i_data_sem, and truncate and punch_hole
    should wait for DIO, so the only data we have to protect is the end_io->flags
    modification. But only flush_completed_IO and end_io_work modify these
    flags, and we can serialize them via i_completed_io_lock.

    Currently all these games with mutex_trylock result in the following deadlock
    truncate:                          kworker:
    ext4_setattr                       ext4_end_io_work
    mutex_lock(i_mutex)
    inode_dio_wait(inode) ->BLOCK
    DEADLOCK
    flags modification is protected by ei->ext4_complete_io_lock

    Full list of changes:
    - Move all completion end_io related routines to page-io.c in order to improve
    logic locality
    - Move open coded logic from various xx_end_xx routines to ext4_add_complete_io()
    - remove EXT4_IO_END_FSYNC
    - Improve SMP scalability by removing useless i_mutex which does not
    protect io->flags anymore.
    - Reduce lock contention on i_completed_io_lock by optimizing list walk.
    - Rename ext4_end_io_nolock to ext4_end_io and make it static
    - Check flush completion status in ext4_ext_punch_hole(), because it is
    not a good idea to punch blocks from a corrupted inode.

    Changes since V3 (in response to Jan's comments):
    Fall back to active flush_completed_IO() approach in order to prevent
    performance issues with nolocked DIO reads.
    Changes since V2:
    Fix use-after-free caused by race truncate vs end_io_work

    Signed-off-by: Dmitry Monakhov
    Signed-off-by: "Theodore Ts'o"

    Dmitry Monakhov
     
  • ext4_set_io_unwritten_flag() will increment the i_unwritten counter, so
    once we mark end_io with EXT4_END_IO_UNWRITTEN we have to revert it
    on the error path.

    - add missing error checks to prevent counter leakage
    - ext4_end_io_nolock() will clear the EXT4_END_IO_UNWRITTEN flag to signal
    that conversion has finished.
    - add BUG_ON to ext4_free_end_io() to prevent similar leakage in the future.

    The visible effect of this bug is that unaligned aio_stress may deadlock.

    Reviewed-by: Jan Kara
    Signed-off-by: Dmitry Monakhov
    Signed-off-by: "Theodore Ts'o"

    Dmitry Monakhov
     
  • The AIO/DIO prefix is wrong because the counter accounts for unwritten extents,
    which may also be scheduled from buffered write endio.

    Reviewed-by: Jan Kara
    Signed-off-by: Dmitry Monakhov
    Signed-off-by: "Theodore Ts'o"

    Dmitry Monakhov
     
  • The generic inode has an unused i_private pointer which may be used as
    cur_aio_dio storage.

    TODO: If cur_aio_dio is passed as an argument to get_block_t, this will allow
    concurrent AIO_DIO requests.

    Reviewed-by: Zheng Liu
    Reviewed-by: Jan Kara
    Signed-off-by: Dmitry Monakhov
    Signed-off-by: "Theodore Ts'o"

    Dmitry Monakhov
     

27 Sep, 2012

15 commits

  • Convert cpu_to_leXX(leXX_to_cpu(E1) + E2) to use leXX_add_cpu().

    dpatch engine is used to auto generate this patch.
    (https://github.com/weiyj/dpatch)

    Signed-off-by: Wei Yongjun
    Signed-off-by: "Theodore Ts'o"

    Wei Yongjun
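
The conversion pattern can be sketched in portable userspace C. The helpers below are hypothetical stand-ins that model a little-endian field as four bytes; they are not the kernel's byteorder macros, only an illustration of what "add a CPU-order value to a little-endian field in place" means.

```c
#include <stdint.h>

/* Read a 32-bit little-endian value from memory, independent of host order. */
static uint32_t load_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Store a 32-bit value to memory in little-endian byte order. */
static void store_le32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v & 0xff);
    p[1] = (uint8_t)((v >> 8) & 0xff);
    p[2] = (uint8_t)((v >> 16) & 0xff);
    p[3] = (uint8_t)((v >> 24) & 0xff);
}

/*
 * Sketch of the combined operation: equivalent to the open-coded
 * "convert to CPU order, add, convert back" sequence, done in one call.
 */
static void le32_add_cpu_sketch(uint8_t *field, uint32_t val)
{
    store_le32(field, load_le32(field) + val);
}
```

The single helper removes the chance of forgetting one of the two conversions at a call site.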
     
  • When ext4_bread() returns NULL and err is set to zero, this means
    there is no physical block mapped to the specified logical block
    number. (Previous to commit 90b0a97323, err was uninitialized in this
    case, which caused other problems.)

    The directory handling routines use ext4_bread() in many places, the
    fact that ext4_bread() now returns NULL with err set to zero could
    cause problems since a number of these functions will simply return
    the value of err if the result of ext4_bread() was the NULL pointer,
    causing the caller of the function to think that the function was
    successful.

    Since directories should never contain holes, this case can only
    happen if the file system is corrupted. This commit audits all of the
    callers of ext4_bread(), and makes sure they do the right thing if a
    hole in a directory is found by ext4_bread().

    Some ext4_bread() callers did not need any changes because they
    already had their own hole-detection paths.

    Signed-off-by: Carlos Maiolino
    Signed-off-by: "Theodore Ts'o"

    Carlos Maiolino
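
The caller-side problem can be sketched as follows. `fake_bread()` is a hypothetical stand-in that only imitates the "NULL with err == 0 means hole" convention described above, and mapping a directory hole to -EIO is an assumption made for this illustration rather than something stated in the commit message.

```c
#include <errno.h>
#include <stddef.h>

struct fake_bh { int blk; };

/*
 * Hypothetical stand-in for the read routine: block 1 is a hole (NULL with
 * *err == 0), block 2 fails with a real error, anything else succeeds.
 */
static struct fake_bh *fake_bread(int blk, int *err)
{
    static struct fake_bh bh;

    *err = 0;
    if (blk == 1)
        return NULL;
    if (blk == 2) {
        *err = -ENOMEM;
        return NULL;
    }
    bh.blk = blk;
    return &bh;
}

/*
 * A caller that must not report success for a hole. Returning err unchanged
 * would return 0 for block 1 and make the caller's caller think all is well.
 */
static int read_dir_block(int blk)
{
    int err;
    struct fake_bh *bh = fake_bread(blk, &err);

    if (!bh) {
        if (err == 0)
            err = -EIO; /* assumed: treat a directory hole as corruption */
        return err;
    }
    return 0;
}
```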
     
  • In the check code above, if orig_start != donor_start, we would
    return -EINVAL. So here, orig_start should be equal to donor_start.
    Remove the redundant check here.

    Signed-off-by: Wang Sheng-Hui
    Signed-off-by: "Theodore Ts'o"

    Wang Sheng-Hui
     
  • If the file system contains errors and it is being mounted read-only,
    don't clear the orphan list. We should minimize changes to the file
    system if it is mounted read-only.

    Signed-off-by: Eric Sandeen
    Signed-off-by: "Theodore Ts'o"

    Eric Sandeen
     
  • When the EXT4_IOC_MOVE_EXT ioctl() fails on bigalloc file systems, we
    should jump to the 'mext_out' label to release the donor file reference.

    Signed-off-by: Djalal Harouni
    Signed-off-by: "Theodore Ts'o"

    Djalal Harouni
     
  • With minor tweaks regarding the minimum extent size to discard and
    discarded bytes reporting, FITRIM can be enabled on bigalloc file
    systems and it works without any problem.

    This patch fixes minlen handling and discarded bytes reporting to
    take bigalloc-enabled file systems into consideration, and finally
    removes the restriction and allows FITRIM to be used on file systems with
    the bigalloc feature enabled.

    Reviewed-by: Carlos Maiolino
    Signed-off-by: Lukas Czerner
    Signed-off-by: "Theodore Ts'o"

    Lukas Czerner
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • The code tracking when a transaction needs to be committed on fdatasync(2)
    forgets to handle the situation when only the inode's i_size is changed. Thus
    in such situations fdatasync(2) doesn't force the transaction with the new
    i_size to disk, and that can result in a wrong i_size after a crash.

    Fix the issue by updating inode's i_datasync_tid whenever its size is
    updated.

    CC: # >= 2.6.32
    Reported-by: Kristian Nielsen
    Signed-off-by: Jan Kara

    Jan Kara
     
  • ext4_special_inode_operations have their own ifdef CONFIG_EXT4_FS_XATTR
    to mask those methods. And ext4_iget also always sets it, so there is
    an inconsistency.

    Signed-off-by: Bernd Schubert
    Signed-off-by: "Theodore Ts'o"
    Cc: stable@vger.kernel.org

    Bernd Schubert
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Remove unused function ext4_ext_check_cache() and merge the code back to
    the ext4_ext_in_cache().

    Reviewed-by: Carlos Maiolino
    Signed-off-by: Lukas Czerner
    Signed-off-by: "Theodore Ts'o"

    Lukas Czerner
     
  • Use kmem_cache_zalloc() instead of kmem_cache_alloc() and memset().

    spatch with a semantic match was used to find this problem.
    (http://coccinelle.lip6.fr/)

    Signed-off-by: Wei Yongjun
    Signed-off-by: "Theodore Ts'o"

    Wei Yongjun
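
A userspace analogy of the same cleanup, using the standard allocator rather than the kernel slab API. `struct record` and both function names are hypothetical; the sketch only shows the "allocate then memset" pair collapsing into a single zeroing allocation.

```c
#include <stdlib.h>
#include <string.h>

struct record {
    int id;
    char name[16];
};

/* Before: allocate, then clear by hand. */
static struct record *record_alloc_old(void)
{
    struct record *r = malloc(sizeof(*r));

    if (r)
        memset(r, 0, sizeof(*r));
    return r;
}

/* After: one call that returns zeroed memory. */
static struct record *record_alloc_new(void)
{
    return calloc(1, sizeof(struct record));
}
```

Both variants return zero-filled memory or NULL; the second simply cannot forget the clearing step.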
     
  • An uninitialized extent may become initialized (by a parallel writeback task)
    at any moment after we drop i_data_sem, so we have to recheck the extent's
    state after we hold the page's lock and i_data_sem.

    If we are about to change a page's mapping we must hold the page's lock in
    order to serialize with other users.

    Signed-off-by: Dmitry Monakhov
    Signed-off-by: "Theodore Ts'o"

    Dmitry Monakhov
     
  • Non-exhaustive list of bugs:
    1) the uninitialized extent optimization does not hold the page's lock
    and simply replaces branches; after that the writeback code goes
    crazy because the block mapping changed under its feet:
    kernel BUG at fs/ext4/inode.c:1434! ( 288'th xfstress)

    2) an uninitialized extent may become initialized right after we
    drop i_data_sem, so the extent state must be rechecked

    3) Locked pages go uptodate via the following sequence:
    ->readpage(page); lock_page(page); use_that_page(page)
    But after readpage() one may invalidate the page because it is
    uptodate and unlocked (the reclaimer does that)
    As a result: kernel bug at include/linux/buffer_head.c:133!

    4) We call write_begin() with an already opened transaction, which
    results in the following deadlock:
    ->move_extent_per_page()
    ->ext4_journal_start()-> hold journal transaction
    ->write_begin()
    ->ext4_da_write_begin()
    ->ext4_nonda_switch()
    ->writeback_inodes_sb_if_idle() --> will wait for journal_stop()

    5) try_to_release_page() may fail, and it does fail if one of the page's bhs
    is pinned by the journal

    6) If we are about to change a page's mapping we MUST hold its lock during
    the entire remapping procedure; this is true for both pages (original and
    donor)

    Fixes:

    - Avoid (1) and (2) simply by temporarily dropping the uninitialized extent
    handling optimization; this will be reimplemented later.

    - Fix (3) by manually forcing the page to uptodate state without dropping its lock

    - Fix (4) by rearranging existing locking:
    from: journal_start(); ->write_begin
    to: write_begin(); journal_extend()
    - Fix (5) simply by checking the return value
    - Fix (6) by locking both pages (original and donor) during extent swap
    with the help of mext_page_double_lock()

    Signed-off-by: Dmitry Monakhov
    Signed-off-by: "Theodore Ts'o"

    Dmitry Monakhov