10 Nov, 2014

1 commit


08 Nov, 2014

1 commit

  • Pull xfs fixes from Dave Chinner:
    "This update fixes a warning in the new pagecache_isize_extended() and
    updates some related comments, another fix for zero-range
    misbehaviour, and an unfortunately large set of fixes for regressions
    in the bulkstat code.

    The bulkstat fixes are large but necessary. I wouldn't normally push
    such a rework for a -rcX update, but right now xfsdump can silently
    create incomplete dumps on 3.17 and it's possible that even xfsrestore
    won't notice that the dumps were incomplete. Hence we need to get
    this update into 3.17-stable kernels ASAP.

    In more detail, the refactoring work I committed in 3.17 has exposed a
    major hole in our QA coverage. With both xfsdump (the major user of
    bulkstat) and xfsrestore silently ignoring missing files in the
    dump/restore process, incomplete dumps were going unnoticed if they
    were being triggered. Many of the dump/restore filesets were so small
    that they didn't even have a chance of triggering the loop iteration
    bugs we introduced in 3.17, so we didn't exercise the code
    sufficiently, either.

    We have already taken steps to improve QA coverage in xfstests to
    avoid this happening again, and I've done a lot of manual verification
    of dump/restore on very large data sets (tens of millions of inodes)
    over the past week to verify this patch set results in bulkstat behaving
    the same way as it does on 3.16.

    Unfortunately, the fixes are not exactly simple - in tracking down the
    problem historic API warts were discovered (e.g. xfsdump has been
    working around a 20 year old bug in the bulkstat API for the past 10
    years) and so that complicated the process of diagnosing and fixing
    the problems. i.e. we had to fix bugs in the code as well as
    discover and re-introduce the userspace visible API bugs that we
    unwittingly "fixed" in 3.17 that xfsdump relied on to work correctly.

    Summary:

    - incorrect warnings about i_mutex locking in pagecache_isize_extended()
    and updates comments to match expected locking
    - another zero-range bug fix for stray file size updates
    - a bunch of fixes for regressions in the bulkstat code introduced in
    3.17"

    * tag 'xfs-for-linus-3.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
    xfs: track bulkstat progress by agino
    xfs: bulkstat error handling is broken
    xfs: bulkstat main loop logic is a mess
    xfs: bulkstat chunk-formatter has issues
    xfs: bulkstat chunk formatting cursor is broken
    xfs: bulkstat btree walk doesn't terminate
    mm: Fix comment before truncate_setsize()
    xfs: rework zero range to prevent invalid i_size updates
    mm: Remove false WARN_ON from pagecache_isize_extended()
    xfs: Check error during inode btree iteration in xfs_bulkstat()
    xfs: bulkstat doesn't release AGI buffer on error

    Linus Torvalds
     

07 Nov, 2014

6 commits

  • The bulkstat main loop progress is tracked by the "lastino"
    variable, which is a full 64 bit inode. However, the loop actually
    works on agno/agino pairs, and so there's a significant disconnect
    between the rest of the loop and the main cursor. Convert this to
    use the agino, and pass the agino into the chunk formatting function
    and convert it too.

    This gets rid of the inconsistency in the loop processing, and
    finally makes it simple for us to skip inodes at any point in the
    loop simply by incrementing the agino cursor.

    cc: # 3.17
    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
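
To illustrate the agno/agino split this message refers to: an XFS inode number carries the allocation group number (agno) in its high bits and the AG-relative inode number (agino) in its low bits. The following is a minimal userspace sketch, not the kernel code. The fixed `AGINO_BITS` width and all helper names are assumptions made for illustration; the real width depends on filesystem geometry.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for illustration: the low 20 bits hold the AG-relative inode. */
#define AGINO_BITS 20u

/* Hypothetical helpers, not the kernel's macros. */
uint32_t ino_to_agno(uint64_t ino)
{
    return (uint32_t)(ino >> AGINO_BITS);
}

uint32_t ino_to_agino(uint64_t ino)
{
    return (uint32_t)(ino & ((UINT64_C(1) << AGINO_BITS) - 1));
}

uint64_t agino_to_ino(uint32_t agno, uint32_t agino)
{
    return ((uint64_t)agno << AGINO_BITS) | agino;
}

/* Skipping inodes inside one AG is a plain increment of the agino cursor. */
uint64_t skip_inodes(uint64_t ino, uint32_t count)
{
    uint32_t agno = ino_to_agno(ino);
    uint32_t agino = ino_to_agino(ino);

    assert((uint64_t)agino + count < (UINT64_C(1) << AGINO_BITS));
    return agino_to_ino(agno, agino + count);
}
```

With a cursor kept as an agno/agino pair, the loop never has to split and recombine a 64 bit inode number just to advance by one.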
     
  • The error propagation is a horror - xfs_bulkstat() returns
    a rval variable which is only set if there are formatter errors. Any
    sort of btree walk error or corruption will cause the bulkstat walk
    to terminate but will not pass an error back to userspace. Worse
    is the fact that formatter errors will also be ignored if any inodes
    were correctly formatted into the user buffer.

    Hence bulkstat can fail badly yet still report success to userspace.
    This causes significant issues with xfsdump not dumping everything
    in the filesystem yet reporting success. It's not until a restore
    fails that there is any indication that the dump was bad and that
    bulkstat failed. This patch now triggers xfsdump to fail with
    bulkstat errors rather than silently missing files in the dump.

    This now causes bulkstat to fail when the lastino cookie does not
    fall inside an existing inode chunk. The pre-3.17 code tolerated
    that error by allowing the code to move to the next inode chunk
    as the agino target is guaranteed to fall into the next btree
    record.

    With the fixes up to this point in the series, xfsdump now passes on
    the troublesome filesystem image that exposes all these bugs.

    cc:
    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster

    Dave Chinner
     
  • There are a bunch of variables that are more widely scoped than they
    need to be, obfuscated user buffer checks and tortured "next inode"
    tracking. This all needs cleaning up to expose the real issues that
    need fixing.

    cc: # 3.17
    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • The loop construct has issues:
    - clustidx is completely unused, so remove it.
    - the loop tries to be smart by terminating when the
    "freecount" tells it that all inodes are free. Just drop
    it as in most cases we have to scan all inodes in the
    chunk anyway.
    - move the "user buffer left" condition check to the only
    point where we consume space in the user buffer.
    - move the initialisation of agino out of the loop, leaving
    just a simple loop control logic using the clusteridx.

    Also, double handling of the user buffer variables leads to problems
    tracking the current state - use the cursor variables directly
    rather than keeping local copies and then having to update the
    cursor before returning.

    cc: # 3.17
    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • The xfs_bulkstat_agichunk formatting cursor takes buffer values from
    the main loop and passes them via the structure to the chunk
    formatter, and then writes the changed values back into the main loop
    local variables. Unfortunately, this complex dance is full of corner
    cases that aren't handled correctly.

    The biggest problem is that it is double handling the information in
    both the main loop and the chunk formatting function, leading to
    inconsistent updates and endless loops where progress is not made.

    To fix this, push the struct xfs_bulkstat_agichunk outwards to be
    the primary holder of user buffer information. This removes the
    double handling in the main loop.

    Also, pass the last inode processed by the chunk formatter as a
    separate parameter as it is purely an output variable and is not
    related to the user buffer consumption cursor.

    Finally, the chunk formatting code is not shared by anyone, so make
    it local to xfs_itable.c.

    cc: # 3.17
    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • The bulkstat code has several different ways of detecting the end of
    an AG when doing a walk. They are not consistently detected, and the
    code that checks for the end of AG conditions is not consistently
    coded. Hence there are conditions where the walk code can get stuck in
    an endless loop making no progress and not triggering any
    termination conditions.

    Convert all the "tmp/i" status return codes from btree operations
    to a common name (stat) and apply end-of-ag detection to these
    operations consistently.

    cc: # 3.17
    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     

06 Nov, 2014

1 commit


05 Nov, 2014

1 commit

  • ovl_cache_put() can be called from ovl_dir_reset() if the cache needs to be
    rebuilt. We did list_del() on the cursor, which results in an Oops on the
    poisoned pointer in ovl_seek_cursor().

    Reported-by: Jordi Pujol Palomer
    Signed-off-by: Miklos Szeredi
    Tested-by: Jordi Pujol Palomer
    Signed-off-by: Al Viro

    Miklos Szeredi
     

04 Nov, 2014

1 commit

  • If we hit any errors in btrfs_lookup_csums_range, we'll loop through all
    the csums we allocate and free them. But the code was using list_entry
    incorrectly, and ended up trying to free the on-stack list_head instead.

    This bug came from commit 0678b6185

    btrfs: Don't BUG_ON kzalloc error in btrfs_lookup_csums_range()

    Signed-off-by: Chris Mason
    Reported-by: Erik Berg
    cc: stable@vger.kernel.org # 3.3 or newer

    Chris Mason
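
The kind of list_entry mistake described above can be sketched in userspace. This is a simplified illustration under stated assumptions, not the btrfs code: the list primitives are minimal re-implementations, and `struct csum_item`, `queue_items()` and `free_items()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Minimal userspace stand-ins for the kernel list primitives. */
struct list_head { struct list_head *next, *prev; };

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))
#define list_entry(ptr, type, member) container_of(ptr, type, member)

void list_init(struct list_head *h) { h->next = h->prev = h; }
int list_empty(const struct list_head *h) { return h->next == h; }

void list_add_tail(struct list_head *n, struct list_head *h)
{
    n->prev = h->prev;
    n->next = h;
    h->prev->next = n;
    h->prev = n;
}

void list_del(struct list_head *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
}

/* Hypothetical stand-in for a checksum record queued on a list. */
struct csum_item {
    unsigned long long bytenr;
    struct list_head list;
};

/* Queue n items on head; returns 0 on success, -1 on allocation failure. */
int queue_items(struct list_head *head, int n)
{
    for (int i = 0; i < n; i++) {
        struct csum_item *it = malloc(sizeof(*it));
        if (!it)
            return -1;
        it->bytenr = (unsigned long long)i * 4096;
        list_add_tail(&it->list, head);
    }
    return 0;
}

/* Error-path cleanup: free every queued item, return how many were freed. */
int free_items(struct list_head *head)
{
    int freed = 0;

    while (!list_empty(head)) {
        /* Resolve the first ELEMENT (head->next). list_entry(head, ...)
         * would compute a bogus pointer from the list head itself, which
         * may live on the caller's stack. */
        struct csum_item *it = list_entry(head->next, struct csum_item, list);

        list_del(&it->list);
        free(it);
        freed++;
    }
    return freed;
}
```

The only difference between the broken and the working loop is which pointer is handed to list_entry(): the head is an anchor, never an embedded member of an item.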
     

03 Nov, 2014

1 commit

  • Pull VFS fixes from Al Viro:
    "A bunch of assorted fixes, most of them followups to overlayfs merge"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    ovl: initialize ->is_cursor
    Return short read or 0 at end of a raw device, not EIO
    isofs: don't bother with ->d_op for normal case
    isofs_cmp(): we'll never see a dentry for . or ..
    overlayfs: fix lockdep misannotation
    ovl: fix check for cursor
    overlayfs: barriers for opening upper-layer directory
    rcu: Provide counterpart to rcu_dereference() for non-RCU situations
    staging: android: logger: Fix log corruption regression

    Linus Torvalds
     

02 Nov, 2014

1 commit

  • Pull btrfs fixes from Chris Mason:
    "Filipe is nailing down some problems with our skinny extent variation,
    and Dave's patch fixes endian problems in the new super block checks"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    Btrfs: fix race that makes btrfs_lookup_extent_info miss skinny extent items
    Btrfs: properly clean up btrfs_end_io_wq_cache
    Btrfs: fix invalid leaf slot access in btrfs_lookup_extent()
    btrfs: use macro accessors in superblock validation checks

    Linus Torvalds
     

01 Nov, 2014

3 commits


31 Oct, 2014

3 commits

  • Author: David Jeffery
    Changes to the basic direct I/O code have broken the raw driver when reading
    to the end of a raw device. Instead of returning a short read for a read that
    extends partially beyond the device's end or 0 when at the end of the device,
    these reads now return EIO.

    The raw driver needs the same end of device handling as was added for normal
    block devices. Using blkdev_read_iter, which has the needed size checks,
    prevents the EIO conditions at the end of the device.

    Signed-off-by: David Jeffery
    Signed-off-by: Al Viro

    David Jeffery
     
  • we only need it for joliet and case-insensitive mounts

    Signed-off-by: Al Viro

    Al Viro
     
  • The man page for open(2) indicates that when O_CREAT is specified, the
    'mode' argument applies only to future accesses to the file:

    Note that this mode applies only to future accesses of the newly
    created file; the open() call that creates a read-only file
    may well return a read/write file descriptor.

    The man page for open(2) implies that 'mode' is treated identically by
    O_CREAT and O_TMPFILE.

    O_TMPFILE, however, behaves differently:

    int fd = open("/tmp", O_TMPFILE | O_RDWR, 0);
    assert(fd == -1);
    assert(errno == EACCES);

    int fd = open("/tmp", O_TMPFILE | O_RDWR, 0600);
    assert(fd > 0);

    For O_CREAT, do_last() sets acc_mode to MAY_OPEN only:

    if (*opened & FILE_CREATED) {
            /* Don't check for write permission, don't truncate */
            open_flag &= ~O_TRUNC;
            will_truncate = false;
            acc_mode = MAY_OPEN;
            path_to_nameidata(path, nd);
            goto finish_open_created;
    }

    But for O_TMPFILE, do_tmpfile() passes the full op->acc_mode to
    may_open().

    This patch lines up the behavior of O_TMPFILE with O_CREAT. After the
    inode is created, may_open() is called with acc_mode = MAY_OPEN, in
    do_tmpfile().

    A different, but related glibc bug revealed the discrepancy:
    https://sourceware.org/bugzilla/show_bug.cgi?id=17523

    The glibc lazily loads the 'mode' argument of open() and openat() using
    va_arg() only if O_CREAT is present in 'flags' (to support both the 2
    argument and the 3 argument forms of open; same idea for openat()).
    However, the glibc ignores the 'mode' argument if O_TMPFILE is in
    'flags'.

    On x86_64, for open(), it magically works anyway, as 'mode' is in
    RDX when entering open(), and is still in RDX on SYSCALL, which is where
    the kernel looks for the 3rd argument of a syscall.

    But openat() is not quite so lucky: 'mode' is in RCX when entering the
    glibc wrapper for openat(), while the kernel looks for the 4th argument
    of a syscall in R10. Indeed, the syscall calling convention differs from
    the regular calling convention in this respect on x86_64. So the kernel
    sees mode = 0 when trying to use glibc openat() with O_TMPFILE, and
    fails with EACCES.

    Signed-off-by: Eric Rannaud
    Acked-by: Andy Lutomirski
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Eric Rannaud
     

30 Oct, 2014

16 commits

  • ext4_ext_convert_to_initialized() can return more blocks than are
    actually allocated from map->m_lblk in case where initial part of the
    on-disk extent is zeroed out. Luckily this doesn't have serious
    consequences because the caller currently uses the return value
    only to unmap metadata buffers. Anyway this is a data
    corruption/exposure problem waiting to happen so fix it.

    Coverity-id: 1226848
    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • When clearing inode journal flag, we call jbd2_journal_flush() to force
    all the journalled data to their final locations. Currently we ignore
    when this fails and continue clearing inode journal flag. This isn't a
    big problem because when jbd2_journal_flush() fails, journal is likely
    aborted anyway. But it can still lead to somewhat confusing results so
    rather bail out early.

    Coverity-id: 989044
    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • When ext4_handle_dirty_dx_node() or ext4_handle_dirty_dirent_node()
    fail, there's really something wrong with the fs and there's no point in
    continuing further. Just return error from make_indexed_dir() in that
    case. Also initialize frames array so that if we return early due to
    error, dx_release() doesn't try to dereference uninitialized memory
    (which could happen also due to error in do_split()).

    Coverity-id: 741300
    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o
    Cc: stable@vger.kernel.org

    Jan Kara
     
  • The old hash function didn't work well for 64-bit block numbers, and
    used undefined (negative) shift right behavior. Use the generic
    64-bit hash function instead.

    Signed-off-by: Theodore Ts'o
    Reported-by: Andrey Ryabinin

    Theodore Ts'o
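
As an illustration of hashing a 64-bit block number without any undefined shift, here is a minimal multiplicative-hash sketch. It is not the kernel's generic hash function: the multiplier `SKETCH_GOLDEN_64` and the name `blocknr_hash()` are assumptions chosen for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 2^64 divided by the golden ratio, a common multiplier for
 * multiplicative hashing. The kernel's own constant may differ. */
#define SKETCH_GOLDEN_64 UINT64_C(0x9E3779B97F4A7C15)

/* Hash a 64-bit block number down to `bits` bits (1 <= bits <= 32). */
uint32_t blocknr_hash(uint64_t blocknr, unsigned int bits)
{
    assert(bits >= 1 && bits <= 32);
    /* The unsigned multiply wraps modulo 2^64, which is well defined, and
     * the shift count always lies in [32, 63], so it can never be negative
     * or reach the width of the type. */
    return (uint32_t)((blocknr * SKETCH_GOLDEN_64) >> (64 - bits));
}
```

Taking the top bits of the product means every bit of the block number, including the upper 32, can influence the bucket index.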
     
  • The O_DIRECT flag can be toggled via fcntl(F_SETFL). But this value is checked
    twice, inside ext4_file_write_iter() and __generic_file_write(), which
    results in a BUG_ON inside ext4_direct_IO.

    Let's initialize iocb->private unconditionally.

    TESTCASE: xfstest:generic/036 https://patchwork.ozlabs.org/patch/402445/

    #TYPICAL STACK TRACE:
    kernel BUG at fs/ext4/inode.c:2960!
    invalid opcode: 0000 [#1] SMP
    Modules linked in: brd iTCO_wdt lpc_ich mfd_core igb ptp dm_mirror dm_region_hash dm_log dm_mod
    CPU: 6 PID: 5505 Comm: aio-dio-fcntl-r Not tainted 3.17.0-rc2-00176-gff5c017 #161
    Hardware name: Intel Corporation W2600CR/W2600CR, BIOS SE5C600.86B.99.99.x028.061320111235 06/13/2011
    task: ffff88080e95a7c0 ti: ffff88080f908000 task.ti: ffff88080f908000
    RIP: 0010:[] [] ext4_direct_IO+0x162/0x3d0
    RSP: 0018:ffff88080f90bb58 EFLAGS: 00010246
    RAX: 0000000000000400 RBX: ffff88080fdb2a28 RCX: 00000000a802c818
    RDX: 0000040000080000 RSI: ffff88080d8aeb80 RDI: 0000000000000001
    RBP: ffff88080f90bbc8 R08: 0000000000000000 R09: 0000000000001581
    R10: 0000000000000000 R11: 0000000000000000 R12: ffff88080d8aeb80
    R13: ffff88080f90bbf8 R14: ffff88080fdb28c8 R15: ffff88080fdb2a28
    FS: 00007f23b2055700(0000) GS:ffff880818400000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f23b2045000 CR3: 000000080cedf000 CR4: 00000000000407e0
    Stack:
    ffff88080f90bb98 0000000000000000 7ffffffffffffffe ffff88080fdb2c30
    0000000000000200 0000000000000200 0000000000000001 0000000000000200
    ffff88080f90bbc8 ffff88080fdb2c30 ffff88080f90be08 0000000000000200
    Call Trace:
    [] generic_file_direct_write+0xed/0x180
    [] __generic_file_write_iter+0x222/0x370
    [] ext4_file_write_iter+0x34b/0x400
    [] ? aio_run_iocb+0x239/0x410
    [] ? aio_run_iocb+0x239/0x410
    [] ? local_clock+0x25/0x30
    [] ? __lock_acquire+0x274/0x700
    [] ? ext4_unwritten_wait+0xb0/0xb0
    [] aio_run_iocb+0x286/0x410
    [] ? local_clock+0x25/0x30
    [] ? lock_release_holdtime+0x29/0x190
    [] ? lookup_ioctx+0x4b/0xf0
    [] do_io_submit+0x55b/0x740
    [] ? do_io_submit+0x3ca/0x740
    [] SyS_io_submit+0x10/0x20
    [] system_call_fastpath+0x16/0x1b
    Code: 01 48 8b 80 f0 01 00 00 48 8b 18 49 8b 45 10 0f 85 f1 01 00 00 48 03 45 c8 48 3b 43 48 0f 8f e3 01 00 00 49 83 7c
    24 18 00 75 04 0b eb fe f0 ff 83 ec 01 00 00 49 8b 44 24 18 8b 00 85 c0 89
    RIP [] ext4_direct_IO+0x162/0x3d0
    RSP

    Reported-by: Sasha Levin
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Dmitry Monakhov
    Cc: stable@vger.kernel.org

    Dmitry Monakhov
     
  • If we can't load the journal, remove the procfs files for the extent
    status information file to avoid leaking resources.

    Signed-off-by: Darrick J. Wong
    Signed-off-by: Theodore Ts'o
    Cc: stable@vger.kernel.org

    Darrick J. Wong
     
  • ext4 does not permit changing the metadata or journal checksum feature
    flag while mounted. Until we decide to support that, don't allow a
    remount to change the journal_csum flag (right now we silently fail to
    change anything).

    Signed-off-by: Darrick J. Wong
    Signed-off-by: Theodore Ts'o

    Darrick J. Wong
     
  • If metadata checksumming is turned on for the FS, we need to tell the
    journal to use checksumming too.

    Signed-off-by: Darrick J. Wong
    Signed-off-by: Theodore Ts'o
    Cc: stable@vger.kernel.org

    Darrick J. Wong
     
  • When we fail to load block bitmap in __ext4_new_inode() we will
    dereference NULL pointer in ext4_journal_get_write_access(). So check
    for error from ext4_read_block_bitmap().

    Coverity-id: 989065
    Cc: stable@vger.kernel.org
    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • When there are no meta block groups update_backups() will compute the
    backup block in 32-bit arithmetics thus possibly overflowing the block
    number and corrupting the filesystem. OTOH filesystems without meta
    block groups larger than 16 TB should be rare. Fix the problem by doing
    the counting in 64-bit arithmetics.

    Coverity-id: 741252
    CC: stable@vger.kernel.org
    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Lukas Czerner

    Jan Kara
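
The overflow described above is easy to reproduce in isolation. The sketch below is not the ext4 code; both helper names are hypothetical and only contrast a product computed in 32 bits with one where an operand is widened first.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: first block of a group, computed in 32 bits.
 * The product wraps modulo 2^32 before it is widened to 64 bits. */
uint64_t group_first_block_32(uint32_t group, uint32_t blocks_per_group)
{
    uint32_t blk = group * blocks_per_group;

    return blk;
}

/* Same computation with one operand widened before the multiply. */
uint64_t group_first_block_64(uint32_t group, uint32_t blocks_per_group)
{
    return (uint64_t)group * blocks_per_group;
}
```

With 32768 blocks per group, group 131072 starts at block 2^32, which is the 16 TB boundary for 4 KiB blocks. The 32-bit version yields block 0 for that group, while the 64-bit version yields 4294967296.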
     
  • Merge misc fixes from Andrew Morton:
    "21 fixes"

    * emailed patches from Andrew Morton : (21 commits)
    mm/balloon_compaction: fix deflation when compaction is disabled
    sh: fix sh770x SCIF memory regions
    zram: avoid NULL pointer access in concurrent situation
    mm/slab_common: don't check for duplicate cache names
    ocfs2: fix d_splice_alias() return code checking
    mm: rmap: split out page_remove_file_rmap()
    mm: memcontrol: fix missed end-writeback page accounting
    mm: page-writeback: inline account_page_dirtied() into single caller
    lib/bitmap.c: fix undefined shift in __bitmap_shift_{left|right}()
    drivers/rtc/rtc-bq32k.c: fix register value
    memory-hotplug: clear pgdat which is allocated by bootmem in try_offline_node()
    drivers/rtc/rtc-s3c.c: fix initialization failure without rtc source clock
    kernel/kmod: fix use-after-free of the sub_info structure
    drivers/rtc/rtc-pm8xxx.c: rework to support pm8941 rtc
    mm, thp: fix collapsing of hugepages on madvise
    drivers: of: add return value to of_reserved_mem_device_init()
    mm: free compound page with correct order
    gcov: add ARM64 to GCOV_PROFILE_ALL
    fsnotify: next_i is freed during fsnotify_unmount_inodes.
    mm/compaction.c: avoid premature range skip in isolate_migratepages_range
    ...

    Linus Torvalds
     
  • The zero range operation is analogous to fallocate with the exception of
    converting the range to zeroes. E.g., it attempts to allocate zeroed
    blocks over the range specified by the caller. The XFS implementation
    kills all delalloc blocks currently over the aligned range, converts the
    range to allocated zero blocks (unwritten extents) and handles the
    partial pages at the ends of the range by sending writes through the
    pagecache.

    The current implementation suffers from several problems associated with
    inode size. If the aligned range covers an extending I/O, said I/O is
    discarded and an inode size update from a previous write never makes it
    to disk. Further, if an unaligned zero range extends beyond eof, the
    page write induced for the partial end page can itself increase the
    inode size, even if the zero range request is not supposed to update
    i_size (via KEEP_SIZE, similar to an fallocate beyond EOF).

    The latter behavior not only incorrectly increases the inode size, but
    can lead to stray delalloc blocks on the inode. Typically, post-eof
    preallocation blocks are either truncated on release or inode eviction
    or explicitly written to by xfs_zero_eof() on natural file size
    extension. If the inode size increases due to zero range, however,
    associated blocks leak into the address space having never been
    converted or mapped to pagecache pages. A direct I/O to such an
    uncovered range cannot convert the extent via writeback and will BUG().
    For example:

    $ xfs_io -fc "pwrite 0 128k" -c "fzero -k 1m 54321"
    ...
    $ xfs_io -d -c "pread 128k 128k"

    If the entire delalloc extent happens to not have page coverage
    whatsoever (e.g., delalloc conversion couldn't find a large enough free
    space extent), even a full file writeback won't convert what's left of
    the extent and we'll assert on inode eviction.

    Rework xfs_zero_file_space() to avoid buffered I/O for partial pages.
    Use the existing hole punch and prealloc mechanisms as primitives for
    zero range. This implementation is not efficient nor ideal as we
    writeback dirty data over the range and remove existing extents rather
    than convert to unwritten. The former writeback, however, is currently
    the only mechanism available to ensure consistency between pagecache and
    extent state. Even a pagecache truncate/delalloc punch prior to hole
    punch has led to inconsistencies due to racing with writeback.

    This provides a consistent, correct implementation of zero range that
    survives fsstress/fsx testing without assert failures. The
    implementation can be optimized from this point forward once the
    fundamental issue of pagecache and delalloc extent state consistency is
    addressed.

    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Brian Foster
     
  • xfs_bulkstat() doesn't check error return from xfs_btree_increment(). In
    case of specific fs corruption that could result in xfs_bulkstat()
    entering an infinite loop because we would be looping over the same
    chunk over and over again. Fix the problem by checking the return value
    and terminating the loop properly.

    Coverity-id: 1231338
    cc:
    Signed-off-by: Jan Kara
    Reviewed-by: Jie Liu
    Signed-off-by: Dave Chinner

    Jan Kara
     
  • d_splice_alias() can return a valid dentry, NULL or an ERR_PTR.
    Currently the code does not check for ERR_PTR and will cause an oops in
    ocfs2_dentry_attach_lock(). Fix this by using IS_ERR_OR_NULL().

    Signed-off-by: Richard Weinberger
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Richard Weinberger
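
The three-way return contract (valid pointer, NULL, or encoded error) can be sketched in userspace as follows. These are simplified lowercase stand-ins for the kernel's error-pointer helpers, and `struct dentry_sketch` and `pick_dentry()` are hypothetical names; this is not the ocfs2 code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: error codes occupy the top 4095 values of the address space. */
#define MAX_ERRNO_SKETCH 4095

void *err_ptr(long err) { return (void *)(intptr_t)err; }
long ptr_err(const void *p) { return (long)(intptr_t)p; }

int is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)(-MAX_ERRNO_SKETCH);
}

int is_err_or_null(const void *p)
{
    return p == NULL || is_err(p);
}

struct dentry_sketch { int id; };

/* Use the spliced result only when it is a real pointer; otherwise keep
 * the original. A bare "if (ret)" test would accept an encoded error and
 * the caller would later dereference it. */
struct dentry_sketch *pick_dentry(struct dentry_sketch *orig,
                                  struct dentry_sketch *ret)
{
    if (!is_err_or_null(ret))
        return ret;
    return orig;
}
```

An encoded error such as `err_ptr(-ENOMEM)` is non-NULL, which is why a plain NULL check is not enough to guard the later dereference.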
     
  • During file system stress testing on 3.10 and 3.12 based kernels, the
    umount command occasionally hung in fsnotify_unmount_inodes in the
    section of code:

    spin_lock(&inode->i_lock);
    if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
            spin_unlock(&inode->i_lock);
            continue;
    }

    As this section of code holds the global inode_sb_list_lock, eventually
    the system hangs trying to acquire the lock.

    Multiple crash dumps showed:

    The inode->i_state == 0x60 and i_count == 0 and i_sb_list would point
    back at itself. As this is not the value of list upon entry to the
    function, the kernel never exits the loop.

    To help narrow down the problem, the call to list_del_init in
    inode_sb_list_del was changed to list_del. This poisons the pointers in
    the i_sb_list and causes the kernel to panic if it traverses a freed
    inode.

    Subsequent stress testing panicked in fsnotify_unmount_inodes at the
    bottom of the list_for_each_entry_safe loop showing next_i had become
    free.

    We believe the root cause of the problem is that next_i is being freed
    during the window of time that the list_for_each_entry_safe loop
    temporarily releases inode_sb_list_lock to call fsnotify and
    fsnotify_inode_delete.

    The code in fsnotify_unmount_inodes attempts to prevent the freeing of
    inode and next_i by calling __iget. However, the code doesn't do the
    __iget call on next_i

    if i_count == 0 or
    if i_state & (I_FREEING | I_WILL_FREE)

    The patch addresses this issue by advancing next_i in the above two cases
    until we either find a next_i which we can __iget or we reach the end of
    the list. This makes the handling of next_i more closely match the
    handling of the variable "inode."

    The time to reproduce the hang is highly variable (from hours to days.) We
    ran the stress test on a 3.10 kernel with the proposed patch for a week
    without failure.

    During list_for_each_entry_safe, next_i is becoming free causing
    the loop to never terminate. Advance next_i in those cases where
    __iget is not done.

    Signed-off-by: Jerry Hoemann
    Cc: Jeff Kirsher
    Cc: Ken Helias
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerry Hoemann
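
The difference between the two removal styles mentioned in this message can be sketched in userspace. This is a simplified model, not the kernel list code: the `_sketch` helpers, the poison marker and `walk_to_head()` are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

/* Hypothetical poison marker; the kernel uses its own constants. */
static struct list_head poison_marker;
#define SKETCH_POISON (&poison_marker)

void list_init(struct list_head *h) { h->next = h->prev = h; }

void list_add_tail(struct list_head *n, struct list_head *h)
{
    n->prev = h->prev;
    n->next = h;
    h->prev->next = n;
    h->prev = n;
}

void list_unlink(struct list_head *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
}

/* Removal that leaves the node pointing back at itself. */
void list_del_init_sketch(struct list_head *n)
{
    list_unlink(n);
    list_init(n);
}

/* Removal that poisons the node so stale users fail loudly. */
void list_del_poison_sketch(struct list_head *n)
{
    list_unlink(n);
    n->next = n->prev = SKETCH_POISON;
}

/* Walk forward from pos until head is reached. Returns the number of steps,
 * -1 if head is not reached within limit steps (the walker is stuck), or
 * -2 if a poisoned pointer is encountered. */
int walk_to_head(const struct list_head *pos, const struct list_head *head,
                 int limit)
{
    for (int steps = 0; steps < limit; steps++) {
        if (pos == head)
            return steps;
        if (pos == SKETCH_POISON)
            return -2;
        pos = pos->next;
    }
    return -1;
}
```

A walker that still holds a pointer to a node removed with the self-pointing style spins on that node without ever reaching the head, which matches the hang seen in the crash dumps. The poisoning style turns the same stale access into an immediate, detectable failure.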
     
  • Pull block layer fixes from Jens Axboe:
    "A small collection of fixes for the current kernel. This contains:

    - Two error handling fixes from Jan Kara. One for null_blk on
    failure to add a device, and the other for the block/scsi_ioctl
    SCSI_IOCTL_SEND_COMMAND fixing up the error jump point.

    - A commit added in the merge window for the bio integrity bits
    unfortunately disabled merging for all requests if
    CONFIG_BLK_DEV_INTEGRITY wasn't set. Reverse the logic, so that
    integrity checking won't disallow merges when not enabled.

    - A fix from Ming Lei for merging and generating too many segments.
    This caused a BUG in virtio_blk.

    - Two error handling printk() fixups from Robert Elliott, improving
    the information given when we rate limit.

    - Error handling fixup on elevator_init() failure from Sudip
    Mukherjee.

    - A fix from Tony Battersby, fixing up a memory leak in the
    scatterlist handling with scsi-mq"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    block: Fix merge logic when CONFIG_BLK_DEV_INTEGRITY is not defined
    lib/scatterlist: fix memory leak with scsi-mq
    block: fix wrong error return in elevator_init()
    scsi: Fix error handling in SCSI_IOCTL_SEND_COMMAND
    null_blk: Cleanup error recovery in null_add_dev()
    blk-merge: recaculate segment if it isn't less than max segments
    fs: clarify rate limit suppressed buffer I/O errors
    fs: merge I/O error prints into one line

    Linus Torvalds
     

29 Oct, 2014

5 commits