17 May, 2015

3 commits

  • Pull UML hostfs fix from Richard Weinberger:
    "This contains a single fix for a regression introduced in 4.1-rc1"

    * 'for-linus-4.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml:
    hostfs: Use correct mask for file mode

    Linus Torvalds
     
  • Pull ext4 fixes from Ted Ts'o:
    "Fix a number of ext4 bugs; the most serious of which is a bug in the
    lazytime mount optimization code where we could end up updating the
    timestamps to the wrong inode"

    * tag 'for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    ext4: fix an ext3 collapse range regression in xfstests
    jbd2: fix r_count overflows leading to buffer overflow in journal recovery
    ext4: check for zero length extent explicitly
    ext4: fix NULL pointer dereference when journal restart fails
    ext4: remove unused function prototype from ext4.h
    ext4: don't save the error information if the block device is read-only
    ext4: fix lazytime optimization

    Linus Torvalds
     
  • Pull btrfs fixes from Chris Mason:
    "The first commit is a fix from Filipe for a very old extent buffer
    reuse race that triggered a BUG_ON. It hasn't come up often, I looked
    through old logs at FB and we hit it a handful of times over the last
    year.

    The rest are other corners he hit during testing"

    * 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    Btrfs: fix race when reusing stale extent buffers that leads to BUG_ON
    Btrfs: fix race between block group creation and their cache writeout
    Btrfs: fix panic when starting bg cache writeout after IO error
    Btrfs: fix crash after inode cache writeback failure

    Linus Torvalds
     

16 May, 2015

1 commit

  • Pull parisc fixes from Helge Deller:
    "One important patch which fixes crashes due to stack randomization on
    architectures where the stack grows upwards (currently parisc and
    metag only).

    This bug went unnoticed on parisc since kernel 3.14 where the flexible
    mmap memory layout support was added by commit 9dabf60dc4ab. The
    changes in fs/exec.c are inside an #ifdef CONFIG_STACK_GROWSUP section
    and will not affect other platforms.

    The other two patches rename the 'arg' argument of copy_thread() to
    'kthread_arg' and fix a printk output"

    * 'parisc-4.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
    parisc,metag: Fix crashes due to stack randomization on stack-grows-upwards architectures
    parisc: copy_thread(): rename 'arg' argument to 'kthread_arg'
    parisc: %pf is only for function pointers

    Linus Torvalds
     

15 May, 2015

8 commits

  • The xfstests test suite assumes that an attempt to collapse range on
    the range (0, 1) will return EOPNOTSUPP if the file system does not
    support collapse range. Commit 280227a75b56: "ext4: move check under
    lock scope to close a race" broke this, and this caused xfstests to
    fail when testing file systems that did not have the extents
    feature enabled.

    Reported-by: Eric Whitney
    Signed-off-by: Theodore Ts'o

    Theodore Ts'o
     
  • root->ino_ida is used for kernfs inode number allocations. Since IDA has
    a layered structure, different IDs can reside on the same layer, which
    is currently accounted to some memory cgroup. The problem is that each
    kmem cache of a memory cgroup has its own directory on sysfs (under
    /sys/fs/kernel//cgroup). If the inode number of such a
    directory or any file in it gets allocated from a layer accounted to the
    cgroup which the cache is created for, the cgroup will get pinned for
    good, because one has to free all kmem allocations accounted to a cgroup
    in order to release it and destroy all its kmem caches. That said we
    must not account layers of ino_ida to any memory cgroup.

    Since per net init operations may create new sysfs entries directly
    (e.g. lo device) or indirectly (nf_conntrack creates a new kmem cache
    per each namespace, which, in turn, creates new sysfs entries), an easy
    way to reproduce this issue is by creating network namespace(s) from
    inside a kmem-active memory cgroup.

    Signed-off-by: Vladimir Davydov
    Acked-by: Tejun Heo
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Greg Thelen
    Cc: Greg Kroah-Hartman
    Cc: [4.0.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • The journal revoke block recovery code does not check r_count for
    sanity, which means that an evil value of r_count could result in
    the kernel reading off the end of the revoke table and into whatever
    garbage lies beyond. This could crash the kernel, so fix that.

    However, in testing this fix, I discovered that the code to write
    out the revoke tables also was not correctly checking to see if the
    block was full -- the current offset check is fine so long as the
    revoke table space size is a multiple of the record size, but this
    is not true when either journal_csum_v[23] are set.
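
    The two checks described above can be sketched in user-space C. This is
    only an illustration under stated assumptions, not the kernel patch: the
    struct and function names are hypothetical, r_count is assumed to count
    the bytes used in the block including the header, and tail_size stands
    for whatever space the checksum feature reserves at the end of the block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for an on-disk revoke block header. */
struct revoke_hdr {
    uint32_t r_count;   /* assumed: bytes used in the block, header included */
};

/*
 * Recovery side: return 1 if r_count is plausible for a block of
 * block_size bytes whose last tail_size bytes are reserved, else 0.
 */
static int revoke_count_ok(const struct revoke_hdr *hdr,
                           size_t block_size, size_t tail_size)
{
    assert(tail_size <= block_size);
    if (hdr->r_count < sizeof(*hdr))
        return 0;                       /* smaller than the header itself */
    if (hdr->r_count > block_size - tail_size)
        return 0;                       /* would read past the usable area */
    return 1;
}

/*
 * Write side: the block is full when the next record does not fit.
 * Testing "offset == usable" instead only works if the usable space
 * is a multiple of the record size.
 */
static int revoke_block_full(size_t offset, size_t record_size, size_t usable)
{
    return offset + record_size > usable;
}
```

    With a 4096-byte block, a 4-byte tail and 8-byte records starting at
    offset 16, the offset never equals 4092, so only the "does not fit" form
    of the test notices that the block is full.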

    Signed-off-by: Darrick J. Wong
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Jan Kara
    Cc: stable@vger.kernel.org

    Darrick J. Wong
     
  • The following commit introduced a bug when checking for zero length extent

    5946d08 ext4: check for overlapping extents in ext4_valid_extent_entries()

    Zero length extent could pass the check if lblock is zero.

    Adding the explicit check for zero length back.
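
    A minimal sketch of why only lblock == 0 slips through, assuming the
    range check has the shape "last = lblock + len - 1; reject if lblock >
    last" on 32-bit unsigned block numbers. The function names are
    hypothetical and this is not the ext4 code itself.

```c
#include <assert.h>
#include <stdint.h>

/* Range check alone: assumed shape of the check that let the bug through. */
static int range_check_only(uint32_t lblock, uint32_t len)
{
    /* Wraps around to UINT32_MAX when lblock == 0 and len == 0. */
    uint32_t last = lblock + len - 1;

    return lblock <= last;
}

/* Same check with the explicit zero-length test put back in front. */
static int range_check_with_len(uint32_t lblock, uint32_t len)
{
    if (len == 0)
        return 0;
    return range_check_only(lblock, len);
}
```

    For a zero-length extent at any non-zero lblock, last is lblock - 1 and
    the range check already rejects it; at lblock 0 the subtraction wraps
    and the extent looks valid unless the length is tested explicitly.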

    Signed-off-by: Eryu Guan
    Signed-off-by: Theodore Ts'o
    Cc: stable@vger.kernel.org

    Eryu Guan
     
  • Currently when journal restart fails, we'll have the h_transaction of
    the handle set to NULL to indicate that the handle has been effectively
    aborted. We handle this situation quietly in the jbd2_journal_stop() and just
    free the handle and exit because everything else has been done before we
    attempted (and failed) to restart the journal.

    Unfortunately there are a number of problems with that approach
    introduced with commit

    41a5b913197c "jbd2: invalidate handle if jbd2_journal_restart()
    fails"

    First of all in ext4 jbd2_journal_stop() will be called through
    __ext4_journal_stop() where we would try to get a hold of the superblock
    by dereferencing h_transaction which in this case would lead to NULL
    pointer dereference and crash.

    In addition we're going to free the handle regardless of the refcount
    which is bad as well, because others up the call chain will still
    reference the handle so we might potentially reference already freed
    memory.

    Moreover it's expected that we'll get an aborted handle as well as a
    detached handle in some of the journalling functions as the error
    propagates up the stack, so it's unnecessary to call WARN_ON every time
    we get a detached handle.

    And finally we might leak some memory by forgetting to free reserved
    handle in jbd2_journal_stop() in the case where handle was detached from
    the transaction (h_transaction is NULL).

    Fix the NULL pointer dereference in __ext4_journal_stop() by just
    calling jbd2_journal_stop() quietly as suggested by Jan Kara. Also fix
    the potential memory leak in jbd2_journal_stop() and use proper
    handle refcounting before we attempt to free it to avoid use-after-free
    issues.

    And finally remove all WARN_ON(!transaction) from the code so that we do
    not get random traces when something goes wrong because when journal
    restart fails we will get to some of those functions.

    Cc: stable@vger.kernel.org
    Signed-off-by: Lukas Czerner
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Jan Kara

    Lukas Czerner
     
  • The ext4_extent_tree_init() function hasn't been in the ext4 code for
    a long time, except in an unused function prototype in ext4.h.

    Google-Bug-Id: 4530137
    Signed-off-by: Theodore Ts'o

    Theodore Ts'o
     
  • Google-Bug-Id: 20939131
    Signed-off-by: Theodore Ts'o

    Theodore Ts'o
     
  • We had a fencepost error in the lazytime optimization which means that
    timestamp would get written to the wrong inode.

    Cc: stable@vger.kernel.org
    Signed-off-by: Theodore Ts'o

    Theodore Ts'o
     

13 May, 2015

1 commit

  • On architectures where the stack grows upwards (CONFIG_STACK_GROWSUP=y,
    currently parisc and metag only) stack randomization sometimes leads to crashes
    when the stack ulimit is set to lower values than STACK_RND_MASK (which is 8 MB
    by default if not defined in arch-specific headers).

    The problem is, that when the stack vm_area_struct is set up in fs/exec.c, the
    additional space needed for the stack randomization (as defined by the value of
    STACK_RND_MASK) was not taken into account yet and as such, when the stack
    randomization code added a random offset to the stack start, the stack
    effectively got smaller than what the user defined via rlimit_max(RLIMIT_STACK)
    which then sometimes leads to out-of-stack situations and crashes.

    This patch fixes it by adding the maximum possible amount of memory (based on
    STACK_RND_MASK) which theoretically could be added by the stack randomization
    code to the initial stack size. That way, the user-defined stack size is always
    guaranteed to be at minimum what is defined via rlimit_max(RLIMIT_STACK).
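
    The arithmetic behind the fix can be sketched as follows. This is an
    illustration only: the macro values (4 KiB pages, a mask of 0x7ff pages,
    which gives just under 8 MB) and all function names are assumptions made
    for the example, not taken from fs/exec.c.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT     12        /* assumed 4 KiB pages */
#define SKETCH_STACK_RND_MASK 0x7ffUL   /* assumed mask, in pages */

/* Largest offset the randomization could add, in bytes. */
static uint64_t max_rnd_bytes(void)
{
    return (uint64_t)SKETCH_STACK_RND_MASK << SKETCH_PAGE_SHIFT;
}

/* Initial stack reservation: the user's limit plus the worst-case offset. */
static uint64_t initial_stack_bytes(uint64_t rlimit_stack)
{
    return rlimit_stack + max_rnd_bytes();
}

/* Stack left for the program once a random offset has been consumed. */
static uint64_t usable_after_rnd(uint64_t reserved, uint64_t rnd_offset)
{
    return reserved - rnd_offset;
}
```

    Even when the random offset takes its maximum value, the usable stack
    is still the full rlimit, which is the guarantee described above.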

    This bug is currently not visible on the metag architecture, because on metag
    STACK_RND_MASK is defined to 0 which effectively disables stack randomization.

    The changes to fs/exec.c are inside an "#ifdef CONFIG_STACK_GROWSUP"
    section, so it does not affect other platforms beside those where the
    stack grows upwards (parisc and metag).

    Signed-off-by: Helge Deller
    Cc: linux-parisc@vger.kernel.org
    Cc: James Hogan
    Cc: linux-metag@vger.kernel.org
    Cc: stable@vger.kernel.org # v3.16+

    Helge Deller
     

12 May, 2015

1 commit

  • Pull nfsd bugfixes from Bruce Fields:
    "Mainly pnfs fixes (and for problems with generic callback code made
    more obvious by pnfs)"

    * 'for-4.1' of git://linux-nfs.org/~bfields/linux:
    nfsd: skip CB_NULL probes for 4.1 or later
    nfsd: fix callback restarts
    nfsd: split transport vs operation errors for callbacks
    svcrpc: fix potential GSSX_ACCEPT_SEC_CONTEXT decoding failures
    nfsd: fix pNFS return on close semantics
    nfsd: fix the check for confirmed openowner in nfs4_preprocess_stateid_op
    nfsd/blocklayout: pretend we can send deviceid notifications

    Linus Torvalds
     

11 May, 2015

4 commits

  • There's a race between releasing extent buffers that are flagged as stale
    and recycling them that makes us hit the following BUG_ON at
    btrfs_release_extent_buffer_page:

    BUG_ON(extent_buffer_under_io(eb))

    The BUG_ON is triggered because the extent buffer has the flag
    EXTENT_BUFFER_DIRTY set as a consequence of having been reused and made
    dirty by another concurrent task.

    Here follows a sequence of steps that leads to the BUG_ON.

    (steps are grouped by CPU, in chronological order)

    [CPU 0]
    path->nodes[0] == eb X
    X->refs == 2 (1 for the tree, 1 for the path)
    btrfs_header_generation(X) == current trans id
    flag EXTENT_BUFFER_DIRTY set on X

    btrfs_release_path(path)
        unlocks X

    [CPU 1]
    reads eb X
        X->refs incremented to 3
    locks eb X
    btrfs_del_items(X)
        X becomes empty
        clean_tree_block(X)
            clear EXTENT_BUFFER_DIRTY from X
        btrfs_del_leaf(X)
            unlocks X
            extent_buffer_get(X)
                X->refs incremented to 4
            btrfs_free_tree_block(X)
                X's range is not pinned
                X's range added to free space cache
            free_extent_buffer_stale(X)
                lock X->refs_lock
                set EXTENT_BUFFER_STALE on X
                release_extent_buffer(X)
                    X->refs decremented to 3
                    unlocks X->refs_lock
    btrfs_release_path()
        unlocks X
        free_extent_buffer(X)
            X->refs becomes 2

    [CPU 2]
    __btrfs_cow_block(Y)
        btrfs_alloc_tree_block()
            btrfs_reserve_extent()
                find_free_extent()
                    gets offset == X->start
            btrfs_init_new_buffer(X->start)
                btrfs_find_create_tree_block(X->start)
                    alloc_extent_buffer(X->start)
                        find_extent_buffer(X->start)
                            finds eb X in radix tree

    [CPU 0]
    free_extent_buffer(X)
        lock X->refs_lock
            test X->refs == 2
            test bit EXTENT_BUFFER_STALE is set
            test !extent_buffer_under_io(eb)

    [CPU 2]
    increments X->refs to 3
    mark_extent_buffer_accessed(X)
        check_buffer_tree_ref(X)
            --> does nothing, X->refs >= 2 and
                EXTENT_BUFFER_TREE_REF is set in X
    clear EXTENT_BUFFER_STALE from X
    locks X
    btrfs_mark_buffer_dirty()
        set_extent_buffer_dirty(X)
            check_buffer_tree_ref(X)
                --> does nothing, X->refs >= 2 and
                    EXTENT_BUFFER_TREE_REF is set
            sets EXTENT_BUFFER_DIRTY on X

    [CPU 0]
    test and clear EXTENT_BUFFER_TREE_REF
    decrements X->refs to 2
    release_extent_buffer(X)
        decrements X->refs to 1
        unlock X->refs_lock

    [CPU 2]
    unlock X
    free_extent_buffer(X)
        lock X->refs_lock
        release_extent_buffer(X)
            decrements X->refs to 0
            btrfs_release_extent_buffer_page(X)
                BUG_ON(extent_buffer_under_io(X))
                --> EXTENT_BUFFER_DIRTY set on X

    Fix this by making find_extent_buffer() wait for any ongoing task currently
    executing free_extent_buffer()/free_extent_buffer_stale() if the extent
    buffer has the stale flag set.

    A cleaner alternative would be to always increment the extent buffer's
    reference count while holding its refs_lock spinlock, but
    find_extent_buffer() is a performance critical area and that would cause
    lock contention whenever multiple tasks search for the same extent buffer
    concurrently.
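
    The "wait only if stale" idea can be sketched in user-space C11. This is
    a simplified illustration, not the btrfs code: the struct, the spinlock
    helpers and sketch_find() are hypothetical names, and a tiny atomic_flag
    spinlock stands in for refs_lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical, much simplified stand-in for an extent buffer. */
struct sketch_eb {
    atomic_flag refs_lock;   /* tiny spinlock guarding the release path */
    atomic_int  refs;
    atomic_bool stale;
};

static void sketch_lock(struct sketch_eb *eb)
{
    while (atomic_flag_test_and_set_explicit(&eb->refs_lock,
                                             memory_order_acquire))
        ;   /* spin until the current holder unlocks */
}

static void sketch_unlock(struct sketch_eb *eb)
{
    atomic_flag_clear_explicit(&eb->refs_lock, memory_order_release);
}

/*
 * Lookup path: take a reference without the lock in the common case.
 * If the buffer is flagged stale, a release may be in flight, so lock
 * and immediately unlock refs_lock to wait for it to finish before the
 * caller goes on to reuse the buffer.
 */
static struct sketch_eb *sketch_find(struct sketch_eb *eb)
{
    atomic_fetch_add(&eb->refs, 1);
    if (atomic_load(&eb->stale)) {
        sketch_lock(eb);
        sketch_unlock(eb);
    }
    return eb;
}
```

    Buffers that are not stale never touch the lock, which is how this
    approach avoids the contention that always taking refs_lock would cause.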

    A build server running a SLES 12 kernel (3.12 kernel + over 450 upstream
    btrfs patches backported from newer kernels) was hitting this often:

    [1212302.461948] kernel BUG at ../fs/btrfs/extent_io.c:4507!
    (...)
    [1212302.470219] CPU: 1 PID: 19259 Comm: bs_sched Not tainted 3.12.36-38-default #1
    [1212302.540792] Hardware name: Supermicro PDSM4/PDSM4, BIOS 6.00 04/17/2006
    [1212302.540792] task: ffff8800e07e0100 ti: ffff8800d6412000 task.ti: ffff8800d6412000
    [1212302.540792] RIP: 0010:[] [] btrfs_release_extent_buffer_page.constprop.51+0x101/0x110 [btrfs]
    (...)
    [1212302.630008] Call Trace:
    [1212302.630008] [] release_extent_buffer+0x3d/0xa0 [btrfs]
    [1212302.630008] [] btrfs_release_path+0x1d/0xa0 [btrfs]
    [1212302.630008] [] read_block_for_search.isra.33+0x13e/0x3a0 [btrfs]
    [1212302.630008] [] btrfs_search_slot+0x3f4/0xa80 [btrfs]
    [1212302.630008] [] lookup_inline_extent_backref+0xf8/0x630 [btrfs]
    [1212302.630008] [] __btrfs_free_extent+0x11d/0xc40 [btrfs]
    [1212302.630008] [] __btrfs_run_delayed_refs+0x394/0x11d0 [btrfs]
    [1212302.630008] [] btrfs_run_delayed_refs.part.66+0x69/0x280 [btrfs]
    [1212302.630008] [] __btrfs_end_transaction+0x2ad/0x3d0 [btrfs]
    [1212302.630008] [] btrfs_evict_inode+0x4a5/0x500 [btrfs]
    [1212302.630008] [] evict+0xa8/0x190
    [1212302.630008] [] do_unlinkat+0x1a0/0x2b0

    I was also able to reproduce this on a 3.19 kernel, corresponding to Chris'
    integration branch from about a month ago, running the following stress
    test on a qemu/kvm guest (with 4 virtual cpus and 16Gb of ram):

    while true; do
        mkfs.btrfs -l 4096 -f -b `expr 20 \* 1024 \* 1024 \* 1024` /dev/sdd
        mount /dev/sdd /mnt
        snapshot_cmd="btrfs subvolume snapshot -r /mnt"
        snapshot_cmd="$snapshot_cmd /mnt/snap_\`date +'%H_%M_%S_%N'\`"
        fsstress -d /mnt -n 25000 -p 8 -x "$snapshot_cmd" -X 100
        umount /mnt
    done

    Which usually triggers the BUG_ON within less than 24 hours:

    [49558.618097] ------------[ cut here ]------------
    [49558.619732] kernel BUG at fs/btrfs/extent_io.c:4551!
    (...)
    [49558.620031] CPU: 3 PID: 23908 Comm: fsstress Tainted: G W 3.19.0-btrfs-next-7+ #3
    [49558.620031] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
    [49558.620031] task: ffff8800319fc0d0 ti: ffff880220da8000 task.ti: ffff880220da8000
    [49558.620031] RIP: 0010:[] [] btrfs_release_extent_buffer_page+0x20/0xe9 [btrfs]
    (...)
    [49558.620031] Call Trace:
    [49558.620031] [] release_extent_buffer+0x90/0xd3 [btrfs]
    [49558.620031] [] ? _raw_spin_lock+0x3b/0x43
    [49558.620031] [] ? free_extent_buffer+0x37/0x94 [btrfs]
    [49558.620031] [] free_extent_buffer+0x90/0x94 [btrfs]
    [49558.620031] [] btrfs_release_path+0x4a/0x69 [btrfs]
    [49558.620031] [] __btrfs_free_extent+0x778/0x80c [btrfs]
    [49558.620031] [] __btrfs_run_delayed_refs+0xad2/0xc62 [btrfs]
    [49558.728054] [] ? kmemleak_alloc_recursive.constprop.52+0x16/0x18
    [49558.728054] [] btrfs_run_delayed_refs+0x6d/0x1ba [btrfs]
    [49558.728054] [] ? join_transaction.isra.9+0xb9/0x36b [btrfs]
    [49558.728054] [] btrfs_commit_transaction+0x4c/0x981 [btrfs]
    [49558.728054] [] btrfs_sync_fs+0xd5/0x10d [btrfs]
    [49558.728054] [] ? iterate_supers+0x60/0xc4
    [49558.728054] [] ? do_sync_work+0x91/0x91
    [49558.728054] [] sync_fs_one_sb+0x20/0x22
    [49558.728054] [] iterate_supers+0x76/0xc4
    [49558.728054] [] sys_sync+0x55/0x83
    [49558.728054] [] system_call_fastpath+0x12/0x17

    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • So creating a block group has 2 distinct phases:

    Phase 1 - creates the btrfs_block_group_cache item and adds it to the
    rbtree fs_info->block_group_cache_tree and to the corresponding list
    space_info->block_groups[];

    Phase 2 - adds the block group item to the extent tree and corresponding
    items to the chunk tree.

    The first phase adds the block_group_cache_item to a list of pending block
    groups in the transaction handle, and phase 2 happens when
    btrfs_end_transaction() is called against the transaction handle.

    It happens that once phase 1 completes, other concurrent tasks that use
    their own transaction handle, but point to the same running transaction
    (struct btrfs_trans_handle->transaction), can use this block group for
    space allocations and therefore mark it dirty. Dirty block groups are
    tracked in a list belonging to the currently running transaction (struct
    btrfs_transaction) and not in the transaction handle (btrfs_trans_handle).

    This is a problem because once a task calls btrfs_commit_transaction(),
    it calls btrfs_start_dirty_block_groups() which will see all dirty block
    groups and attempt to start their writeout, including those that are
    still attached to the transaction handle of some concurrent task that
    hasn't called btrfs_end_transaction() yet. Those block groups haven't
    gone through phase 2 yet, so when write_one_cache_group() is called, it
    won't find the block group items in the extent tree and will abort the
    current transaction with -ENOENT, turning the fs into readonly mode and
    requiring a remount.

    Fix this by ignoring -ENOENT when looking for block group items in the
    extent tree when we attempt to start the writeout of the block group
    caches outside the critical section of the transaction commit. We will
    try again later during the critical section and if there we still don't
    find the block group item in the extent tree, we then abort the current
    transaction.
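
    The error-handling rule above can be sketched as follows. This is an
    illustration under assumptions, not the btrfs code: struct sketch_bg and
    both functions are hypothetical, and "on_disk" simply models whether
    phase 2 has already inserted the block group item.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a block group; on_disk means phase 2 has run. */
struct sketch_bg {
    bool on_disk;
};

/* Hypothetical writer: fails with -ENOENT if the item is not in the tree. */
static int sketch_write_item(const struct sketch_bg *bg)
{
    return bg->on_disk ? 0 : -ENOENT;
}

/*
 * Returns 0 if the commit may continue, or a negative errno if the
 * transaction has to be aborted. A missing item is tolerated only
 * outside the critical section, where a later pass will retry.
 */
static int sketch_writeout(const struct sketch_bg *bg, bool in_critical_section)
{
    int ret = sketch_write_item(bg);

    if (ret == -ENOENT && !in_critical_section)
        return 0;   /* phase 2 may still be pending; retry later */
    return ret;
}
```

    Any other error, or -ENOENT seen inside the critical section, is still
    returned to the caller and leads to the abort.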

    This issue happened twice, once while running fstests btrfs/067 and once
    for btrfs/078, which produced the following trace:

    [ 3278.703014] WARNING: CPU: 7 PID: 18499 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x114 [btrfs]()
    [ 3278.707329] BTRFS: Transaction aborted (error -2)
    (...)
    [ 3278.731555] Call Trace:
    [ 3278.732396] [] dump_stack+0x4f/0x7b
    [ 3278.733860] [] ? console_unlock+0x361/0x3ad
    [ 3278.735312] [] warn_slowpath_common+0xa1/0xbb
    [ 3278.736874] [] ? __btrfs_abort_transaction+0x52/0x114 [btrfs]
    [ 3278.738302] [] warn_slowpath_fmt+0x46/0x48
    [ 3278.739520] [] __btrfs_abort_transaction+0x52/0x114 [btrfs]
    [ 3278.741222] [] write_one_cache_group+0xae/0xbf [btrfs]
    [ 3278.742797] [] btrfs_start_dirty_block_groups+0x170/0x2b2 [btrfs]
    [ 3278.744492] [] btrfs_commit_transaction+0x130/0x9c9 [btrfs]
    [ 3278.746084] [] ? trace_hardirqs_on+0xd/0xf
    [ 3278.747249] [] btrfs_sync_file+0x313/0x387 [btrfs]
    [ 3278.748744] [] vfs_fsync_range+0x95/0xa4
    [ 3278.749958] [] ? ret_from_sys_call+0x1d/0x58
    [ 3278.751218] [] vfs_fsync+0x1c/0x1e
    [ 3278.754197] [] do_fsync+0x34/0x4e
    [ 3278.755192] [] SyS_fsync+0x10/0x14
    [ 3278.756236] [] system_call_fastpath+0x12/0x17
    [ 3278.757366] ---[ end trace 9a4d4df4969709aa ]---

    Fixes: 1bbc621ef284 ("Btrfs: allow block group cache writeout
    outside critical section in commit")

    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • When waiting for the writeback of block group cache we returned
    immediately if there was an error during writeback without waiting
    for the ordered extent to complete. This left a short time window
    where if some other task attempts to start the writeout for the same
    block group cache it can attempt to add a new ordered extent, starting
    at the same offset (0) before the previous one is removed from the
    ordered tree, causing an ordered tree panic (calls BUG()).

    This normally doesn't happen in other write paths, such as buffered
    writes or direct IO writes for regular files, since before marking
    page ranges dirty we lock the ranges and wait for any ordered extents
    within the range to complete first.

    Fix this by making btrfs_wait_ordered_range() not return immediately
    if it gets an error from the writeback, waiting for all ordered extents
    to complete first.
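
    A minimal sketch of "remember the error, but keep waiting", assuming a
    simple array of ordered extents. The struct and function names are
    hypothetical and the blocking wait is reduced to setting a flag; this is
    not the btrfs_wait_ordered_range() implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical ordered extent: wait_result is what waiting on it reports. */
struct sketch_ordered {
    int wait_result;   /* 0 or a negative errno */
    int completed;     /* set once we have waited for it */
};

/*
 * Wait for every ordered extent in the array even if writeback already
 * failed, and return the first error seen (writeback_err takes priority).
 */
static int sketch_wait_range(struct sketch_ordered *oe, size_t n,
                             int writeback_err)
{
    int ret = writeback_err;

    for (size_t i = 0; i < n; i++) {
        oe[i].completed = 1;            /* stands in for the blocking wait */
        if (oe[i].wait_result && !ret)
            ret = oe[i].wait_result;
    }
    return ret;
}
```

    Because the loop always runs to the end, no ordered extent is left in
    the tree when the function returns, whatever the error was.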

    This issue happened often when running the fstest btrfs/088 and it's
    easy to trigger it by running in a loop until the panic happens:

    for ((i = 1; i (...)

    (...)
    [] btrfs_add_ordered_extent+0x12/0x14 [btrfs]
    [17156.864052] [] run_delalloc_nocow+0x5bf/0x747 [btrfs]
    [17156.864052] [] run_delalloc_range+0x95/0x353 [btrfs]
    [17156.864052] [] writepage_delalloc.isra.16+0xb9/0x13f [btrfs]
    [17156.864052] [] __extent_writepage+0x129/0x1f7 [btrfs]
    [17156.864052] [] extent_write_cache_pages.isra.15.constprop.28+0x231/0x2f4 [btrfs]
    [17156.864052] [] ? __module_text_address+0x12/0x59
    [17156.864052] [] ? trace_hardirqs_on+0xd/0xf
    [17156.864052] [] extent_writepages+0x4b/0x5c [btrfs]
    [17156.864052] [] ? kmem_cache_free+0x9b/0xce
    [17156.864052] [] ? btrfs_submit_direct+0x3fc/0x3fc [btrfs]
    [17156.864052] [] ? free_extent_state+0x8c/0xc1 [btrfs]
    [17156.864052] [] btrfs_writepages+0x28/0x2a [btrfs]
    [17156.864052] [] do_writepages+0x23/0x2c
    [17156.864052] [] __filemap_fdatawrite_range+0x5a/0x61
    [17156.864052] [] filemap_fdatawrite_range+0x13/0x15
    [17156.864052] [] btrfs_fdatawrite_range+0x21/0x48 [btrfs]
    [17156.864052] [] __btrfs_write_out_cache.isra.14+0x2d9/0x3a7 [btrfs]
    [17156.864052] [] ? btrfs_write_out_cache+0x41/0xdc [btrfs]
    [17156.864052] [] btrfs_write_out_cache+0x93/0xdc [btrfs]
    [17156.864052] [] ? btrfs_start_dirty_block_groups+0x13a/0x2b2 [btrfs]
    [17156.864052] [] btrfs_start_dirty_block_groups+0x1d9/0x2b2 [btrfs]
    [17156.864052] [] ? trace_hardirqs_on+0xd/0xf
    [17156.864052] [] btrfs_commit_transaction+0x130/0x9c9 [btrfs]
    [17156.864052] [] btrfs_sync_fs+0xe1/0x12d [btrfs]

    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • If the writeback of an inode cache failed we were unnecessarily
    attempting to release again the delalloc metadata that we previously
    reserved. However attempting to do this a second time triggers an
    assertion at drop_outstanding_extent() because we have no more
    outstanding extents for our inode cache's inode. If we were able
    to start writeback of the cache the reserved metadata space is
    released at btrfs_finished_ordered_io(), even if an error happens
    during writeback.

    So make sure we don't repeat the metadata space release if writeback
    started for our inode cache.

    This issue was trivial to reproduce by running the fstest btrfs/088
    with "-o inode_cache", which triggered the assertion leading to a
    BUG() call and requiring a reboot in order to run the remaining
    fstests. Trace produced by btrfs/088:

    [255289.385904] BTRFS: assertion failed: BTRFS_I(inode)->outstanding_extents >= num_extents, file: fs/btrfs/extent-tree.c, line: 5276
    [255289.388094] ------------[ cut here ]------------
    [255289.389184] kernel BUG at fs/btrfs/ctree.h:4057!
    [255289.390125] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    (...)
    [255289.392068] Call Trace:
    [255289.392068] [] drop_outstanding_extent+0x3d/0x6d [btrfs]
    [255289.392068] [] btrfs_delalloc_release_metadata+0x54/0xe3 [btrfs]
    [255289.392068] [] btrfs_write_out_ino_cache+0x95/0xad [btrfs]
    [255289.392068] [] btrfs_save_ino_cache+0x275/0x2dc [btrfs]
    [255289.392068] [] commit_fs_roots.isra.12+0xaa/0x137 [btrfs]
    [255289.392068] [] ? trace_hardirqs_on+0xd/0xf
    [255289.392068] [] ? btrfs_commit_transaction+0x4b1/0x9c9 [btrfs]
    [255289.392068] [] ? _raw_spin_unlock+0x32/0x46
    [255289.392068] [] btrfs_commit_transaction+0x4c0/0x9c9 [btrfs]
    (...)

    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     

10 May, 2015

2 commits


09 May, 2015

5 commits

  • Pull vfs fixes from Al Viro:
    "A couple of fixes for bugs caught while digging in fs/namei.c. The
    first one is this cycle regression, the second is 3.11 and later"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    path_openat(): fix double fput()
    namei: d_is_negative() should be checked before ->d_seq validation

    Linus Torvalds
     
  • path_openat() jumps to the wrong place after do_tmpfile() - it has
    already done path_cleanup() (as part of path_lookupat() called by
    do_tmpfile()), so doing that again can lead to double fput().

    Cc: stable@vger.kernel.org # v3.11+
    Signed-off-by: Al Viro

    Al Viro
     
  • Fetching ->d_inode, verifying ->d_seq and finding d_is_negative() to
    be true does *not* mean that inode we'd fetched had been NULL - that
    holds only while ->d_seq is still unchanged.

    Shift d_is_negative() checks into lookup_fast() prior to ->d_seq
    verification.

    Reported-by: Steven Rostedt
    Tested-by: Steven Rostedt
    Signed-off-by: Al Viro

    Al Viro
     
  • Pull btrfs fix from Chris Mason:
    "When an arm user reported crashes near page_address(page) in my new
    code, it became clear that I can't be trusted with GFP masks. Filipe
    beat me to the patch, and I'll just be in the corner with my dunce cap
    on"

    * 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    Btrfs: fix wrong mapping flags for free space inode

    Linus Torvalds
     
  • Pull block fixes from Jens Axboe:
    "A collection of fixes since the merge window;

    - fix for a double elevator module release, from Chao Yu. Ancient bug.

    - the splice() MORE flag fix from Christophe Leroy.

    - a fix for NVMe, fixing a patch that went in in the merge window.
    From Keith.

    - two fixes for blk-mq CPU hotplug handling, from Ming Lei.

    - bdi vs blockdev lifetime fix from Neil Brown, fixing an oops in md.

    - two blk-mq fixes from Shaohua, fixing a race on queue stop and a
    bad merge issue with FUA writes.

    - division-by-zero fix for writeback from Tejun.

    - a block bounce page accounting fix, making sure we inc/dec after
    bouncing so that pre/post IO pages match up. From Wang YanQing"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    splice: sendfile() at once fails for big files
    blk-mq: don't lose requests if a stopped queue restarts
    blk-mq: fix FUA request hang
    block: destroy bdi before blockdev is unregistered.
    block:bounce: fix call inc_|dec_zone_page_state on different pages confuse value of NR_BOUNCE
    elevator: fix double release of elevator module
    writeback: use |1 instead of +1 to protect against div by zero
    blk-mq: fix CPU hotplug handling
    blk-mq: fix race between timeout and CPU hotplug
    NVMe: Fix VPD B0 max sectors translation

    Linus Torvalds
     

08 May, 2015

1 commit


07 May, 2015

2 commits

  • We were passing a flags value that differed from the intention in commit
    2b108268006e ("Btrfs: don't use highmem for free space cache pages").

    This caused problems on an ARM machine, leaving btrfs unusable there.

    Reported-by: Merlijn Wajer
    Tested-by: Merlijn Wajer
    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • Pull x86 fixes from Ingo Molnar:
    "EFI fixes, an FPU fix, a ticket spinlock boundary condition fix and
    two build fixes"

    * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/fpu: Always restore_xinit_state() when use_eager_cpu()
    x86: Make cpu_tss available to external modules
    efi: Fix error handling in add_sysfs_runtime_map_entry()
    x86/spinlocks: Fix regression in spinlock contention detection
    x86/mm: Clean up types in xlate_dev_mem_ptr()
    x86/efi: Store upper bits of command line buffer address in ext_cmd_line_ptr
    efivarfs: Ensure VariableName is NUL-terminated

    Linus Torvalds
     

06 May, 2015

5 commits

  • When using sendfile() with the small program below to get MD5 sums of
    some files, it appears that big files (over 64 kbytes on a system with
    4k pages) get a wrong MD5 sum while small files get the correct sum.
    This program uses sendfile() to send a file to an AF_ALG socket
    for hashing.

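    /* md5sum2.c */
    /*
     * The header names were lost from this listing; the ones below are
     * those the code needs, not necessarily the original list.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <sys/socket.h>
    #include <sys/sendfile.h>
    #include <linux/if_alg.h>

    int main(int argc, char **argv)
    {
        int sk = socket(AF_ALG, SOCK_SEQPACKET, 0);
        struct stat st;
        struct sockaddr_alg sa = {
            .salg_family = AF_ALG,
            .salg_type = "hash",
            .salg_name = "md5",
        };
        int n;

        bind(sk, (struct sockaddr*)&sa, sizeof(sa));

        for (n = 1; n < argc; n++) {
            int size;
            off_t offset = 0;           /* sendfile() takes an off_t pointer */
            unsigned char buf[4096];    /* unsigned so %x prints one byte */
            int fd;
            int sko;
            int i;

            fd = open(argv[n], O_RDONLY);
            sko = accept(sk, NULL, 0);
            fstat(fd, &st);
            size = st.st_size;
            /* Feed the whole file to the hash socket in one call. */
            sendfile(sko, fd, &offset, size);
            /* Read back the digest and print it in hex. */
            size = read(sko, buf, sizeof(buf));
            for (i = 0; i < size; i++)
                printf("%2.2x", buf[i]);
            printf(" %s\n", argv[n]);
            close(fd);
            close(sko);
        }
        exit(0);
    }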

    Test below is done using official linux patch files. First result is
    with a software based md5sum. Second result is with the program above.

    root@vgoip:~# ls -l patch-3.6.*
    -rw-r--r-- 1 root root 64011 Aug 24 12:01 patch-3.6.2.gz
    -rw-r--r-- 1 root root 94131 Aug 24 12:01 patch-3.6.3.gz

    root@vgoip:~# md5sum patch-3.6.*
    b3ffb9848196846f31b2ff133d2d6443 patch-3.6.2.gz
    c5e8f687878457db77cb7158c38a7e43 patch-3.6.3.gz

    root@vgoip:~# ./md5sum2 patch-3.6.*
    b3ffb9848196846f31b2ff133d2d6443 patch-3.6.2.gz
    5fd77b24e68bb24dcc72d6e57c64790e patch-3.6.3.gz

    After investigation, it appears that sendfile() sends the files by blocks
    of 64kbytes (16 times PAGE_SIZE). The problem is that at the end of each
    block, the SPLICE_F_MORE flag is missing, therefore the hashing operation
    is reset as if it was the end of the file.

    This patch adds SPLICE_F_MORE to the flags when more data is pending.
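
    The per-block flag decision can be sketched as below. This is a
    simplified illustration, not the splice code: the function name and both
    macros are hypothetical, with SKETCH_F_MORE standing in for
    SPLICE_F_MORE and the chunk size taken from the 64 kbytes described
    above.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_CHUNK  (16 * 4096)   /* 64 KiB: 16 pages of 4 KiB */
#define SKETCH_F_MORE 0x04          /* stand-in for SPLICE_F_MORE */

/*
 * Flags for the chunk that starts at offset 'done' of a 'total'-byte
 * transfer: the "more" hint is set unless this chunk is the last one.
 */
static unsigned int sketch_chunk_flags(size_t done, size_t total)
{
    size_t left = total - done;

    return left > SKETCH_CHUNK ? SKETCH_F_MORE : 0;
}
```

    With the file sizes from the listing above, the 64011-byte file fits in
    a single chunk and never needs the hint, while the 94131-byte file needs
    it on its first chunk so that the hash is not finalized after 64 kbytes.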

    With the patch applied, we get the correct sums:

    root@vgoip:~# md5sum patch-3.6.*
    b3ffb9848196846f31b2ff133d2d6443 patch-3.6.2.gz
    c5e8f687878457db77cb7158c38a7e43 patch-3.6.3.gz

    root@vgoip:~# ./md5sum2 patch-3.6.*
    b3ffb9848196846f31b2ff133d2d6443 patch-3.6.2.gz
    c5e8f687878457db77cb7158c38a7e43 patch-3.6.3.gz

    Signed-off-by: Christophe Leroy
    Signed-off-by: Jens Axboe

    Christophe Leroy
     
  • Pull EFI fixes from Matt Fleming:

    * Avoid garbage names in efivarfs due to buggy firmware by zeroing
    EFI variable name. (Ross Lagerwall)

    * Stop erroneously dropping upper 32 bits of boot command line pointer
    in EFI boot stub and stash them in ext_cmd_line_ptr. (Roy Franz)

    * Fix double-free bug in error handling code path of EFI runtime map
    code. (Dan Carpenter)

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • There is a race window in dlm_get_lock_resource(), which may return a
    lock resource that has already been purged. The process then hangs
    forever in dlmlock(), because the AST message cannot be handled once its
    lock resource no longer exists.

    dlm_get_lock_resource {
        ...
        spin_lock(&dlm->spinlock);
        tmpres = __dlm_lookup_lockres_full(dlm, lockid, namelen, hash);
        if (tmpres) {
            spin_unlock(&dlm->spinlock);
            >>>>>>>> race window, dlm_run_purge_list() may run and purge
                     the lock resource
            spin_lock(&tmpres->spinlock);
            ...
            spin_unlock(&tmpres->spinlock);
        }
    }
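
    The message does not spell out the fix. One common way to close a window
    of this kind, shown here only as a user-space sketch with pthread mutexes
    and not as the actual ocfs2 change, is to re-check that the object is
    still live after taking its own lock. All names (sketch_res,
    sketch_table, sketch_get, sketch_purge) are hypothetical, and object
    lifetime (reference counting) is deliberately left out.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical resource; `hashed` becomes false once it has been purged. */
struct sketch_res {
    pthread_mutex_t lock;
    bool hashed;
};

/* Hypothetical single-slot "hash table" protected by its own lock. */
struct sketch_table {
    pthread_mutex_t lock;
    struct sketch_res *slot;
};

/* Purge the resource: mark it dead under its lock, then unhash it. */
void sketch_purge(struct sketch_table *t)
{
    pthread_mutex_lock(&t->lock);
    if (t->slot) {
        pthread_mutex_lock(&t->slot->lock);
        t->slot->hashed = false;
        pthread_mutex_unlock(&t->slot->lock);
        t->slot = NULL;
    }
    pthread_mutex_unlock(&t->lock);
}

/*
 * Look up the resource. On success it is returned with res->lock held.
 * NULL means "absent or purged"; a caller would then start over.
 */
struct sketch_res *sketch_get(struct sketch_table *t)
{
    struct sketch_res *res;

    pthread_mutex_lock(&t->lock);
    res = t->slot;
    pthread_mutex_unlock(&t->lock);
    if (!res)
        return NULL;

    /* Race window: sketch_purge() may run between the two locks. */
    pthread_mutex_lock(&res->lock);
    if (!res->hashed) {
        /* Purged in the window: do not hand out a stale resource. */
        pthread_mutex_unlock(&res->lock);
        return NULL;
    }
    return res;
}
```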

    Signed-off-by: Junxiao Bi
    Cc: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junxiao Bi
     
  • The range check for b-tree level parameter in nilfs_btree_root_broken()
    is wrong; it accepts the case of "level == NILFS_BTREE_LEVEL_MAX" even
    though the level is limited to values in the range of 0 to
    (NILFS_BTREE_LEVEL_MAX - 1).

    Since the level parameter is read from storage device and used to index
    nilfs_btree_path array whose element count is NILFS_BTREE_LEVEL_MAX, it
    can cause memory overrun during btree operations if the boundary value
    is set to the level parameter on device.

    This fixes the broken sanity check and adds a comment to clarify that
    the upper bound NILFS_BTREE_LEVEL_MAX is exclusive.
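
    A minimal sketch of an exclusive upper-bound check of this kind follows.
    It is not the nilfs2 code: SKETCH_LEVEL_MAX, its value, and the helper
    names are hypothetical, and the valid range 0 to (max - 1) is taken from
    the description above.

```c
#include <stdbool.h>

/* Hypothetical limit; valid levels are 0 .. SKETCH_LEVEL_MAX - 1. */
#define SKETCH_LEVEL_MAX 14

struct sketch_path_entry {
    int index;
};

/* The upper bound is exclusive: level == SKETCH_LEVEL_MAX is rejected. */
bool sketch_level_in_range(int level)
{
    return level >= 0 && level < SKETCH_LEVEL_MAX;
}

/*
 * Store `index` at `level` in a path array of SKETCH_LEVEL_MAX entries.
 * A level read from an untrusted source is validated before indexing.
 */
bool sketch_path_set(struct sketch_path_entry path[SKETCH_LEVEL_MAX],
                     int level, int index)
{
    if (!sketch_level_in_range(level))
        return false;
    path[level].index = index;
    return true;
}
```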

    Signed-off-by: Ryusuke Konishi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ryusuke Konishi
     
  • We need this earlier in the boot process to allow various subsystems to
    use configfs (e.g. IIO, the Industrial I/O subsystem).

    Also, debugfs is at core_initcall level and configfs should be on the same
    level from an infrastructure point of view.

    Signed-off-by: Daniel Baluta
    Suggested-by: Lars-Peter Clausen
    Reviewed-by: Christoph Hellwig
    Cc: Al Viro
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Baluta
     

05 May, 2015

7 commits

  • page_follow_link_light() returns NULL, and its error pointer was left
    in nd->path.

    Reported-by: Dan Carpenter
    Reviewed-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     
  • Yuanhan Liu reported a performance regression caused by this commit.
    The basic idea was to reduce contention on a single mutex, but it turns
    out that this causes other contention, such as context switches.

    https://lkml.org/lkml/2015/4/21/11

    Until the analysis of this issue is finished, I'd like to revert this for now.

    This reverts commit 78373b7319abdf15050af5b1632c4c8b8b398f33.

    Jaegeuk Kim
     
  • With sessions in v4.1 or later we don't need to manually probe the backchannel
    connection, so we can declare it up instantly after setting up the RPC client.

    Note that we really should split nfsd4_run_cb_work in the long run; this is
    just the least intrusive fix for now.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: J. Bruce Fields

    Christoph Hellwig
     
  • Checking the rpc_client pointer is not a reliable way to detect
    backchannel changes: cl_cb_client is changed only after shutting down
    the rpc client, so the condition cl_cb_client == tk_client will always be
    true.

    Check the RPC_TASK_KILLED flag instead, and rewrite the code to avoid
    the buggy cl_callbacks list and fix the lifetime rules due to double
    calls of the ->prepare callback operations method for this retry case.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: J. Bruce Fields

    Christoph Hellwig
     
  • We must only increment the sequence id if the client has seen and responded
    to a request. If we failed to deliver it to the client, we must resend with
    the same sequence id. So, just like the client, track errors at the
    transport level differently from those returned in the XDR.
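
    The rule can be sketched as follows. This is an illustration under
    assumed names (sketch_session, sketch_cb_result, sketch_cb_done), not the
    nfsd implementation: only a reply from the client advances the sequence
    id, while a transport-level failure asks for a resend with the same id.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical outcome of one callback attempt. */
enum sketch_cb_result {
    SKETCH_CB_OK,              /* client replied, operation succeeded */
    SKETCH_CB_OP_ERROR,        /* client replied, error carried in the reply */
    SKETCH_CB_TRANSPORT_ERROR  /* request was not delivered to the client */
};

/* Hypothetical per-session state: the next sequence id to use. */
struct sketch_session {
    uint32_t seq_nr;
};

/*
 * Handle the outcome of a callback. Returns true when the request must be
 * resent; in that case the sequence id is left unchanged.
 */
bool sketch_cb_done(struct sketch_session *s, enum sketch_cb_result result)
{
    if (result == SKETCH_CB_TRANSPORT_ERROR)
        return true;  /* never seen by the client: same id again */

    s->seq_nr++;      /* seen and answered, even if with an error */
    return false;
}
```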

    Signed-off-by: Christoph Hellwig
    Signed-off-by: J. Bruce Fields

    Christoph Hellwig
     
  • For the sake of forgetful clients, the server should return the layouts
    to the file system on 'last close' of a file (assuming that there are no
    delegations outstanding to that particular client) or on delegreturn
    (assuming that there are no opens on a file from that particular
    client).

    In theory the information is all there in current data structures, but
    it's not efficiently available; nfs4_file->fi_ref includes references on
    the file across all clients, but we need a per-(client, file) count.
    Walking through lots of stateids to calculate this on each close or
    delegreturn would be painful.

    This patch introduces infrastructure to maintain per-client opens and
    delegation counters on a per-file basis.
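
    To make the idea concrete, here is a small sketch of one counter pair per
    (client, file). The structure and function names are hypothetical and do
    not reflect the actual nfsd data structures; the "return layouts"
    decision simply follows the two conditions stated above.

```c
#include <stdbool.h>

/* Hypothetical counters kept for one (client, file) pair. */
struct sketch_clnt_file {
    unsigned int nr_opens;   /* opens held by this client on this file */
    unsigned int nr_delegs;  /* delegations held by this client on this file */
};

void sketch_open(struct sketch_clnt_file *cf)
{
    cf->nr_opens++;
}

void sketch_deleg_grant(struct sketch_clnt_file *cf)
{
    cf->nr_delegs++;
}

/* Returns true when this was the last close and no delegation remains. */
bool sketch_close(struct sketch_clnt_file *cf)
{
    if (cf->nr_opens > 0)
        cf->nr_opens--;
    return cf->nr_opens == 0 && cf->nr_delegs == 0;
}

/* Returns true when no delegation and no open remains after the return. */
bool sketch_delegreturn(struct sketch_clnt_file *cf)
{
    if (cf->nr_delegs > 0)
        cf->nr_delegs--;
    return cf->nr_delegs == 0 && cf->nr_opens == 0;
}
```

    A true result is the point at which a server could hand the client's
    layouts for that file back to the file system, without walking any
    stateid lists.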

    [hch: ported to the mainline pNFS support, merged various fixes from Jeff]
    Signed-off-by: Sachin Bhamare
    Signed-off-by: Jeff Layton
    Signed-off-by: Christoph Hellwig
    Signed-off-by: J. Bruce Fields

    Sachin Bhamare
     
  • If we find a non-confirmed openowner we jump to exit the function, but do
    not set an error value. Fix this by factoring out a helper to do the
    check and properly set the error from nfsd4_validate_stateid.

    Cc: stable@vger.kernel.org
    Signed-off-by: Christoph Hellwig
    Signed-off-by: J. Bruce Fields

    Christoph Hellwig