Eric Lee / smarc-fsl-linux-kernel

17 May, 2015

3 commits

92752b5cd Merge branch 'for-linus-4.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml

Pull UML hostfs fix from Richard Weinberger:
"This contains a single fix for a regression introduced in 4.1-rc1"

* 'for-linus-4.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml:
hostfs: Use correct mask for file mode

Linus Torvalds
2015-05-17 07:33:59 +0800
6a8098a44 Merge tag 'for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
"Fix a number of ext4 bugs; the most serious of which is a bug in the
lazytime mount optimization code where we could end up updating the
timestamps to the wrong inode"

* tag 'for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: fix an ext3 collapse range regression in xfstests
jbd2: fix r_count overflows leading to buffer overflow in journal recovery
ext4: check for zero length extent explicitly
ext4: fix NULL pointer dereference when journal restart fails
ext4: remove unused function prototype from ext4.h
ext4: don't save the error information if the block device is read-only
ext4: fix lazytime optimization

Linus Torvalds
2015-05-17 06:55:31 +0800
c7309e88a Merge branch 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs

Pull btrfs fixes from Chris Mason:
"The first commit is a fix from Filipe for a very old extent buffer
reuse race that triggered a BUG_ON. It hasn't come up often, I looked
through old logs at FB and we hit it a handful of times over the last
year.

The rest are other corners he hit during testing"

* 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix race when reusing stale extent buffers that leads to BUG_ON
Btrfs: fix race between block group creation and their cache writeout
Btrfs: fix panic when starting bg cache writeout after IO error
Btrfs: fix crash after inode cache writeback failure

Linus Torvalds
2015-05-17 06:50:58 +0800

16 May, 2015

1 commit

4b470f120 Merge branch 'parisc-4.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux

Pull parisc fixes from Helge Deller:
"One important patch which fixes crashes due to stack randomization on
architectures where the stack grows upwards (currently parisc and
metag only).

This bug went unnoticed on parisc since kernel 3.14 where the flexible
mmap memory layout support was added by commit 9dabf60dc4ab. The
changes in fs/exec.c are inside an #ifdef CONFIG_STACK_GROWSUP section
and will not affect other platforms.

The other two patches rename an argument of copy_thread() to
kthread_arg and fix a printk output"

* 'parisc-4.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc,metag: Fix crashes due to stack randomization on stack-grows-upwards architectures
parisc: copy_thread(): rename 'arg' argument to 'kthread_arg'
parisc: %pf is only for function pointers

Linus Torvalds
2015-05-16 04:06:06 +0800

15 May, 2015

8 commits

b9576fc36 ext4: fix an ext3 collapse range regression in xfstests

The xfstests test suite assumes that an attempt to collapse range on
the range (0, 1) will return EOPNOTSUPP if the file system does not
support collapse range. Commit 280227a75b56: "ext4: move check under
lock scope to close a race" broke this, and this caused xfstests to
fail when testing file systems that did not have the extents
feature enabled.

Reported-by: Eric Whitney
Signed-off-by: Theodore Ts'o

Theodore Ts'o
2015-05-15 12:24:10 +0800
499611ed4 kernfs: do not account ino_ida allocations to memcg

root->ino_ida is used for kernfs inode number allocations. Since IDA has
a layered structure, different IDs can reside on the same layer, which
is currently accounted to some memory cgroup. The problem is that each
kmem cache of a memory cgroup has its own directory on sysfs (under
/sys/fs/kernel//cgroup). If the inode number of such a
directory or any file in it gets allocated from a layer accounted to the
cgroup which the cache is created for, the cgroup will get pinned for
good, because one has to free all kmem allocations accounted to a cgroup
in order to release it and destroy all its kmem caches. That said we
must not account layers of ino_ida to any memory cgroup.

Since per net init operations may create new sysfs entries directly
(e.g. lo device) or indirectly (nf_conntrack creates a new kmem cache
per each namespace, which, in turn, creates new sysfs entries), an easy
way to reproduce this issue is by creating network namespace(s) from
inside a kmem-active memory cgroup.

Signed-off-by: Vladimir Davydov
Acked-by: Tejun Heo
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: David Rientjes
Cc: Joonsoo Kim
Cc: Greg Thelen
Cc: Greg Kroah-Hartman
Cc: [4.0.x]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vladimir Davydov
2015-05-15 08:55:51 +0800
e531d0bce jbd2: fix r_count overflows leading to buffer overflow in journal recovery

The journal revoke block recovery code does not check r_count for
sanity, which means that an evil value of r_count could result in
the kernel reading off the end of the revoke table and into whatever
garbage lies beyond. This could crash the kernel, so fix that.

However, in testing this fix, I discovered that the code to write
out the revoke tables also was not correctly checking to see if the
block was full -- the current offset check is fine so long as the
revoke table space size is a multiple of the record size, but this
is not true when either journal_csum_v[23] are set.
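
The two checks described above can be sketched roughly as follows. This is a minimal illustration, not the jbd2 code: the `revoke_header` layout, the function names, and the idea of a reserved tail (for example a checksum) are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical header: r_count is the number of bytes used in the block,
 * including the header itself. */
struct revoke_header {
	uint32_t magic;
	uint32_t r_count;
};

/* Reject an r_count that is smaller than the header or that would run
 * past the usable part of the block (block_size minus a reserved tail). */
static bool revoke_count_valid(const struct revoke_header *hdr,
			       size_t block_size, size_t tail_size)
{
	if (tail_size > block_size ||
	    block_size - tail_size < sizeof(*hdr))
		return false;
	return hdr->r_count >= sizeof(*hdr) &&
	       hdr->r_count <= block_size - tail_size;
}

/* True if one more record of record_size bytes fits at offset. A test of
 * the form offset == usable only works when usable is a multiple of
 * record_size, which is the situation described in the text. */
static bool revoke_record_fits(size_t offset, size_t record_size,
			       size_t usable)
{
	return offset <= usable && record_size <= usable - offset;
}
```

With a 1024-byte block and a 4-byte tail, 8-byte records no longer divide the usable space evenly, so the writer has to compare against the remaining space rather than wait for an exact match.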

Signed-off-by: Darrick J. Wong
Signed-off-by: Theodore Ts'o
Reviewed-by: Jan Kara
Cc: stable@vger.kernel.org

Darrick J. Wong
2015-05-15 07:11:50 +0800
2f974865f ext4: check for zero length extent explicitly

The following commit introduced a bug when checking for zero length extent

5946d08 ext4: check for overlapping extents in ext4_valid_extent_entries()

A zero-length extent could pass the check if lblock is zero.

Add the explicit check for zero length back.
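
A small sketch of how a zero-length extent can slip through when lblock is zero. It assumes the validity check computes the last block as `lblock + len - 1` in unsigned 32-bit arithmetic; the function names are illustrative and are not the ext4 ones.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Implicit check: relies on "last < lblock" to catch bad lengths. */
static bool extent_ok_implicit(uint32_t lblock, uint32_t len)
{
	/* For lblock == 0 and len == 0 this wraps around to UINT32_MAX. */
	uint32_t last = lblock + len - 1;

	return lblock <= last;
}

/* Explicit check: reject zero length before looking at the range. */
static bool extent_ok_explicit(uint32_t lblock, uint32_t len)
{
	uint32_t last = lblock + len - 1;

	return len != 0 && lblock <= last;
}
```

For any non-zero lblock the implicit version happens to reject a zero length, because last ends up one below lblock. Only lblock == 0 wraps and passes.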

Signed-off-by: Eryu Guan
Signed-off-by: Theodore Ts'o
Cc: stable@vger.kernel.org

Eryu Guan
2015-05-15 07:00:45 +0800
9d5065940 ext4: fix NULL pointer dereference when journal restart fails

Currently when journal restart fails, we'll have the h_transaction of
the handle set to NULL to indicate that the handle has been effectively
aborted. We handle this situation quietly in the jbd2_journal_stop() and just
free the handle and exit because everything else has been done before we
attempted (and failed) to restart the journal.

Unfortunately there are a number of problems with that approach
introduced with commit

41a5b913197c "jbd2: invalidate handle if jbd2_journal_restart()
fails"

First of all in ext4 jbd2_journal_stop() will be called through
__ext4_journal_stop() where we would try to get a hold of the superblock
by dereferencing h_transaction which in this case would lead to NULL
pointer dereference and crash.

In addition we're going to free the handle regardless of the refcount
which is bad as well, because others up the call chain will still
reference the handle so we might potentially reference already freed
memory.

Moreover it's expected that we'll get an aborted handle as well as a detached
handle in some of the journalling functions as the error propagates up
the stack, so it's unnecessary to call WARN_ON every time we get a
detached handle.

And finally we might leak some memory by forgetting to free the reserved
handle in jbd2_journal_stop() in the case where the handle was detached from
the transaction (h_transaction is NULL).

Fix the NULL pointer dereference in __ext4_journal_stop() by just
calling jbd2_journal_stop() quietly as suggested by Jan Kara. Also fix
the potential memory leak in jbd2_journal_stop() and use proper
handle refcounting before we attempt to free it to avoid use-after-free
issues.

And finally remove all WARN_ON(!transaction) from the code so that we do
not get random traces when something goes wrong because when journal
restart fails we will get to some of those functions.

Cc: stable@vger.kernel.org
Signed-off-by: Lukas Czerner
Signed-off-by: Theodore Ts'o
Reviewed-by: Jan Kara

Lukas Czerner
2015-05-15 06:55:18 +0800
92c826391 ext4: remove unused function prototype from ext4.h

The ext4_extent_tree_init() function hasn't been in the ext4 code for
a long time, except as an unused function prototype in ext4.h.

Google-Bug-Id: 4530137
Signed-off-by: Theodore Ts'o

Theodore Ts'o
2015-05-15 06:43:36 +0800
1b46617b8 ext4: don't save the error information if the block device is read-only

Google-Bug-Id: 20939131
Signed-off-by: Theodore Ts'o

Theodore Ts'o
2015-05-15 06:37:30 +0800
8f4d85583 ext4: fix lazytime optimization

We had a fencepost error in the lazytime optimization which meant that
the timestamp would get written to the wrong inode.

Cc: stable@vger.kernel.org
Signed-off-by: Theodore Ts'o

Theodore Ts'o
2015-05-15 06:19:01 +0800

13 May, 2015

1 commit

d045c77c1 parisc,metag: Fix crashes due to stack randomization on stack-grows-upwards architectures

On architectures where the stack grows upwards (CONFIG_STACK_GROWSUP=y,
currently parisc and metag only) stack randomization sometimes leads to crashes
when the stack ulimit is set to lower values than STACK_RND_MASK (which is 8 MB
by default if not defined in arch-specific headers).

The problem is that when the stack vm_area_struct is set up in fs/exec.c, the
additional space needed for the stack randomization (as defined by the value of
STACK_RND_MASK) was not taken into account yet and as such, when the stack
randomization code added a random offset to the stack start, the stack
effectively got smaller than what the user defined via rlimit_max(RLIMIT_STACK)
which then sometimes leads to out-of-stack situations and crashes.

This patch fixes it by adding the maximum possible amount of memory (based on
STACK_RND_MASK) which theoretically could be added by the stack randomization
code to the initial stack size. That way, the user-defined stack size is always
guaranteed to be at minimum what is defined via rlimit_max(RLIMIT_STACK).
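
The arithmetic behind the fix can be sketched in user space as below. The constants are assumptions for illustration (4 KiB pages and a mask of 0x7ff pages, which is close to the 8 MB mentioned above), and the function names are hypothetical.

```c
#include <assert.h>

#define PAGE_SHIFT_EX     12UL     /* assumed: 4 KiB pages */
#define STACK_RND_MASK_EX 0x7ffUL  /* assumed: 2047 pages, about 8 MiB */

/* Size to reserve for the stack area: the (capped) rlimit plus the largest
 * offset that randomization could add. */
static unsigned long stack_reserve(unsigned long rlimit,
				   unsigned long size_max)
{
	unsigned long size = rlimit > size_max ? size_max : rlimit;

	return size + (STACK_RND_MASK_EX << PAGE_SHIFT_EX);
}

/* Stack space left once a random offset of rnd_pages pages is applied. */
static unsigned long usable_after_rnd(unsigned long reserved,
				      unsigned long rnd_pages)
{
	return reserved - ((rnd_pages & STACK_RND_MASK_EX) << PAGE_SHIFT_EX);
}
```

Even with the largest possible random offset, the space left is exactly the requested limit, so a small stack ulimit can no longer be eaten up by randomization.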

This bug is currently not visible on the metag architecture, because on metag
STACK_RND_MASK is defined to 0 which effectively disables stack randomization.

The changes to fs/exec.c are inside an "#ifdef CONFIG_STACK_GROWSUP"
section, so it does not affect other platforms besides those where the
stack grows upwards (parisc and metag).

Signed-off-by: Helge Deller
Cc: linux-parisc@vger.kernel.org
Cc: James Hogan
Cc: linux-metag@vger.kernel.org
Cc: stable@vger.kernel.org # v3.16+

Helge Deller
2015-05-13 04:03:44 +0800

12 May, 2015

1 commit

4cfceaf0c Merge branch 'for-4.1' of git://linux-nfs.org/~bfields/linux

Pull nfsd bugfixes from Bruce Fields:
"Mainly pnfs fixes (and for problems with generic callback code made
more obvious by pnfs)"

* 'for-4.1' of git://linux-nfs.org/~bfields/linux:
nfsd: skip CB_NULL probes for 4.1 or later
nfsd: fix callback restarts
nfsd: split transport vs operation errors for callbacks
svcrpc: fix potential GSSX_ACCEPT_SEC_CONTEXT decoding failures
nfsd: fix pNFS return on close semantics
nfsd: fix the check for confirmed openowner in nfs4_preprocess_stateid_op
nfsd/blocklayout: pretend we can send deviceid notifications

Linus Torvalds
2015-05-12 05:42:52 +0800

11 May, 2015

4 commits

062c19e9d Btrfs: fix race when reusing stale extent buffers that leads to BUG_ON

There's a race between releasing extent buffers that are flagged as stale
and recycling them that makes us hit the following BUG_ON at
btrfs_release_extent_buffer_page:

BUG_ON(extent_buffer_under_io(eb))

The BUG_ON is triggered because the extent buffer has the flag
EXTENT_BUFFER_DIRTY set as a consequence of having been reused and made
dirty by another concurrent task.

Here follows a sequence of steps that leads to the BUG_ON.

Steps are listed in chronological order, grouped by the CPU that runs them:

CPU 0:
    path->nodes[0] == eb X
    X->refs == 2 (1 for the tree, 1 for the path)
    btrfs_header_generation(X) == current trans id
    flag EXTENT_BUFFER_DIRTY set on X

    btrfs_release_path(path)
        unlocks X

CPU 1:
    reads eb X
        X->refs incremented to 3
    locks eb X
    btrfs_del_items(X)
        X becomes empty
        clean_tree_block(X)
            clear EXTENT_BUFFER_DIRTY from X
        btrfs_del_leaf(X)
            unlocks X
            extent_buffer_get(X)
                X->refs incremented to 4
            btrfs_free_tree_block(X)
                X's range is not pinned
                X's range added to free space cache
            free_extent_buffer_stale(X)
                lock X->refs_lock
                set EXTENT_BUFFER_STALE on X
                release_extent_buffer(X)
                    X->refs decremented to 3
                    unlocks X->refs_lock
    btrfs_release_path()
        unlocks X
        free_extent_buffer(X)
            X->refs becomes 2

CPU 2:
    __btrfs_cow_block(Y)
        btrfs_alloc_tree_block()
            btrfs_reserve_extent()
                find_free_extent()
                    gets offset == X->start
            btrfs_init_new_buffer(X->start)
                btrfs_find_create_tree_block(X->start)
                    alloc_extent_buffer(X->start)
                        find_extent_buffer(X->start)
                            finds eb X in radix tree

CPU 0:
    free_extent_buffer(X)
        lock X->refs_lock
            test X->refs == 2
            test bit EXTENT_BUFFER_STALE is set
            test !extent_buffer_under_io(eb)

CPU 2:
                            increments X->refs to 3
                            mark_extent_buffer_accessed(X)
                                check_buffer_tree_ref(X)
                                    --> does nothing, X->refs >= 2 and
                                        EXTENT_BUFFER_TREE_REF is set in X
            clear EXTENT_BUFFER_STALE from X
            locks X
            btrfs_mark_buffer_dirty()
                set_extent_buffer_dirty(X)
                    check_buffer_tree_ref(X)
                        --> does nothing, X->refs >= 2 and
                            EXTENT_BUFFER_TREE_REF is set
                    sets EXTENT_BUFFER_DIRTY on X

CPU 0:
            test and clear EXTENT_BUFFER_TREE_REF
            decrements X->refs to 2
        release_extent_buffer(X)
            decrements X->refs to 1
            unlock X->refs_lock

CPU 2:
    unlock X
    free_extent_buffer(X)
        lock X->refs_lock
        release_extent_buffer(X)
            decrements X->refs to 0
            btrfs_release_extent_buffer_page(X)
                BUG_ON(extent_buffer_under_io(X))
                    --> EXTENT_BUFFER_DIRTY set on X

Fix this by making find_extent_buffer wait for any ongoing task currently
executing free_extent_buffer()/free_extent_buffer_stale() if the extent
buffer has the stale flag set.
A cleaner alternative would be to always increment the extent buffer's
reference count while holding its refs_lock spinlock but find_extent_buffer
is a performance critical area and that would cause lock contention whenever
multiple tasks search for the same extent buffer concurrently.

A build server running a SLES 12 kernel (3.12 kernel + over 450 upstream
btrfs patches backported from newer kernels) was hitting this often:

[1212302.461948] kernel BUG at ../fs/btrfs/extent_io.c:4507!
(...)
[1212302.470219] CPU: 1 PID: 19259 Comm: bs_sched Not tainted 3.12.36-38-default #1
[1212302.540792] Hardware name: Supermicro PDSM4/PDSM4, BIOS 6.00 04/17/2006
[1212302.540792] task: ffff8800e07e0100 ti: ffff8800d6412000 task.ti: ffff8800d6412000
[1212302.540792] RIP: 0010:[] [] btrfs_release_extent_buffer_page.constprop.51+0x101/0x110 [btrfs]
(...)
[1212302.630008] Call Trace:
[1212302.630008] [] release_extent_buffer+0x3d/0xa0 [btrfs]
[1212302.630008] [] btrfs_release_path+0x1d/0xa0 [btrfs]
[1212302.630008] [] read_block_for_search.isra.33+0x13e/0x3a0 [btrfs]
[1212302.630008] [] btrfs_search_slot+0x3f4/0xa80 [btrfs]
[1212302.630008] [] lookup_inline_extent_backref+0xf8/0x630 [btrfs]
[1212302.630008] [] __btrfs_free_extent+0x11d/0xc40 [btrfs]
[1212302.630008] [] __btrfs_run_delayed_refs+0x394/0x11d0 [btrfs]
[1212302.630008] [] btrfs_run_delayed_refs.part.66+0x69/0x280 [btrfs]
[1212302.630008] [] __btrfs_end_transaction+0x2ad/0x3d0 [btrfs]
[1212302.630008] [] btrfs_evict_inode+0x4a5/0x500 [btrfs]
[1212302.630008] [] evict+0xa8/0x190
[1212302.630008] [] do_unlinkat+0x1a0/0x2b0

I was also able to reproduce this on a 3.19 kernel, corresponding to Chris'
integration branch from about a month ago, running the following stress
test on a qemu/kvm guest (with 4 virtual cpus and 16Gb of ram):

while true; do
    mkfs.btrfs -l 4096 -f -b `expr 20 \* 1024 \* 1024 \* 1024` /dev/sdd
    mount /dev/sdd /mnt
    snapshot_cmd="btrfs subvolume snapshot -r /mnt"
    snapshot_cmd="$snapshot_cmd /mnt/snap_\`date +'%H_%M_%S_%N'\`"
    fsstress -d /mnt -n 25000 -p 8 -x "$snapshot_cmd" -X 100
    umount /mnt
done

Which usually triggers the BUG_ON within less than 24 hours:

[49558.618097] ------------[ cut here ]------------
[49558.619732] kernel BUG at fs/btrfs/extent_io.c:4551!
(...)
[49558.620031] CPU: 3 PID: 23908 Comm: fsstress Tainted: G W 3.19.0-btrfs-next-7+ #3
[49558.620031] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[49558.620031] task: ffff8800319fc0d0 ti: ffff880220da8000 task.ti: ffff880220da8000
[49558.620031] RIP: 0010:[] [] btrfs_release_extent_buffer_page+0x20/0xe9 [btrfs]
(...)
[49558.620031] Call Trace:
[49558.620031] [] release_extent_buffer+0x90/0xd3 [btrfs]
[49558.620031] [] ? _raw_spin_lock+0x3b/0x43
[49558.620031] [] ? free_extent_buffer+0x37/0x94 [btrfs]
[49558.620031] [] free_extent_buffer+0x90/0x94 [btrfs]
[49558.620031] [] btrfs_release_path+0x4a/0x69 [btrfs]
[49558.620031] [] __btrfs_free_extent+0x778/0x80c [btrfs]
[49558.620031] [] __btrfs_run_delayed_refs+0xad2/0xc62 [btrfs]
[49558.728054] [] ? kmemleak_alloc_recursive.constprop.52+0x16/0x18
[49558.728054] [] btrfs_run_delayed_refs+0x6d/0x1ba [btrfs]
[49558.728054] [] ? join_transaction.isra.9+0xb9/0x36b [btrfs]
[49558.728054] [] btrfs_commit_transaction+0x4c/0x981 [btrfs]
[49558.728054] [] btrfs_sync_fs+0xd5/0x10d [btrfs]
[49558.728054] [] ? iterate_supers+0x60/0xc4
[49558.728054] [] ? do_sync_work+0x91/0x91
[49558.728054] [] sync_fs_one_sb+0x20/0x22
[49558.728054] [] iterate_supers+0x76/0xc4
[49558.728054] [] sys_sync+0x55/0x83
[49558.728054] [] system_call_fastpath+0x12/0x17

Signed-off-by: Filipe Manana
Reviewed-by: David Sterba
Signed-off-by: Chris Mason

Filipe Manana
2015-05-11 22:59:11 +0800
ff1f8250a Btrfs: fix race between block group creation and their cache writeout

So creating a block group has 2 distinct phases:

Phase 1 - creates the btrfs_block_group_cache item and adds it to the
rbtree fs_info->block_group_cache_tree and to the corresponding list
space_info->block_groups[];

Phase 2 - adds the block group item to the extent tree and corresponding
items to the chunk tree.

The first phase adds the block_group_cache_item to a list of pending block
groups in the transaction handle, and phase 2 happens when
btrfs_end_transaction() is called against the transaction handle.

It happens that once phase 1 completes, other concurrent tasks that use
their own transaction handle, but point to the same running transaction
(struct btrfs_trans_handle->transaction), can use this block group for
space allocations and therefore mark it dirty. Dirty block groups are
tracked in a list belonging to the currently running transaction (struct
btrfs_transaction) and not in the transaction handle (btrfs_trans_handle).

This is a problem because once a task calls btrfs_commit_transaction(),
it calls btrfs_start_dirty_block_groups() which will see all dirty block
groups and attempt to start their writeout, including those that are
still attached to the transaction handle of some concurrent task that
hasn't called btrfs_end_transaction() yet - which means those block
groups haven't gone through phase 2 yet and therefore when
write_one_cache_group() is called, it won't find the block group items
in the extent tree and will abort the current transaction with -ENOENT,
turning the fs into readonly mode and requiring a remount.

Fix this by ignoring -ENOENT when looking for block group items in the
extent tree when we attempt to start the writeout of the block group
caches outside the critical section of the transaction commit. We will
try again later during the critical section and if there we still don't
find the block group item in the extent tree, we then abort the current
transaction.
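
The decision described above can be sketched as a small pure function. This is only an illustration of the policy; the enum, the function name, and the `in_critical_section` parameter are hypothetical and do not come from the btrfs sources.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum writeout_action {
	WRITEOUT_PROCEED,      /* block group item found, continue */
	WRITEOUT_RETRY_LATER,  /* not found yet, try again in the commit */
	WRITEOUT_ABORT,        /* abort the transaction */
};

/* lookup_err is the result of searching the extent tree for the block
 * group item: 0 on success or a negative errno value. */
static enum writeout_action
block_group_lookup_action(int lookup_err, bool in_critical_section)
{
	if (lookup_err == 0)
		return WRITEOUT_PROCEED;
	/* Outside the critical section the item may simply not have been
	 * inserted yet by a concurrent task (phase 2 still pending). */
	if (lookup_err == -ENOENT && !in_critical_section)
		return WRITEOUT_RETRY_LATER;
	return WRITEOUT_ABORT;
}
```

Any error other than -ENOENT still aborts in both phases, and -ENOENT inside the critical section is treated as a real inconsistency.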

This issue happened twice, once while running fstests btrfs/067 and once
for btrfs/078, which produced the following trace:

[ 3278.703014] WARNING: CPU: 7 PID: 18499 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x114 [btrfs]()
[ 3278.707329] BTRFS: Transaction aborted (error -2)
(...)
[ 3278.731555] Call Trace:
[ 3278.732396] [] dump_stack+0x4f/0x7b
[ 3278.733860] [] ? console_unlock+0x361/0x3ad
[ 3278.735312] [] warn_slowpath_common+0xa1/0xbb
[ 3278.736874] [] ? __btrfs_abort_transaction+0x52/0x114 [btrfs]
[ 3278.738302] [] warn_slowpath_fmt+0x46/0x48
[ 3278.739520] [] __btrfs_abort_transaction+0x52/0x114 [btrfs]
[ 3278.741222] [] write_one_cache_group+0xae/0xbf [btrfs]
[ 3278.742797] [] btrfs_start_dirty_block_groups+0x170/0x2b2 [btrfs]
[ 3278.744492] [] btrfs_commit_transaction+0x130/0x9c9 [btrfs]
[ 3278.746084] [] ? trace_hardirqs_on+0xd/0xf
[ 3278.747249] [] btrfs_sync_file+0x313/0x387 [btrfs]
[ 3278.748744] [] vfs_fsync_range+0x95/0xa4
[ 3278.749958] [] ? ret_from_sys_call+0x1d/0x58
[ 3278.751218] [] vfs_fsync+0x1c/0x1e
[ 3278.754197] [] do_fsync+0x34/0x4e
[ 3278.755192] [] SyS_fsync+0x10/0x14
[ 3278.756236] [] system_call_fastpath+0x12/0x17
[ 3278.757366] ---[ end trace 9a4d4df4969709aa ]---

Fixes: 1bbc621ef284 ("Btrfs: allow block group cache writeout
outside critical section in commit")

Signed-off-by: Filipe Manana
Signed-off-by: Chris Mason

Filipe Manana
2015-05-11 22:59:10 +0800
28aeeac1d Btrfs: fix panic when starting bg cache writeout after IO error

When waiting for the writeback of block group cache we returned
immediately if there was an error during writeback without waiting
for the ordered extent to complete. This left a short time window
where if some other task attempts to start the writeout for the same
block group cache it can attempt to add a new ordered extent, starting
at the same offset (0) before the previous one is removed from the
ordered tree, causing an ordered tree panic (calls BUG()).

This normally doesn't happen in other write paths, such as buffered
writes or direct IO writes for regular files, since before marking
page ranges dirty we lock the ranges and wait for any ordered extents
within the range to complete first.

Fix this by making btrfs_wait_ordered_range() not return immediately
if it gets an error from the writeback, waiting for all ordered extents
to complete first.
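
The change in control flow amounts to "remember the first error, keep waiting". A minimal sketch under that assumption follows; the function name and the array of per-extent results are made up for the example and stand in for the real wait on each ordered extent.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Wait for every extent in the range and return the first error seen,
 * instead of returning as soon as one error shows up. status[i] stands in
 * for the result of waiting on extent i (0 or a negative errno value);
 * *waited reports how many extents were actually waited for. */
static int wait_all_extents(const int *status, size_t n, size_t *waited)
{
	int ret = 0;
	size_t i;

	*waited = 0;
	for (i = 0; i < n; i++) {
		(*waited)++;
		if (status[i] != 0 && ret == 0)
			ret = status[i];
	}
	return ret;
}
```

An early return on the first error would leave the later extents in the tree, which is the window the text describes.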

This issue happened often when running the fstest btrfs/088 and it's
easy to trigger it by running in a loop until the panic happens:

for ((i = 1; i (...)
(...)
[] btrfs_add_ordered_extent+0x12/0x14 [btrfs]
[17156.864052] [] run_delalloc_nocow+0x5bf/0x747 [btrfs]
[17156.864052] [] run_delalloc_range+0x95/0x353 [btrfs]
[17156.864052] [] writepage_delalloc.isra.16+0xb9/0x13f [btrfs]
[17156.864052] [] __extent_writepage+0x129/0x1f7 [btrfs]
[17156.864052] [] extent_write_cache_pages.isra.15.constprop.28+0x231/0x2f4 [btrfs]
[17156.864052] [] ? __module_text_address+0x12/0x59
[17156.864052] [] ? trace_hardirqs_on+0xd/0xf
[17156.864052] [] extent_writepages+0x4b/0x5c [btrfs]
[17156.864052] [] ? kmem_cache_free+0x9b/0xce
[17156.864052] [] ? btrfs_submit_direct+0x3fc/0x3fc [btrfs]
[17156.864052] [] ? free_extent_state+0x8c/0xc1 [btrfs]
[17156.864052] [] btrfs_writepages+0x28/0x2a [btrfs]
[17156.864052] [] do_writepages+0x23/0x2c
[17156.864052] [] __filemap_fdatawrite_range+0x5a/0x61
[17156.864052] [] filemap_fdatawrite_range+0x13/0x15
[17156.864052] [] btrfs_fdatawrite_range+0x21/0x48 [btrfs]
[17156.864052] [] __btrfs_write_out_cache.isra.14+0x2d9/0x3a7 [btrfs]
[17156.864052] [] ? btrfs_write_out_cache+0x41/0xdc [btrfs]
[17156.864052] [] btrfs_write_out_cache+0x93/0xdc [btrfs]
[17156.864052] [] ? btrfs_start_dirty_block_groups+0x13a/0x2b2 [btrfs]
[17156.864052] [] btrfs_start_dirty_block_groups+0x1d9/0x2b2 [btrfs]
[17156.864052] [] ? trace_hardirqs_on+0xd/0xf
[17156.864052] [] btrfs_commit_transaction+0x130/0x9c9 [btrfs]
[17156.864052] [] btrfs_sync_fs+0xe1/0x12d [btrfs]

Signed-off-by: Filipe Manana
Signed-off-by: Chris Mason

Filipe Manana
2015-05-11 22:59:10 +0800
e43699d4b Btrfs: fix crash after inode cache writeback failure

If the writeback of an inode cache failed we were unnecessarily
attempting to release again the delalloc metadata that we previously
reserved. However attempting to do this a second time triggers an
assertion at drop_outstanding_extent() because we have no more
outstanding extents for our inode cache's inode. If we were able
to start writeback of the cache the reserved metadata space is
released at btrfs_finish_ordered_io(), even if an error happens
during writeback.

So make sure we don't repeat the metadata space release if writeback
started for our inode cache.

This issue was trivial to reproduce by running the fstest btrfs/088
with "-o inode_cache", which triggered the assertion leading to a
BUG() call and requiring a reboot in order to run the remaining
fstests. Trace produced by btrfs/088:

[255289.385904] BTRFS: assertion failed: BTRFS_I(inode)->outstanding_extents >= num_extents, file: fs/btrfs/extent-tree.c, line: 5276
[255289.388094] ------------[ cut here ]------------
[255289.389184] kernel BUG at fs/btrfs/ctree.h:4057!
[255289.390125] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
(...)
[255289.392068] Call Trace:
[255289.392068] [] drop_outstanding_extent+0x3d/0x6d [btrfs]
[255289.392068] [] btrfs_delalloc_release_metadata+0x54/0xe3 [btrfs]
[255289.392068] [] btrfs_write_out_ino_cache+0x95/0xad [btrfs]
[255289.392068] [] btrfs_save_ino_cache+0x275/0x2dc [btrfs]
[255289.392068] [] commit_fs_roots.isra.12+0xaa/0x137 [btrfs]
[255289.392068] [] ? trace_hardirqs_on+0xd/0xf
[255289.392068] [] ? btrfs_commit_transaction+0x4b1/0x9c9 [btrfs]
[255289.392068] [] ? _raw_spin_unlock+0x32/0x46
[255289.392068] [] btrfs_commit_transaction+0x4c0/0x9c9 [btrfs]
(...)

Signed-off-by: Filipe Manana
Signed-off-by: Chris Mason

Filipe Manana
2015-05-11 22:59:10 +0800

10 May, 2015

2 commits

51dfcb076 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace

Pull user-namespace fix from Eric Biederman:
"Eric Windisch recently reported a bug that allows mounting fresh
copies of proc and sysfs when it really should not be allowed. The
code attempted to verify that proc and sysfs were fully visible but
there is a test missing to ensure that the root of the filesystem is
visible. Doh!

The following patch fixes that.

This fixes a containment issue that the docker folks are seeing"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
mnt: Fix fs_fully_visible to verify the root directory is visible

Linus Torvalds
2015-05-10 07:07:14 +0800
7e96c1b0e mnt: Fix fs_fully_visible to verify the root directory is visible

This fixes a dumb bug in fs_fully_visible that allows proc or sys to
be mounted if there is a bind mount of part of /proc/ or /sys/ visible.

Cc: stable@vger.kernel.org
Reported-by: Eric Windisch
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2015-05-10 00:55:50 +0800

09 May, 2015

5 commits

95c607d93 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs fixes from Al Viro:
"A couple of fixes for bugs caught while digging in fs/namei.c. The
first one is a regression from this cycle, the second affects 3.11 and later"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
path_openat(): fix double fput()
namei: d_is_negative() should be checked before ->d_seq validation

Linus Torvalds
2015-05-09 12:39:12 +0800
f15133df0 path_openat(): fix double fput()

path_openat() jumps to the wrong place after do_tmpfile() - it has
already done path_cleanup() (as part of path_lookupat() called by
do_tmpfile()), so doing that again can lead to double fput().

Cc: stable@vger.kernel.org # v3.11+
Signed-off-by: Al Viro

Al Viro
2015-05-09 12:12:48 +0800
766c4cbfa namei: d_is_negative() should be checked before ->d_seq validation

Fetching ->d_inode, verifying ->d_seq and finding d_is_negative() to
be true does *not* mean that inode we'd fetched had been NULL - that
holds only while ->d_seq is still unchanged.

Shift d_is_negative() checks into lookup_fast() prior to ->d_seq
verification.

Reported-by: Steven Rostedt
Tested-by: Steven Rostedt
Signed-off-by: Al Viro

Al Viro
2015-05-09 12:12:35 +0800
af6472881 Merge branch 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs

Pull btrfs fix from Chris Mason:
"When an arm user reported crashes near page_address(page) in my new
code, it became clear that I can't be trusted with GFP masks. Filipe
beat me to the patch, and I'll just be in the corner with my dunce cap
on"

* 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix wrong mapping flags for free space inode

Linus Torvalds
2015-05-09 11:59:02 +0800
1daac193f Merge branch 'for-linus' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
"A collection of fixes since the merge window;

- fix for a double elevator module release, from Chao Yu. Ancient bug.

- the splice() MORE flag fix from Christophe Leroy.

- a fix for NVMe, fixing a patch that went in in the merge window.
From Keith.

- two fixes for blk-mq CPU hotplug handling, from Ming Lei.

- bdi vs blockdev lifetime fix from Neil Brown, fixing an oops in md.

- two blk-mq fixes from Shaohua, fixing a race on queue stop and a
bad merge issue with FUA writes.

- division-by-zero fix for writeback from Tejun.

- a block bounce page accounting fix, making sure we inc/dec after
bouncing so that pre/post IO pages match up. From Wang YanQing"

* 'for-linus' of git://git.kernel.dk/linux-block:
splice: sendfile() at once fails for big files
blk-mq: don't lose requests if a stopped queue restarts
blk-mq: fix FUA request hang
block: destroy bdi before blockdev is unregistered.
block:bounce: fix call inc_|dec_zone_page_state on different pages confuse value of NR_BOUNCE
elevator: fix double release of elevator module
writeback: use |1 instead of +1 to protect against div by zero
blk-mq: fix CPU hotplug handling
blk-mq: fix race between timeout and CPU hotplug
NVMe: Fix VPD B0 max sectors translation

Linus Torvalds
2015-05-09 10:49:35 +0800

08 May, 2015

1 commit

68c2f356c Merge tag 'for-f2fs-4.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs fixes from Jaegeuk Kim:
"Fix a performance regression and a bug"

* tag 'for-f2fs-4.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs:
f2fs: fix wrong error hanlder in f2fs_follow_link
Revert "f2fs: enhance multi-threads performance"

Linus Torvalds
2015-05-08 02:18:34 +0800

07 May, 2015

2 commits

1d3c61c2e Btrfs: fix wrong mapping flags for free space inode

We were passing a flags value that differed from the intention in commit
2b108268006e ("Btrfs: don't use highmem for free space cache pages").

This caused problems on an ARM machine, leaving btrfs unusable there.

Reported-by: Merlijn Wajer
Tested-by: Merlijn Wajer
Signed-off-by: Filipe Manana
Signed-off-by: Chris Mason

Filipe Manana
2015-05-07 08:06:13 +0800
3d54ac9e3 Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Ingo Molnar:
"EFI fixes, an FPU fix, a ticket spinlock boundary condition fix and
two build fixes"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/fpu: Always restore_xinit_state() when use_eager_cpu()
x86: Make cpu_tss available to external modules
efi: Fix error handling in add_sysfs_runtime_map_entry()
x86/spinlocks: Fix regression in spinlock contention detection
x86/mm: Clean up types in xlate_dev_mem_ptr()
x86/efi: Store upper bits of command line buffer address in ext_cmd_line_ptr
efivarfs: Ensure VariableName is NUL-terminated

Linus Torvalds
2015-05-07 01:57:37 +0800

06 May, 2015

5 commits

0ff28d9f4 splice: sendfile() at once fails for big files

When the small program below uses sendfile() to compute the MD5 sums of
some files, big files (over 64 kbytes on a system with 4k pages) get a
wrong MD5 sum, while small files get the correct sum.
The program uses sendfile() to send a file to an AF_ALG socket
for hashing.

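/* md5sum2.c */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/socket.h>
#include <sys/sendfile.h>
#include <linux/if_alg.h>

int main(int argc, char **argv)
{
    /* Bind an AF_ALG socket to the kernel's md5 hash implementation. */
    int sk = socket(AF_ALG, SOCK_SEQPACKET, 0);
    struct stat st;
    struct sockaddr_alg sa = {
        .salg_family = AF_ALG,
        .salg_type = "hash",
        .salg_name = "md5",
    };
    int n;

    bind(sk, (struct sockaddr *)&sa, sizeof(sa));

    for (n = 1; n < argc; n++) {
        int size;
        off_t offset = 0;          /* sendfile() expects an off_t pointer */
        unsigned char buf[4096];   /* unsigned so "%2.2x" prints one byte */
        int fd;
        int sko;
        int i;

        fd = open(argv[n], O_RDONLY);
        sko = accept(sk, NULL, 0);
        fstat(fd, &st);
        size = st.st_size;
        /* Send the whole file to the hash socket in a single call. */
        sendfile(sko, fd, &offset, size);
        /* Read the digest back and print it as hex. */
        size = read(sko, buf, sizeof(buf));
        for (i = 0; i < size; i++)
            printf("%2.2x", buf[i]);
        printf(" %s\n", argv[n]);
        close(fd);
        close(sko);
    }
    exit(0);
}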

The test below uses official Linux patch files. The first result is
from a software-based md5sum. The second result is from the program above.

root@vgoip:~# ls -l patch-3.6.*
-rw-r--r-- 1 root root 64011 Aug 24 12:01 patch-3.6.2.gz
-rw-r--r-- 1 root root 94131 Aug 24 12:01 patch-3.6.3.gz

root@vgoip:~# md5sum patch-3.6.*
b3ffb9848196846f31b2ff133d2d6443 patch-3.6.2.gz
c5e8f687878457db77cb7158c38a7e43 patch-3.6.3.gz

root@vgoip:~# ./md5sum2 patch-3.6.*
b3ffb9848196846f31b2ff133d2d6443 patch-3.6.2.gz
5fd77b24e68bb24dcc72d6e57c64790e patch-3.6.3.gz

After investigation, it appears that sendfile() sends files in blocks
of 64 kbytes (16 times PAGE_SIZE). The problem is that the SPLICE_F_MORE
flag is missing at the end of each block, so the hashing operation
is reset as if the end of the file had been reached.

This patch adds SPLICE_F_MORE to the flags when more data is pending.
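
The per-block decision described above can be sketched in user space as follows. This is a minimal illustration, not the kernel code from the patch: the names ex_chunk_count() and ex_chunk_flags() and the EX_* constants are hypothetical, and EX_FLAG_MORE merely stands in for SPLICE_F_MORE.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative constants, not the kernel's definitions. */
#define EX_PAGE_SIZE   4096u
#define EX_CHUNK_BYTES (16u * EX_PAGE_SIZE)  /* 64 kbytes per block */
#define EX_FLAG_MORE   0x04u                 /* stands in for SPLICE_F_MORE */

/* Number of blocks needed to send `total` bytes. */
static size_t ex_chunk_count(size_t total)
{
    return (total + EX_CHUNK_BYTES - 1) / EX_CHUNK_BYTES;
}

/* Flags for the block that starts `sent` bytes into a `total`-byte file. */
static unsigned ex_chunk_flags(size_t total, size_t sent, unsigned base_flags)
{
    size_t left, len;

    assert(sent < total);
    left = total - sent;
    len = left < EX_CHUNK_BYTES ? left : EX_CHUNK_BYTES;

    /* Data remains after this block: ask the receiver not to finalize. */
    if (sent + len < total)
        return base_flags | EX_FLAG_MORE;
    return base_flags;
}
```

With the file sizes from the test, the 64011-byte file fits in one block and never needs the flag, while the 94131-byte file needs two blocks and the first one must carry it. That matches which file produced the wrong sum.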

With the patch applied, we get the correct sums:

root@vgoip:~# md5sum patch-3.6.*
b3ffb9848196846f31b2ff133d2d6443 patch-3.6.2.gz
c5e8f687878457db77cb7158c38a7e43 patch-3.6.3.gz

root@vgoip:~# ./md5sum2 patch-3.6.*
b3ffb9848196846f31b2ff133d2d6443 patch-3.6.2.gz
c5e8f687878457db77cb7158c38a7e43 patch-3.6.3.gz

Signed-off-by: Christophe Leroy
Signed-off-by: Jens Axboe

Christophe Leroy
2015-05-06 23:27:41 +0800
c102cb097 Merge tag 'efi-urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi into x86/urgent

Pull EFI fixes from Matt Fleming:

* Avoid garbage names in efivarfs due to buggy firmware by zeroing
EFI variable name. (Ross Lagerwall)

* Stop erroneously dropping upper 32 bits of boot command line pointer
in EFI boot stub and stash them in ext_cmd_line_ptr. (Roy Franz)

* Fix double-free bug in error handling code path of EFI runtime map
code. (Dan Carpenter)

Signed-off-by: Ingo Molnar

Ingo Molnar
2015-05-06 14:30:24 +0800
b1432a2a3 ocfs2: dlm: fix race between purge and get lock resource

There is a race window in dlm_get_lock_resource(), which may return a
lock resource which has been purged. This will cause the process to
hang forever in dlmlock() as the ast msg can't be handled due to its
lock resource not existing.

dlm_get_lock_resource {
    ...
    spin_lock(&dlm->spinlock);
    tmpres = __dlm_lookup_lockres_full(dlm, lockid, namelen, hash);
    if (tmpres) {
        spin_unlock(&dlm->spinlock);
        >>>>>>>> race window, dlm_run_purge_list() may run and purge
                 the lock resource
        spin_lock(&tmpres->spinlock);
        ...
        spin_unlock(&tmpres->spinlock);
    }
}
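
One common way to close a window like this is to take the inner lock before dropping the outer one, and to re-check the object's state once the inner lock is held. The sketch below shows that pattern in user space with pthread mutexes. It is an assumption about the general technique, not the change made by this commit, and every name in it (ex_table, ex_res, ex_get(), ex_purge()) is hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct ex_res {
    pthread_mutex_t lock;
    bool purged;   /* set once the resource has been removed */
    int refs;      /* users currently holding the resource */
};

struct ex_table {
    pthread_mutex_t lock;
    struct ex_res *slot;   /* a one-entry "hash table" keeps the sketch short */
};

/* Look up the resource without leaving a gap between the two locks. */
static struct ex_res *ex_get(struct ex_table *t)
{
    struct ex_res *res;

    assert(t != NULL);
    pthread_mutex_lock(&t->lock);
    res = t->slot;
    if (!res) {
        pthread_mutex_unlock(&t->lock);
        return NULL;
    }
    /* Take the resource lock BEFORE dropping the table lock. */
    pthread_mutex_lock(&res->lock);
    pthread_mutex_unlock(&t->lock);

    if (res->purged) {   /* re-check state under the resource lock */
        pthread_mutex_unlock(&res->lock);
        return NULL;
    }
    res->refs++;
    pthread_mutex_unlock(&res->lock);
    return res;
}

/* Purge the resource only if nobody holds a reference to it. */
static void ex_purge(struct ex_table *t)
{
    struct ex_res *res;

    assert(t != NULL);
    pthread_mutex_lock(&t->lock);
    res = t->slot;
    if (res) {
        pthread_mutex_lock(&res->lock);
        if (res->refs == 0) {
            res->purged = true;
            t->slot = NULL;
        }
        pthread_mutex_unlock(&res->lock);
    }
    pthread_mutex_unlock(&t->lock);
}
```

Both functions acquire the table lock first and the resource lock second, so the ordering is consistent and a purge can never run between the lookup and the reference being taken.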

Signed-off-by: Junxiao Bi
Cc: Joseph Qi
Cc: Mark Fasheh
Cc: Joel Becker
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Junxiao Bi
2015-05-06 08:10:11 +0800
d8fd150fe nilfs2: fix sanity check of btree level in nilfs_btree_root_broken()

The range check for b-tree level parameter in nilfs_btree_root_broken()
is wrong; it accepts the case of "level == NILFS_BTREE_LEVEL_MAX" even
though the level is limited to values in the range of 0 to
(NILFS_BTREE_LEVEL_MAX - 1).

Since the level parameter is read from storage device and used to index
nilfs_btree_path array whose element count is NILFS_BTREE_LEVEL_MAX, it
can cause memory overrun during btree operations if the boundary value
is set to the level parameter on device.

This fixes the broken sanity check and adds a comment to clarify that
the upper bound NILFS_BTREE_LEVEL_MAX is exclusive.
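
The difference between an inclusive and an exclusive upper bound can be illustrated with a small stand-alone check. This is a sketch only: EX_LEVEL_MAX is a hypothetical stand-in for NILFS_BTREE_LEVEL_MAX (its value here is arbitrary), and ex_level_ok() and ex_path_slot() are illustrative names, not nilfs2 functions.

```c
#include <assert.h>
#include <stdbool.h>

#define EX_LEVEL_MAX 14   /* hypothetical value; the bound is exclusive */

struct ex_path_entry {
    int index;
};

/* Valid levels are 0 .. EX_LEVEL_MAX - 1, so the comparison must be '<'. */
static bool ex_level_ok(int level)
{
    return level >= 0 && level < EX_LEVEL_MAX;
}

/* Index a path array of EX_LEVEL_MAX elements only with a checked level. */
static struct ex_path_entry *ex_path_slot(struct ex_path_entry *path, int level)
{
    assert(ex_level_ok(level));
    return &path[level];
}
```

A check written as `level <= EX_LEVEL_MAX` would accept the boundary value and let ex_path_slot() address one element past the end of the array, which is the kind of overrun the commit message describes.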

Signed-off-by: Ryusuke Konishi
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ryusuke Konishi
2015-05-06 08:10:11 +0800
f5b697700 configfs: init configfs module earlier at boot time

We need this earlier in the boot process to allow various subsystems to
use configfs (e.g. Industrial I/O, IIO).

Also, debugfs is at core_initcall level and configfs should be on the same
level from an infrastructure point of view.

Signed-off-by: Daniel Baluta
Suggested-by: Lars-Peter Clausen
Reviewed-by: Christoph Hellwig
Cc: Al Viro
Cc: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Daniel Baluta
2015-05-06 08:10:11 +0800

05 May, 2015

7 commits

7263b1bd0 f2fs: fix wrong error hanlder in f2fs_follow_link

page_follow_link_light() returns NULL on error, and its error pointer
remained in nd->path.

Reported-by: Dan Carpenter
Reviewed-by: Chao Yu
Signed-off-by: Jaegeuk Kim

Jaegeuk Kim
2015-05-05 05:15:16 +0800
5463e7c18 Revert "f2fs: enhance multi-threads performance"

Yuanhan Liu reported a performance regression caused by this change.
The basic idea was to reduce contention on a single mutex, but it turns out
this causes other contention, such as context switches.

https://lkml.org/lkml/2015/4/21/11

Until the analysis of this issue is finished, I'd like to revert this for a while.

This reverts commit 78373b7319abdf15050af5b1632c4c8b8b398f33.

Jaegeuk Kim
2015-05-05 05:15:15 +0800
4bd9e9b77 nfsd: skip CB_NULL probes for 4.1 or later

With sessions in v4.1 or later we don't need to manually probe the backchannel
connection, so we can declare it up instantly after setting up the RPC client.

Note that we really should split nfsd4_run_cb_work in the long run; this is
just the least intrusive fix for now.

Signed-off-by: Christoph Hellwig
Signed-off-by: J. Bruce Fields

Christoph Hellwig
2015-05-05 00:02:42 +0800
cba5f62b1 nfsd: fix callback restarts

Checking the rpc_client pointer is not a reliable way to detect
backchannel changes: cl_cb_client is changed only after shutting down
the rpc client, so the condition cl_cb_client == tk_client will always be
true.

Check the RPC_TASK_KILLED flag instead, and rewrite the code to avoid
the buggy cl_callbacks list and fix the lifetime rules due to double
calls of the ->prepare callback operations method for this retry case.

Signed-off-by: Christoph Hellwig
Signed-off-by: J. Bruce Fields

Christoph Hellwig
2015-05-05 00:02:41 +0800
ef2a1b3e1 nfsd: split transport vs operation errors for callbacks

We must only increment the sequence id if the client has seen and responded
to a request. If we failed to deliver it to the client, we must resend with
the same sequence id. So, just like the client, track errors at the transport
level separately from those returned in the XDR.
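
The rule above can be sketched as a small completion handler. This is a simplified illustration under stated assumptions, not the nfsd code: the ex_cb_result values, the ex_cb_session struct, and ex_cb_done() are hypothetical names, and every reply from the client is treated as "seen and responded", whatever status it carries.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum ex_cb_result {
    EX_CB_REPLY_OK,        /* client replied, operation succeeded */
    EX_CB_REPLY_ERROR,     /* client replied with an error in the reply body */
    EX_CB_TRANSPORT_FAIL   /* request may never have reached the client */
};

struct ex_cb_session {
    uint32_t seq_nr;       /* sequence id used for the next callback */
};

/*
 * Handle completion of one callback.
 * Returns true when the request must be resent with the unchanged sequence id.
 */
static bool ex_cb_done(struct ex_cb_session *s, enum ex_cb_result r)
{
    assert(s != NULL);

    /* Transport-level failure: the client never answered, keep seq_nr. */
    if (r == EX_CB_TRANSPORT_FAIL)
        return true;

    /* The client saw the request and answered it: consume the sequence id. */
    s->seq_nr++;
    return false;
}
```

Keeping the two error classes apart is what makes the resend safe: the sequence id moves only on paths where the client is known to have processed the request.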

Signed-off-by: Christoph Hellwig
Signed-off-by: J. Bruce Fields

Christoph Hellwig
2015-05-05 00:02:40 +0800
8287f009b nfsd: fix pNFS return on close semantics

For the sake of forgetful clients, the server should return the layouts
to the file system on 'last close' of a file (assuming that there are no
delegations outstanding to that particular client) or on delegreturn
(assuming that there are no opens on a file from that particular
client).

In theory the information is all there in current data structures, but
it's not efficiently available; nfs4_file->fi_ref includes references on
the file across all clients, but we need a per-(client, file) count.
Walking through lots of stateid's to calculate this on each close or
delegreturn would be painful.

This patch introduces infrastructure to maintain per-client opens and
delegation counters on a per-file basis.
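
A per-(client, file) counter pair of this kind might look like the sketch below. It is an illustration of the idea only: the struct and function names are hypothetical and do not come from the nfsd sources, and the "return layouts" decision is reduced to a boolean result.

```c
#include <assert.h>
#include <stdbool.h>

/* One instance per (client, file) pair; hypothetical layout. */
struct ex_clnt_file {
    unsigned int opens;         /* opens held by this client on this file */
    unsigned int delegations;   /* delegations held by this client */
};

/* Returns true when the layouts should be returned to the file system. */
static bool ex_on_close(struct ex_clnt_file *cf)
{
    assert(cf->opens > 0);
    cf->opens--;
    /* Last close, and no delegation still outstanding for this client. */
    return cf->opens == 0 && cf->delegations == 0;
}

/* Returns true when the layouts should be returned to the file system. */
static bool ex_on_delegreturn(struct ex_clnt_file *cf)
{
    assert(cf->delegations > 0);
    cf->delegations--;
    /* No delegation left, and no open left from this client. */
    return cf->delegations == 0 && cf->opens == 0;
}
```

Because the counts are kept per client, each close or delegreturn is a constant-time check instead of a walk over all of the file's stateids.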

[hch: ported to the mainline pNFS support, merged various fixes from Jeff]
Signed-off-by: Sachin Bhamare
Signed-off-by: Jeff Layton
Signed-off-by: Christoph Hellwig
Signed-off-by: J. Bruce Fields

Sachin Bhamare
2015-05-05 00:02:39 +0800
ebe9cb3bb nfsd: fix the check for confirmed openowner in nfs4_preprocess_stateid_op

If we find a non-confirmed openowner we jump to exit the function, but do
not set an error value. Fix this by factoring out a helper to do the
check and properly set the error from nfsd4_validate_stateid.
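
The bug class described here, jumping to the exit label without assigning a status, is easy to avoid when the check lives in a helper whose return value is the status. The sketch below illustrates that shape. All names and error values are hypothetical and are not taken from nfsd.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum {
    EX_OK = 0,
    EX_ERR_BAD_STATEID = 1   /* hypothetical error code */
};

struct ex_openowner {
    bool confirmed;
};

/* The helper owns both the check and the error value. */
static int ex_check_confirmed(const struct ex_openowner *oo)
{
    assert(oo != NULL);
    return oo->confirmed ? EX_OK : EX_ERR_BAD_STATEID;
}

static int ex_preprocess(const struct ex_openowner *oo)
{
    int status;

    /* The status is always assigned before any jump to the exit label. */
    status = ex_check_confirmed(oo);
    if (status)
        goto out;

    /* ... further processing would go here ... */
out:
    return status;
}
```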

Cc: stable@vger.kernel.org
Signed-off-by: Christoph Hellwig
Signed-off-by: J. Bruce Fields

Christoph Hellwig
2015-05-05 00:02:38 +0800