Eric Lee / smarc-fsl-linux-kernel

18 Aug, 2015

2 commits

74278da9f inode: convert inode_sb_list_lock to per-sb ... Browse Code »

The process of reducing contention on per-superblock inode lists
starts with moving the locking to match the per-superblock inode
list. This takes the global lock out of the picture and reduces the
contention problems to within a single filesystem. This doesn't get
rid of contention as the locks still have global CPU scope, but it
does isolate operations on different superblocks form each other.

Signed-off-by: Dave Chinner
Signed-off-by: Josef Bacik
Reviewed-by: Jan Kara
Reviewed-by: Christoph Hellwig
Tested-by: Dave Chinner

Dave Chinner
2015-08-18 06:39:46 +0800
d353d7587 writeback: plug writeback at a high level ... Browse Code »

Doing writeback on lots of little files causes terrible IOPS storms
because of the per-mapping writeback plugging we do. This
essentially causes imeediate dispatch of IO for each mapping,
regardless of the context in which writeback is occurring.

IOWs, running a concurrent write-lots-of-small 4k files using fsmark
on XFS results in a huge number of IOPS being issued for data
writes. Metadata writes are sorted and plugged at a high level by
XFS, so aggregate nicely into large IOs. However, data writeback IOs
are dispatched in individual 4k IOs, even when the blocks of two
consecutively written files are adjacent.

Test VM: 8p, 8GB RAM, 4xSSD in RAID0, 100TB sparse XFS filesystem,
metadata CRCs enabled.

Kernel: 3.10-rc5 + xfsdev + my 3.11 xfs queue (~70 patches)

Test:

$ ./fs_mark -D 10000 -S0 -n 10000 -s 4096 -L 120 -d
/mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d
/mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d
/mnt/scratch/6 -d /mnt/scratch/7

Result:

wall sys create rate Physical write IO
time CPU (avg files/s) IOPS Bandwidth
----- ----- ------------ ------ ---------
unpatched 6m56s 15m47s 24,000+/-500 26,000 130MB/s
patched 5m06s 13m28s 32,800+/-600 1,500 180MB/s
improvement -26.44% -14.68% +36.67% -94.23% +38.46%

If I use zero length files, this workload at about 500 IOPS, so
plugging drops the data IOs from roughly 25,500/s to 1000/s.
3 lines of code, 35% better throughput for 15% less CPU.

The benefits of plugging at this layer are likely to be higher for
spinning media as the IO patterns for this workload are going make a
much bigger difference on high IO latency devices.....

Signed-off-by: Dave Chinner
Signed-off-by: Josef Bacik
Reviewed-by: Jan Kara
Tested-by: Dave Chinner
Reviewed-by: Christoph Hellwig

Dave Chinner
2015-08-18 06:39:45 +0800

11 Aug, 2015

1 commit

a3ca013d8 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull RCU pathwalk fix from Al Viro:
"Another racy use of nd->path.dentry in RCU mode"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
may_follow_link() should use nd->inode

Linus Torvalds
2015-08-11 01:04:47 +0800

09 Aug, 2015

1 commit

af0b3152b Merge branch 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs ... Browse Code »

Pull btrfs fix from Chris Mason:
"We have a btrfs quota regression fix.

I merged this one on Thursday and have run it through tests against
current master.

Normally I wouldn't have sent this while you were finalizing rc6, but
I'm feeding mosquitoes in the adirondacks next week, so I wanted to
get this one out before leaving. I'll leave longer tests running and
check on things during the week, but I don't expect any problems"

* 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: qgroup: Fix a regression in qgroup reserved space.

Linus Torvalds
2015-08-09 10:56:31 +0800

07 Aug, 2015

7 commits

e1832f292 ipc: use private shmem or hugetlbfs inodes for shm segments. ... Browse Code »

The shm implementation internally uses shmem or hugetlbfs inodes for shm
segments. As these inodes are never directly exposed to userspace and
only accessed through the shm operations which are already hooked by
security modules, mark the inodes with the S_PRIVATE flag so that inode
security initialization and permission checking is skipped.

This was motivated by the following lockdep warning:

======================================================
[ INFO: possible circular locking dependency detected ]
4.2.0-0.rc3.git0.1.fc24.x86_64+debug #1 Tainted: G W
-------------------------------------------------------
httpd/1597 is trying to acquire lock:
(&ids->rwsem){+++++.}, at: shm_close+0x34/0x130
but task is already holding lock:
(&mm->mmap_sem){++++++}, at: SyS_shmdt+0x4b/0x180
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #3 (&mm->mmap_sem){++++++}:
lock_acquire+0xc7/0x270
__might_fault+0x7a/0xa0
filldir+0x9e/0x130
xfs_dir2_block_getdents.isra.12+0x198/0x1c0 [xfs]
xfs_readdir+0x1b4/0x330 [xfs]
xfs_file_readdir+0x2b/0x30 [xfs]
iterate_dir+0x97/0x130
SyS_getdents+0x91/0x120
entry_SYSCALL_64_fastpath+0x12/0x76
-> #2 (&xfs_dir_ilock_class){++++.+}:
lock_acquire+0xc7/0x270
down_read_nested+0x57/0xa0
xfs_ilock+0x167/0x350 [xfs]
xfs_ilock_attr_map_shared+0x38/0x50 [xfs]
xfs_attr_get+0xbd/0x190 [xfs]
xfs_xattr_get+0x3d/0x70 [xfs]
generic_getxattr+0x4f/0x70
inode_doinit_with_dentry+0x162/0x670
sb_finish_set_opts+0xd9/0x230
selinux_set_mnt_opts+0x35c/0x660
superblock_doinit+0x77/0xf0
delayed_superblock_init+0x10/0x20
iterate_supers+0xb3/0x110
selinux_complete_init+0x2f/0x40
security_load_policy+0x103/0x600
sel_write_load+0xc1/0x750
__vfs_write+0x37/0x100
vfs_write+0xa9/0x1a0
SyS_write+0x58/0xd0
entry_SYSCALL_64_fastpath+0x12/0x76
...

Signed-off-by: Stephen Smalley
Reported-by: Morten Stevens
Acked-by: Hugh Dickins
Acked-by: Paul Moore
Cc: Manfred Spraul
Cc: Davidlohr Bueso
Cc: Prarit Bhargava
Cc: Eric Paris
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Stephen Smalley
2015-08-07 09:39:41 +0800
32e5a2a2b ocfs2: fix shift left overflow ... Browse Code »

When using a large volume, for example 9T volume with 2T already used,
frequent creation of small files with O_DIRECT when the IO is not
cluster aligned may clear sectors in the wrong place. This will cause
filesystem corruption.

This is because p_cpos is a u32. When calculating the corresponding
sector it should be converted to u64 first, otherwise it may overflow.

Signed-off-by: Joseph Qi
Cc: Mark Fasheh
Cc: Joel Becker
Cc: [4.0+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Joseph Qi
2015-08-07 09:39:41 +0800
8f2f3eb59 fsnotify: fix oops in fsnotify_clear_marks_by_group_flags() ... Browse Code »

fsnotify_clear_marks_by_group_flags() can race with
fsnotify_destroy_marks() so that when fsnotify_destroy_mark_locked()
drops mark_mutex, a mark from the list iterated by
fsnotify_clear_marks_by_group_flags() can be freed and thus the next
entry pointer we have cached may become stale and we dereference free
memory.

Fix the problem by first moving marks to free to a special private list
and then always free the first entry in the special list. This method
is safe even when entries from the list can disappear once we drop the
lock.

Signed-off-by: Jan Kara
Reported-by: Ashish Sangwan
Reviewed-by: Ashish Sangwan
Cc: Lino Sanfilippo
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jan Kara
2015-08-07 09:39:41 +0800
3ead7c52b signalfd: fix information leak in signalfd_copyinfo ... Browse Code »

This function may copy the si_addr_lsb field to user mode when it hasn't
been initialized, which can leak kernel stack data to user mode.

Just checking the value of si_code is insufficient because the same
si_code value is shared between multiple signals. This is solved by
checking the value of si_signo in addition to si_code.

Signed-off-by: Amanieu d'Antras
Cc: Oleg Nesterov
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Amanieu d'Antras
2015-08-07 09:39:40 +0800
209f7512d ocfs2: fix BUG in ocfs2_downconvert_thread_do_work() ... Browse Code »

The "BUG_ON(list_empty(&osb->blocked_lock_list))" in
ocfs2_downconvert_thread_do_work can be triggered in the following case:

ocfs2dc has firstly saved osb->blocked_lock_count to local varibale
processed, and then processes the dentry lockres. During the dentry
put, it calls iput and then deletes rw, inode and open lockres from
blocked list in ocfs2_mark_lockres_freeing. And this causes the
variable `processed' to not reflect the number of blocked lockres to be
processed, which triggers the BUG.

Signed-off-by: Joseph Qi
Cc: Mark Fasheh
Cc: Joel Becker
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Joseph Qi
2015-08-07 09:39:40 +0800
4248b0da4 fs, file table: reinit files_stat.max_files after deferred memory initialisation ... Browse Code »

Dave Hansen reported the following;

My laptop has been behaving strangely with 4.2-rc2. Once I log
in to my X session, I start getting all kinds of strange errors
from applications and see this in my dmesg:

VFS: file-max limit 8192 reached

The problem is that the file-max is calculated before memory is fully
initialised and miscalculates how much memory the kernel is using. This
patch recalculates file-max after deferred memory initialisation. Note
that using memory hotplug infrastructure would not have avoided this
problem as the value is not recalculated after memory hot-add.

4.1: files_stat.max_files = 6582781
4.2-rc2: files_stat.max_files = 8192
4.2-rc2 patched: files_stat.max_files = 6562467

Small differences with the patch applied and 4.1 but not enough to matter.

Signed-off-by: Mel Gorman
Reported-by: Dave Hansen
Cc: Nicolai Stange
Cc: Dave Hansen
Cc: Alex Ng
Cc: Fengguang Wu
Cc: Peter Zijlstra (Intel)
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mel Gorman
2015-08-07 09:39:40 +0800
c05f9429e btrfs: qgroup: Fix a regression in qgroup reserved space. ... Browse Code »

During the change to new btrfs extent-oriented qgroup implement, due to
it doesn't use the old __qgroup_excl_accounting() for exclusive extent,
it didn't free the reserved bytes.

The bug will cause limit function go crazy as the reserved space is
never freed, increasing limit will have no effect and still cause
EQOUT.

The fix is easy, just free reserved bytes for newly created exclusive
extent as what it does before.

Reported-by: Tsutomu Itoh
Signed-off-by: Yang Dongsheng
Signed-off-by: Qu Wenruo
Signed-off-by: Chris Mason

Qu Wenruo
2015-08-07 05:51:15 +0800

05 Aug, 2015

2 commits

9e91edcd1 Merge branch 'for-4.2' of git://linux-nfs.org/~bfields/linux ... Browse Code »

Pull nfsd fixes from Bruce Fields.

* 'for-4.2' of git://linux-nfs.org/~bfields/linux:
nfsd: do nfs4_check_fh in nfs4_check_file instead of nfs4_check_olstateid
nfsd: Fix a file leak on nfsd4_layout_setlease failure
nfsd: Drop BUG_ON and ignore SECLABEL on absent filesystem

Linus Torvalds
2015-08-05 16:59:59 +0800
aa65fa35b may_follow_link() should use nd->inode ... Browse Code »

Now that we can get there in RCU mode, we shouldn't play with
nd->path.dentry->d_inode - it's not guaranteed to be stable.
Use nd->inode instead.

Reported-by: Hugh Dickins
Signed-off-by: Al Viro

Al Viro
2015-08-05 11:23:50 +0800

04 Aug, 2015

1 commit

7e884479b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client ... Browse Code »

Pull Ceph fixes from Sage Weil:
"There are two critical regression fixes for CephFS from Zheng, and an
RBD completion fix for layered images from Ilya"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
rbd: fix copyup completion race
ceph: always re-send cap flushes when MDS recovers
ceph: fix ceph_encode_locks_to_buffer()

Linus Torvalds
2015-08-04 02:09:07 +0800

02 Aug, 2015

2 commits

01183609a Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull VFS fix from Al Viro:
"Spurious ENOTDIR fix"

This should fix the problems reported by Dominique Martinet and Hugh
Dickins.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
link_path_walk(): be careful when failing with ENOTDIR

Linus Torvalds
2015-08-02 08:42:14 +0800
97242f99a link_path_walk(): be careful when failing with ENOTDIR ... Browse Code »

In RCU mode we might end up with dentry evicted just we check
that it's a directory. In such case we should return ECHILD
rather than ENOTDIR, so that pathwalk would be retries in non-RCU
mode.

Breakage had been introduced in commit b18825a - prior to that
we were looking at nd->inode, which had been fetched before
verifying that ->d_seq was still valid. That form of check
would only be satisfied if at some point the pathname prefix
would indeed have resolved to a non-directory. The fix consists
of checking ->d_seq after we'd run into a non-directory dentry,
and failing with ECHILD in case of mismatch.

Note that all branches since 3.12 have that problem...

Signed-off-by: Al Viro

Al Viro
2015-08-02 08:18:38 +0800

01 Aug, 2015

2 commits

acea568fa Merge branch 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs ... Browse Code »

Pull btrfs fixes from Chris Mason:
"Filipe fixed up a hard to trigger ENOSPC regression from our merge
window pull, and we have a few other smaller fixes"

* 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix quick exhaustion of the system array in the superblock
btrfs: its btrfs_err() instead of btrfs_error()
btrfs: Avoid NULL pointer dereference of free_extent_buffer when read_tree_block() fail
btrfs: Fix lockdep warning of btrfs_run_delayed_iputs()

Linus Torvalds
2015-08-01 08:05:37 +0800
8fcd461db nfsd: do nfs4_check_fh in nfs4_check_file instead of nfs4_check_olstateid ... Browse Code »

Currently, preprocess_stateid_op calls nfs4_check_olstateid which
verifies that the open stateid corresponds to the current filehandle in the
call by calling nfs4_check_fh.

If the stateid is a NFS4_DELEG_STID however, then no such check is done.
This could cause incorrect enforcement of permissions, because the
nfsd_permission() call in nfs4_check_file uses current the current
filehandle, but any subsequent IO operation will use the file descriptor
in the stateid.

Move the call to nfs4_check_fh into nfs4_check_file instead so that it
can be done for all stateid types.

Signed-off-by: Jeff Layton
Cc: stable@vger.kernel.org
[bfields: moved fh check to avoid NULL deref in special stateid case]
Signed-off-by: J. Bruce Fields

Jeff Layton
2015-08-01 04:30:26 +0800

31 Jul, 2015

3 commits

fc927cd32 ceph: always re-send cap flushes when MDS recovers ... Browse Code »

commit e548e9b93d3e565e42b938a99804114565be1f81 makes the kclient
only re-send cap flush once during MDS failover. If the kclient sends
a cap flush after MDS enters reconnect stage but before MDS recovers.
The kclient will skip re-sending the same cap flush when MDS recovers.

This causes problem for newly created inode. The MDS handles cap
flushes before replaying unsafe requests, so it's possible that MDS
find corresponding inode is missing when handling cap flush. The fix
is reverting to old behaviour: always re-send when MDS recovers

Signed-off-by: Yan, Zheng
Signed-off-by: Ilya Dryomov

Yan, Zheng
2015-07-31 16:38:53 +0800
f6762cb2c ceph: fix ceph_encode_locks_to_buffer() ... Browse Code »

posix locks should be in ctx->flc_posix list

Signed-off-by: Yan, Zheng
Signed-off-by: Ilya Dryomov

Yan, Zheng
2015-07-31 16:38:47 +0800
840093573 Merge tag 'xfs-for-linus-4.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs ... Browse Code »

Pull xfs fixes from Dave Chinner:
"There are a couple of recently found, long standing remote attribute
corruption fixes caused by log recovery getting confused after a
crash, and the new DAX code in XFS (merged in 4.2-rc1) needs to
actually use the DAX fault path on read faults.

Summary:

- remote attribute log recovery corruption fixes

- DAX page faults need to use direct mappings, not a page cache
mapping"

* tag 'xfs-for-linus-4.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
xfs: remote attributes need to be considered data
xfs: remote attribute headers contain an invalid LSN
xfs: call dax_fault on read page faults for DAX

Linus Torvalds
2015-07-31 11:36:49 +0800

29 Jul, 2015

4 commits

df150ed10 xfs: remote attributes need to be considered data ... Browse Code »

We don't log remote attribute contents, and instead write them
synchronously before we commit the block allocation and attribute
tree update transaction. As a result we are writing to the allocated
space before the allcoation has been made permanent.

As a result, we cannot consider this allocation to be a metadata
allocation. Metadata allocation can take blocks from the free list
and so reuse them before the transaction that freed the block is
committed to disk. This behaviour is perfectly fine for journalled
metadata changes as log recovery will ensure the free operation is
replayed before the overwrite, but for remote attribute writes this
is not the case.

Hence we have to consider the remote attribute blocks to contain
data and allocate accordingly. We do this by dropping the
XFS_BMAPI_METADATA flag from the block allocation. This means the
allocation will not use blocks that are on the busy list without
first ensuring that the freeing transaction has been committed to
disk and the blocks removed from the busy list. This ensures we will
never overwrite a freed block without first ensuring that it is
really free.

cc:
Signed-off-by: Dave Chinner
Reviewed-by: Brian Foster
Signed-off-by: Dave Chinner

Dave Chinner
2015-07-29 09:48:02 +0800
e3c32ee9e xfs: remote attribute headers contain an invalid LSN ... Browse Code »

In recent testing, a system that crashed failed log recovery on
restart with a bad symlink buffer magic number:

XFS (vda): Starting recovery (logdev: internal)
XFS (vda): Bad symlink block magic!
XFS: Assertion failed: 0, file: fs/xfs/xfs_log_recover.c, line: 2060

On examination of the log via xfs_logprint, none of the symlink
buffers in the log had a bad magic number, nor were any other types
of buffer log format headers mis-identified as symlink buffers.
Tracing was used to find the buffer the kernel was tripping over,
and xfs_db identified it's contents as:

000: 5841524d 00000000 00000346 64d82b48 8983e692 d71e4680 a5f49e2c b317576e
020: 00000000 00602038 00000000 006034ce d0020000 00000000 4d4d4d4d 4d4d4d4d
040: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
060: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
.....

This is a remote attribute buffer, which are notable in that they
are not logged but are instead written synchronously by the remote
attribute code so that they exist on disk before the attribute
transactions are committed to the journal.

The above remote attribute block has an invalid LSN in it - cycle
0xd002000, block 0 - which means when log recovery comes along to
determine if the transaction that writes to the underlying block
should be replayed, it sees a block that has a future LSN and so
does not replay the buffer data in the transaction. Instead, it
validates the buffer magic number and attaches the buffer verifier
to it. It is this buffer magic number check that is failing in the
above assert, indicating that we skipped replay due to the LSN of
the underlying buffer.

The problem here is that the remote attribute buffers cannot have a
valid LSN placed into them, because the transaction that contains
the attribute tree pointer changes and the block allocation that the
attribute data is being written to hasn't yet been committed. Hence
the LSN field in the attribute block is completely unwritten,
thereby leaving the underlying contents of the block in the LSN
field. It could have any value, and hence a future overwrite of the
block by log recovery may or may not work correctly.

Fix this by always writing an invalid LSN to the remote attribute
block, as any buffer in log recovery that needs to write over the
remote attribute should occur. We are protected from having old data
written over the attribute by the fact that freeing the block before
the remote attribute is written will result in the buffer being
marked stale in the log and so all changes prior to the buffer stale
transaction will be cancelled by log recovery.

Hence it is safe to ignore the LSN in the case or synchronously
written, unlogged metadata such as remote attribute blocks, and to
ensure we do that correctly, we need to write an invalid LSN to all
remote attribute blocks to trigger immediate recovery of metadata
that is written over the top.

As a further protection for filesystems that may already have remote
attribute blocks with bad LSNs on disk, change the log recovery code
to always trigger immediate recovery of metadata over remote
attribute blocks.

cc:
Signed-off-by: Dave Chinner
Reviewed-by: Brian Foster
Signed-off-by: Dave Chinner

Dave Chinner
2015-07-29 09:48:01 +0800
b2442c5a7 xfs: call dax_fault on read page faults for DAX ... Browse Code »

When modifying the patch series to handle the XFS MMAP_LOCK nesting
of page faults, I botched the conversion of the read page fault
path, and so it is only every calling through the page cache. Re-add
the necessary __dax_fault() call for such files.

Because the get_blocks callback on read faults may not set up the
mapping buffer correctly to allow unwritten extent completion to be
run, we need to allow callers of __dax_fault() to pass a null
complete_unwritten() callback. The DAX code always zeros the
unwritten page when it is read faulted so there are no stale data
exposure issues with not doing the conversion. The only downside
will be the potential for increased CPU overhead on repeated read
faults of the same page. If this proves to be a problem, then the
filesystem needs to fix it's get_block callback and provide a
convert_unwritten() callback to the read fault path.

Signed-off-by: Dave Chinner
Reviewed-by: Matthew Wilcox
Reviewed-by: Brian Foster
Signed-off-by: Dave Chinner

Dave Chinner
2015-07-29 09:48:00 +0800
d8132e08d Merge tag 'nfs-for-4.2-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs ... Browse Code »

Pull NFS client bugfixes from Trond Myklebust:
"Highlights include:

Stable patches:
- Fix a situation where the client uses the wrong (zero) stateid.
- Fix a memory leak in nfs_do_recoalesce

Bugfixes:
- Plug a memory leak when ->prepare_layoutcommit fails
- Fix an Oops in the NFSv4 open code
- Fix a backchannel deadlock
- Fix a livelock in sunrpc when sendmsg fails due to low memory
availability
- Don't revalidate the mapping if both size and change attr are up to
date
- Ensure we don't miss a file extension when doing pNFS
- Several fixes to handle NFSv4.1 sequence operation status bits
correctly
- Several pNFS layout return bugfixes"

* tag 'nfs-for-4.2-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (28 commits)
nfs: Fix an oops caused by using other thread's stack space in ASYNC mode
nfs: plug memory leak when ->prepare_layoutcommit fails
SUNRPC: Report TCP errors to the caller
sunrpc: translate -EAGAIN to -ENOBUFS when socket is writable.
NFSv4.2: handle NFS-specific llseek errors
NFS: Don't clear desc->pg_moreio in nfs_do_recoalesce()
NFS: Fix a memory leak in nfs_do_recoalesce
NFS: nfs_mark_for_revalidate should always set NFS_INO_REVAL_PAGECACHE
NFS: Remove the "NFS_CAP_CHANGE_ATTR" capability
NFS: Set NFS_INO_REVAL_PAGECACHE if the change attribute is uninitialised
NFS: Don't revalidate the mapping if both size and change attr are up to date
NFSv4/pnfs: Ensure we don't miss a file extension
NFSv4: We must set NFS_OPEN_STATE flag in nfs_resync_open_stateid_locked
SUNRPC: xprt_complete_bc_request must also decrement the free slot count
SUNRPC: Fix a backchannel deadlock
pNFS: Don't throw out valid layout segments
pNFS: pnfs_roc_drain() fix a race with open
pNFS: Fix races between return-on-close and layoutreturn.
pNFS: pnfs_roc_drain should return 'true' when sleeping
pNFS: Layoutreturn must invalidate all existing layout segments.
...

Linus Torvalds
2015-07-29 00:37:44 +0800

28 Jul, 2015

2 commits

a49c26911 nfs: Fix an oops caused by using other thread's stack space in ASYNC mode ... Browse Code »

An oops caused by using other thread's stack space in sunrpc ASYNC sending thread.

[ 9839.007187] ------------[ cut here ]------------
[ 9839.007923] kernel BUG at fs/nfs/nfs4xdr.c:910!
[ 9839.008069] invalid opcode: 0000 [#1] SMP
[ 9839.008069] Modules linked in: blocklayoutdriver rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm joydev iosf_mbi crct10dif_pclmul snd_timer crc32_pclmul crc32c_intel ghash_clmulni_intel snd soundcore ppdev pvpanic parport_pc i2c_piix4 serio_raw virtio_balloon parport acpi_cpufreq nfsd nfs_acl lockd grace auth_rpcgss sunrpc qxl drm_kms_helper virtio_net virtio_console virtio_blk ttm drm virtio_pci virtio_ring virtio ata_generic pata_acpi
[ 9839.008069] CPU: 0 PID: 308 Comm: kworker/0:1H Not tainted 4.0.0-0.rc4.git1.3.fc23.x86_64 #1
[ 9839.008069] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 9839.008069] Workqueue: rpciod rpc_async_schedule [sunrpc]
[ 9839.008069] task: ffff8800d8b4d8e0 ti: ffff880036678000 task.ti: ffff880036678000
[ 9839.008069] RIP: 0010:[] [] reserve_space.part.73+0x9/0x10 [nfsv4]
[ 9839.008069] RSP: 0018:ffff88003667ba58 EFLAGS: 00010246
[ 9839.008069] RAX: 0000000000000000 RBX: 000000001fc15e18 RCX: ffff8800c0193800
[ 9839.008069] RDX: ffff8800e4ae3f24 RSI: 000000001fc15e2c RDI: ffff88003667bcd0
[ 9839.008069] RBP: ffff88003667ba58 R08: ffff8800d9173008 R09: 0000000000000003
[ 9839.008069] R10: ffff88003667bcd0 R11: 000000000000000c R12: 0000000000010000
[ 9839.008069] R13: ffff8800d9173350 R14: 0000000000000000 R15: ffff8800c0067b98
[ 9839.008069] FS: 0000000000000000(0000) GS:ffff88011fc00000(0000) knlGS:0000000000000000
[ 9839.008069] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9839.008069] CR2: 00007f988c9c8bb0 CR3: 00000000d99b6000 CR4: 00000000000407f0
[ 9839.008069] Stack:
[ 9839.008069] ffff88003667bbc8 ffffffffa03412c5 00000000c6c55680 ffff880000000003
[ 9839.008069] 0000000000000088 00000010c6c55680 0001000000000002 ffffffff816e87e9
[ 9839.008069] 0000000000000000 00000000477290e2 ffff88003667bab8 ffffffff81327ba3
[ 9839.008069] Call Trace:
[ 9839.008069] [] encode_attrs+0x435/0x530 [nfsv4]
[ 9839.008069] [] ? inet_sendmsg+0x69/0xb0
[ 9839.008069] [] ? selinux_socket_sendmsg+0x23/0x30
[ 9839.008069] [] ? do_sock_sendmsg+0x9f/0xc0
[ 9839.008069] [] ? kernel_sendmsg+0x58/0x70
[ 9839.008069] [] ? xdr_reserve_space+0x20/0x170 [sunrpc]
[ 9839.008069] [] ? xdr_reserve_space+0x20/0x170 [sunrpc]
[ 9839.008069] [] ? nfs4_xdr_enc_open_noattr+0x130/0x130 [nfsv4]
[ 9839.008069] [] encode_open+0x2d5/0x340 [nfsv4]
[ 9839.008069] [] ? nfs4_xdr_enc_open_noattr+0x130/0x130 [nfsv4]
[ 9839.008069] [] ? xdr_encode_opaque+0x19/0x20 [sunrpc]
[ 9839.008069] [] ? encode_string+0x2b/0x40 [nfsv4]
[ 9839.008069] [] nfs4_xdr_enc_open+0xb3/0x140 [nfsv4]
[ 9839.008069] [] rpcauth_wrap_req+0xac/0xf0 [sunrpc]
[ 9839.008069] [] call_transmit+0x18b/0x2d0 [sunrpc]
[ 9839.008069] [] ? call_decode+0x860/0x860 [sunrpc]
[ 9839.008069] [] ? call_decode+0x860/0x860 [sunrpc]
[ 9839.008069] [] __rpc_execute+0x90/0x460 [sunrpc]
[ 9839.008069] [] rpc_async_schedule+0x15/0x20 [sunrpc]
[ 9839.008069] [] process_one_work+0x1bb/0x410
[ 9839.008069] [] worker_thread+0x53/0x470
[ 9839.008069] [] ? process_one_work+0x410/0x410
[ 9839.008069] [] ? process_one_work+0x410/0x410
[ 9839.008069] [] kthread+0xd8/0xf0
[ 9839.008069] [] ? kthread_worker_fn+0x180/0x180
[ 9839.008069] [] ret_from_fork+0x58/0x90
[ 9839.008069] [] ? kthread_worker_fn+0x180/0x180
[ 9839.008069] Code: 00 00 48 c7 c7 21 fa 37 a0 e8 94 1c d6 e0 c6 05 d2 17 05 00 01 8b 03 eb d7 66 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 89 e5 0b 0f 1f 44 00 00 66 66 66 66 90 55 48 89 e5 41 54 53 89 f3
[ 9839.008069] RIP [] reserve_space.part.73+0x9/0x10 [nfsv4]
[ 9839.008069] RSP
[ 9839.071114] ---[ end trace cc14c03adb522e94 ]---

Signed-off-by: Kinglong Mee
Signed-off-by: Trond Myklebust

Kinglong Mee
2015-07-28 21:07:03 +0800
3471648a7 nfs: plug memory leak when ->prepare_layoutcommit fails ... Browse Code »

"data" is currently leaked when the prepare_layoutcommit operation
returns an error. Put the cred before taking the spinlock in that
case, take the lock and then goto out_unlock which will drop the
lock and then free "data".

Signed-off-by: Jeff Layton
Signed-off-by: Trond Myklebust

Jeff Layton
2015-07-28 21:07:02 +0800

27 Jul, 2015

3 commits

bdcc2cd14 NFSv4.2: handle NFS-specific llseek errors ... Browse Code »

Handle NFS-specific llseek errors instead of letting them leak out to
userspace.

Reported-by: Benjamin Coddington
Signed-off-by: J. Bruce Fields
Signed-off-by: Trond Myklebust

J. Bruce Fields
2015-07-27 23:16:25 +0800
d4c30454d NFS: Don't clear desc->pg_moreio in nfs_do_recoalesce() ... Browse Code »

Recoalescing does not affect whether or not we've already sent off
I/O, and doing so means that we end up sending a bunch of synchronous
for cases where we actually need to be using unstable writes.

Signed-off-by: Trond Myklebust

Trond Myklebust
2015-07-27 22:33:12 +0800
03d5eb65b NFS: Fix a memory leak in nfs_do_recoalesce ... Browse Code »

If the function exits early, then we must put those requests that were
not processed back onto the &mirror->pg_list so they can be cleaned up
by nfs_pgio_error().

Fixes: a7d42ddb30997 ("nfs: add mirroring support to pgio layer")
Cc: stable@vger.kernel.org # v4.0+
Signed-off-by: Trond Myklebust

Trond Myklebust
2015-07-27 22:33:08 +0800

25 Jul, 2015

3 commits

6282adbf9 f2fs: call set_page_dirty to attach i_wb for cgroup ... Browse Code »

The cgroup attaches inode->i_wb via mark_inode_dirty and when set_page_writeback
is called, __inc_wb_stat() updates i_wb's stat.

So, we need to explicitly call set_page_dirty->__mark_inode_dirty in prior to
any writebacking pages.

This patch should resolve the following kernel panic reported by Andreas Reis.

https://bugzilla.kernel.org/show_bug.cgi?id=101801

--- Comment #2 from Andreas Reis ---
BUG: unable to handle kernel NULL pointer dereference at 00000000000000a8
IP: [] __percpu_counter_add+0x1a/0x90
PGD 2951ff067 PUD 2df43f067 PMD 0
Oops: 0000 [#1] PREEMPT SMP
Modules linked in:
CPU: 7 PID: 10356 Comm: gcc Tainted: G W 4.2.0-1-cu #1
Hardware name: Gigabyte Technology Co., Ltd. G1.Sniper M5/G1.Sniper M5, BIOS
T01 02/03/2015
task: ffff880295044f80 ti: ffff880295140000 task.ti: ffff880295140000
RIP: 0010:[] []
__percpu_counter_add+0x1a/0x90
RSP: 0018:ffff880295143ac8 EFLAGS: 00010082
RAX: 0000000000000003 RBX: ffffea000a526d40 RCX: 0000000000000001
RDX: 0000000000000020 RSI: 0000000000000001 RDI: 0000000000000088
RBP: ffff880295143ae8 R08: 0000000000000000 R09: ffff88008f69bb30
R10: 00000000fffffffa R11: 0000000000000000 R12: 0000000000000088
R13: 0000000000000001 R14: ffff88041d099000 R15: ffff880084a205d0
FS: 00007f8549374700(0000) GS:ffff88042f3c0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000000a8 CR3: 000000033e1d5000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Stack:
0000000000000000 ffffea000a526d40 ffff880084a20738 ffff880084a20750
ffff880295143b48 ffffffff811cc91e ffff880000000000 0000000000000296
0000000000000000 ffff880417090198 0000000000000000 ffffea000a526d40
Call Trace:
[] __test_set_page_writeback+0xde/0x1d0
[] do_write_data_page+0xe7/0x3a0
[] gc_data_segment+0x5aa/0x640
[] do_garbage_collect+0x138/0x150
[] f2fs_gc+0x1be/0x3e0
[] f2fs_balance_fs+0x81/0x90
[] f2fs_unlink+0x47/0x1d0
[] vfs_unlink+0x109/0x1b0
[] do_unlinkat+0x287/0x2c0
[] SyS_unlink+0x16/0x20
[] entry_SYSCALL_64_fastpath+0x12/0x71
Code: 41 5e 5d c3 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 55 49
89 f5 41 54 49 89 fc 53 48 83 ec 08 65 ff 05 e6 d9 b6 7e 8b 47 20 48 63 ca
65 8b 18 48 63 db 48 01 f3 48 39 cb 7d 0a
RIP [] __percpu_counter_add+0x1a/0x90
RSP
CR2: 00000000000000a8
---[ end trace 5132449a58ed93a3 ]---
note: gcc[10356] exited with preempt_count 2

Signed-off-by: Jaegeuk Kim

Jaegeuk Kim
2015-07-25 23:54:26 +0800
548aedac5 f2fs: handle error cases in move_encrypted_block ... Browse Code »

This patch fixes some missing error handlers.

Reviewed-by: Chao Yu
Signed-off-by: Jaegeuk Kim

Jaegeuk Kim
2015-07-25 23:54:25 +0800
33b40178c Merge branch 'for-linus' of git://git.kernel.dk/linux-block ... Browse Code »

Pull block fixes from Jens Axboe:
"Four smaller fixes for the current series. This contains:

- A fix for clones of discard bio's, that can cause data corruption.
From Martin.

- A fix for null_blk, where in certain queue modes it could access a
request after it had been freed. From Mike Krinkin.

- An error handling leak fix for blkcg, from Tejun.

- Also from Tejun, export of the functions that a file system needs
to implement cgroup writeback support"

* 'for-linus' of git://git.kernel.dk/linux-block:
block: Do a full clone when splitting discard bios
block: export bio_associate_*() and wbc_account_io()
blkcg: fix gendisk reference leak in blkg_conf_prep()
null_blk: fix use-after-free problem

Linus Torvalds
2015-07-25 08:00:04 +0800

24 Jul, 2015

3 commits

45b4b782e Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace ... Browse Code »

Pull namespace fixes from Eric Biederman:
"While reading through the code of detach_mounts I realized the code
was slightly off. Testing it revealed two buggy corner cases that can
send the code of detach_mounts into an infinite loop.

Fixing the code to do the right thing removes the possibility of these
user triggered infinite loops in the code"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
mnt: In detach_mounts detach the appropriate unmounted mount
mnt: Clarify and correct the disconnect logic in umount_tree

Linus Torvalds
2015-07-24 04:16:21 +0800
5aa2a96b3 block: export bio_associate_*() and wbc_account_io() ... Browse Code »

bio_associate_blkcg(), bio_associate_current() and wbc_account_io()
are used to implement cgroup writeback support for filesystems and
thus need to be exported. Export them.

Signed-off-by: Tejun Heo
Reported-by: Stephen Rothwell
Signed-off-by: Jens Axboe

Tejun Heo
2015-07-24 03:36:44 +0800
fe78fcc85 mnt: In detach_mounts detach the appropriate unmounted mount ... Browse Code »

The handling of in detach_mounts of unmounted but connected mounts is
buggy and can lead to an infinite loop.

Correct the handling of unmounted mounts in detach_mount. When the
mountpoint of an unmounted but connected mount is connected to a
dentry, and that dentry is deleted we need to disconnect that mount
from the parent mount and the deleted dentry.

Nothing changes for the unmounted and connected children. They can be
safely ignored.

Cc: stable@vger.kernel.org
Fixes: ce07d891a0891d3c0d0c2d73d577490486b809e1 mnt: Honor MNT_LOCKED when detaching mounts
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2015-07-24 00:31:15 +0800

23 Jul, 2015

4 commits

f2d0a123b mnt: Clarify and correct the disconnect logic in umount_tree ... Browse Code »

rmdir mntpoint will result in an infinite loop when there is
a mount locked on the mountpoint in another mount namespace.

This is because the logic to test to see if a mount should
be disconnected in umount_tree is buggy.

Move the logic to decide if a mount should remain connected to
it's mountpoint into it's own function disconnect_mount so that
clarity of expression instead of terseness of expression becomes
a virtue.

When the conditions where it is invalid to leave a mount connected
are first ruled out, the logic for deciding if a mount should
be disconnected becomes much clearer and simpler.

Fixes: e0c9c0afd2fc958ffa34b697972721d81df8a56f mnt: Update detach_mounts to leave mounts connected
Fixes: ce07d891a0891d3c0d0c2d73d577490486b809e1 mnt: Honor MNT_LOCKED when detaching mounts
Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2015-07-23 09:33:27 +0800
00d80e342 Btrfs: fix quick exhaustion of the system array in the superblock ... Browse Code »

Omar reported that after commit 4fbcdf669454 ("Btrfs: fix -ENOSPC when
finishing block group creation"), introduced in 4.2-rc1, the following
test was failing due to exhaustion of the system array in the superblock:

#!/bin/bash

truncate -s 100T big.img
mkfs.btrfs big.img
mount -o loop big.img /mnt/loop

num=5
sz=10T
for ((i = 0; i < $num; i++)); do
echo fallocate $i $sz
fallocate -l $sz /mnt/loop/testfile$i
done
btrfs filesystem sync /mnt/loop

for ((i = 0; i < $num; i++)); do
echo rm $i
rm /mnt/loop/testfile$i
btrfs filesystem sync /mnt/loop
done
umount /mnt/loop

This made btrfs_add_system_chunk() fail with -EFBIG due to excessive
allocation of system block groups. This happened because the test creates
a large number of data block groups per transaction and when committing
the transaction we start the writeout of the block group caches for all
the new new (dirty) block groups, which results in pre-allocating space
for each block group's free space cache using the same transaction handle.
That in turn often leads to creation of more block groups, and all get
attached to the new_bgs list of the same transaction handle to the point
of getting a list with over 1500 elements, and creation of new block groups
leads to the need of reserving space in the chunk block reserve and often
creating a new system block group too.

So that made us quickly exhaust the chunk block reserve/system space info,
because as of the commit mentioned before, we do reserve space for each
new block group in the chunk block reserve, unlike before where we would
not and would at most allocate one new system block group and therefore
would only ensure that there was enough space in the system space info to
allocate 1 new block group even if we ended up allocating thousands of
new block groups using the same transaction handle. That worked most of
the time because the computed required space at check_system_chunk() is
very pessimistic (assumes a chunk tree height of BTRFS_MAX_LEVEL/8 and
that all nodes/leafs in a path will be COWed and split) and since the
updates to the chunk tree all happen at btrfs_create_pending_block_groups
it is unlikely that a path needs to be COWed more than once (unless
writepages() for the btree inode is called by mm in between) and that
compensated for the need of creating any new nodes/leads in the chunk
tree.

So fix this by ensuring we don't accumulate a too large list of new block
groups in a transaction's handles new_bgs list, inserting/updating the
chunk tree for all accumulated new block groups and releasing the unused
space from the chunk block reserve whenever the list becomes sufficiently
large. This is a generic solution even though the problem currently can
only happen when starting the writeout of the free space caches for all
dirty block groups (btrfs_start_dirty_block_groups()).

Reported-by: Omar Sandoval
Signed-off-by: Filipe Manana
Tested-by: Omar Sandoval
Signed-off-by: Chris Mason

Filipe Manana
2015-07-23 09:20:54 +0800
3e303ea60 btrfs: its btrfs_err() instead of btrfs_error() ... Browse Code »

sorry I indented to use btrfs_err() and I have no idea
how btrfs_error() got there.
infact I was thinking about these kind of oversights
since these two func are too closely named.

Signed-off-by: Anand Jain
Reviewed-by: Liu Bo
Reviewed-by: David Sterba
Signed-off-by: Chris Mason

Anand Jain
2015-07-23 09:20:53 +0800
95ab1f649 btrfs: Avoid NULL pointer dereference of free_extent_buffer when read_tree_block() fail ... Browse Code »

When read_tree_block() failed, we can see following dmesg:
[ 134.371389] BUG: unable to handle kernel NULL pointer dereference at 0000000000000063
[ 134.372236] IP: [] free_extent_buffer+0x21/0x90
[ 134.372236] PGD 0
[ 134.372236] Oops: 0000 [#1] SMP
[ 134.372236] Modules linked in:
[ 134.372236] CPU: 0 PID: 2289 Comm: mount Not tainted 4.2.0-rc1_HEAD_c65b99f046843d2455aa231747b5a07a999a9f3d_+ #115
[ 134.372236] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5.1-0-g8936dbb-20141113_115728-nilsson.home.kraxel.org 04/01/2014
[ 134.372236] task: ffff88003b6e1a00 ti: ffff880011e60000 task.ti: ffff880011e60000
[ 134.372236] RIP: 0010:[] [] free_extent_buffer+0x21/0x90
...
[ 134.372236] Call Trace:
[ 134.372236] [] free_root_extent_buffers+0x91/0xb0
[ 134.372236] [] free_root_pointers+0x17d/0x190
[ 134.372236] [] open_ctree+0x1ca0/0x25b0
[ 134.372236] [] ? disk_name+0x97/0xb0
[ 134.372236] [] btrfs_mount+0x8fa/0xab0
...

Reason:
read_tree_block() changed to return error number on fail,
and this value(not NULL) is set to tree_root->node, then subsequent
code will run to:
free_root_pointers()
->free_root_extent_buffers()
->free_extent_buffer()
->atomic_read((extent_buffer *)(-E_XXX)->refs);
and trigger above error.

Fix:
Set tree_root->node to NULL on fail to make error_handle code
happy.

Signed-off-by: Zhao Lei
Signed-off-by: Chris Mason

Zhao Lei
2015-07-23 09:20:52 +0800