Eric Lee / smarc-ti-linux-kernel | Embedian Git Server

11 Sep, 2014

1 commit

7ec62d421 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs ... Browse Code »

Pull UDF fixes from Jan Kara:
"Fixes for UDF handling of NFS handles and one fix for proper handling
of corrupted media"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
udf: saner calling conventions for udf_new_inode()
udf: fix the udf_iget() vs. udf_new_inode() races
udf: merge the pieces inserting a new non-directory object into directory
udf: Set i_generation field
udf: Properly detect stale inodes
udf: Make udf_read_inode() and udf_iget() return error
udf: Avoid infinite loop when processing indirect ICBs
udf: Fold udf_fill_inode() into __udf_read_inode()
udf: Avoid dir link count to go negative

Linus Torvalds
2014-09-11 05:04:17 +0800

10 Sep, 2014

1 commit

e874a5fe3 Merge branch 'for-next-3.17' of git://git.samba.org/sfrench/cifs-2.6 ... Browse Code »

Pull cifs/smb3 fixes from Steve French:
"This includes various cifs and smb3 bug fixes including those for bugs
found with the recently updated xfstests.

Also I am working fixes for two additional cifs problems found by
xfstests which I plan to send later (when reviewed and run additional
tests)"

* 'for-next-3.17' of git://git.samba.org/sfrench/cifs-2.6:
Clarify Kconfig help text for CIFS and SMB2/SMB3
CIFS: Fix wrong filename length for SMB2
CIFS: Fix wrong restart readdir for SMB1
CIFS: Fix directory rename error
cifs: No need to send SIGKILL to demux_thread during umount
cifs: Allow directIO read/write during cache=strict
cifs: remove unneeded check of null checking in if condition
cifs: fix a possible use of uninit variable in SMB2_sess_setup
cifs: fix memory leak when password is supplied multiple times
cifs: fix a possible null pointer deref in decode_ascii_ssetup
Trivial whitespace fix

Linus Torvalds
2014-09-10 08:00:43 +0800

09 Sep, 2014

4 commits

8c68face5 Merge branch 'for_linus_urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 ... Browse Code »

Pull ext4 bugfix from Ted Ts'o.

[ Hmm. It's possible we should make kfree() aware of error pointers,
and use IS_ERR_OR_NULL rather than a NULL check. But in the meantime
this is obviously the right fix. - Linus ]

* 'for_linus_urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: avoid trying to kfree an ERR_PTR pointer

Linus Torvalds
2014-09-09 06:51:01 +0800
861b7102b Merge branch 'for-3.17' of git://linux-nfs.org/~bfields/linux ... Browse Code »

Pull nfsd bugfixes from Bruce Fields:
"A couple minor nfsd bugfixes"

* 'for-3.17' of git://linux-nfs.org/~bfields/linux:
lockd: fix rpcbind crash on lockd startup failure
nfsd4: fix rd_dircount enforcement

Linus Torvalds
2014-09-09 06:18:06 +0800
7c17705e7 lockd: fix rpcbind crash on lockd startup failure ... Browse Code »
5

Nikita Yuschenko reported that booting a kernel with init=/bin/sh and
then nfs mounting without portmap or rpcbind running using a busybox
mount resulted in:

# mount -t nfs 10.30.130.21:/opt /mnt
svc: failed to register lockdv1 RPC service (errno 111).
lockd_up: makesock failed, error=-111
Unable to handle kernel paging request for data at address 0x00000030
Faulting instruction address: 0xc055e65c
Oops: Kernel access of bad area, sig: 11 [#1]
MPC85xx CDS
Modules linked in:
CPU: 0 PID: 1338 Comm: mount Not tainted 3.10.44.cge #117
task: cf29cea0 ti: cf35c000 task.ti: cf35c000
NIP: c055e65c LR: c0566490 CTR: c055e648
REGS: cf35dad0 TRAP: 0300 Not tainted (3.10.44.cge)
MSR: 00029000 CR: 22442488 XER: 20000000
DEAR: 00000030, ESR: 00000000

GPR00: c05606f4 cf35db80 cf29cea0 cf0ded80 cf0dedb8 00000001 1dec3086
00000000
GPR08: 00000000 c07b1640 00000007 1dec3086 22442482 100b9758 00000000
10090ae8
GPR16: 00000000 000186a5 00000000 00000000 100c3018 bfa46edc 100b0000
bfa46ef0
GPR24: cf386ae0 c07834f0 00000000 c0565f88 00000001 cf0dedb8 00000000
cf0ded80
NIP [c055e65c] call_start+0x14/0x34
LR [c0566490] __rpc_execute+0x70/0x250
Call Trace:
[cf35db80] [00000080] 0x80 (unreliable)
[cf35dbb0] [c05606f4] rpc_run_task+0x9c/0xc4
[cf35dbc0] [c0560840] rpc_call_sync+0x50/0xb8
[cf35dbf0] [c056ee90] rpcb_register_call+0x54/0x84
[cf35dc10] [c056f24c] rpcb_register+0xf8/0x10c
[cf35dc70] [c0569e18] svc_unregister.isra.23+0x100/0x108
[cf35dc90] [c0569e38] svc_rpcb_cleanup+0x18/0x30
[cf35dca0] [c0198c5c] lockd_up+0x1dc/0x2e0
[cf35dcd0] [c0195348] nlmclnt_init+0x2c/0xc8
[cf35dcf0] [c015bb5c] nfs_start_lockd+0x98/0xec
[cf35dd20] [c015ce6c] nfs_create_server+0x1e8/0x3f4
[cf35dd90] [c0171590] nfs3_create_server+0x10/0x44
[cf35dda0] [c016528c] nfs_try_mount+0x158/0x1e4
[cf35de20] [c01670d0] nfs_fs_mount+0x434/0x8c8
[cf35de70] [c00cd3bc] mount_fs+0x20/0xbc
[cf35de90] [c00e4f88] vfs_kern_mount+0x50/0x104
[cf35dec0] [c00e6e0c] do_mount+0x1d0/0x8e0
[cf35df10] [c00e75ac] SyS_mount+0x90/0xd0
[cf35df40] [c000ccf4] ret_from_syscall+0x0/0x3c

The addition of svc_shutdown_net() resulted in two calls to
svc_rpcb_cleanup(); the second is no longer necessary and crashes when
it calls rpcb_register_call with clnt=NULL.

Reported-by: Nikita Yushchenko
Fixes: 679b033df484 "lockd: ensure we tear down any live sockets when socket creation fails during lockd_up"
Cc: stable@vger.kernel.org
Acked-by: Jeff Layton
Signed-off-by: J. Bruce Fields

J. Bruce Fields
2014-09-09 00:03:32 +0800
aee377644 nfsd4: fix rd_dircount enforcement ... Browse Code »

Commit 3b299709091b "nfsd4: enforce rd_dircount" totally misunderstood
rd_dircount; it refers to total non-attribute bytes returned, not number
of directory entries returned.

Bring the code into agreement with RFC 3530 section 14.2.24.

Cc: stable@vger.kernel.org
Fixes: 3b299709091b "nfsd4: enforce rd_dircount"
Signed-off-by: J. Bruce Fields

J. Bruce Fields
2014-09-09 00:02:03 +0800

08 Sep, 2014

2 commits

9142eadef Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull filesystem fixes from Al Viro:
"Several bugfixes (all of them -stable fodder).

Alexey's one deals with double mutex_lock() in UFS (apparently, nobody
has tried to test "ufs: sb mutex merge + mutex_destroy" on something
like file creation/removal on ufs). Mine deal with two kinds of
umount bugs, in umount propagation and in handling of automounted
submounts, both resulting in bogus transient EBUSY from umount"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
ufs: fix deadlocks introduced by sb mutex merge
fix EBUSY on umount() from MNT_SHRINKABLE
get rid of propagate_umount() mistakenly treating slaves as busy.

Linus Torvalds
2014-09-08 01:59:58 +0800
9ef7db7f3 ufs: fix deadlocks introduced by sb mutex merge ... Browse Code »
13

Commit 0244756edc4b ("ufs: sb mutex merge + mutex_destroy") introduces
deadlocks in ufs_new_inode() and ufs_free_inode().
Most callers of that functions acqure the mutex by themselves and
ufs_{new,free}_inode() do that via lock_ufs(),
i.e we have an unavoidable double lock.

The patch proposes to resolve the issue by making sure that
ufs_{new,free}_inode() are not called with the mutex held.

Found by Linux Driver Verification project (linuxtesting.org).

Cc: stable@vger.kernel.org # 3.16
Signed-off-by: Alexey Khoroshilov
Signed-off-by: Al Viro

Alexey Khoroshilov
2014-09-08 01:26:39 +0800

07 Sep, 2014

1 commit

11e973981 Merge tag 'xfs-for-linus-3.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs ... Browse Code »

Pull xfs fixes from Dave Chinner:
"The fixes all address recently discovered data corruption issues.

The original Direct IO issue was discovered by Chris Mason @ Facebook
on a production workload which mixed buffered reads with direct reads
and writes IO to the same file. The fix for that exposed other issues
with page invalidation (exposed by millions of fsx operations) failing
due to dirty buffers beyond EOF.

Finally, the collapse_range code could also cause problems due to
racing writeback changing the extent map while it was being shifted
around. The commits for that problem are simple mitigation fixes that
prevent the problem from occuring. A more robust fix for 3.18 that
addresses the underlying problem is currently being worked on by
Brian.

Summary of fixes:
- a direct IO read/buffered read data corruption
- the associated fallout from the DIO data corruption fix
- collapse range bugs that are potential data corruption issues"

* tag 'xfs-for-linus-3.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
xfs: trim eofblocks before collapse range
xfs: xfs_file_collapse_range is delalloc challenged
xfs: don't log inode unless extent shift makes extent modifications
xfs: use ranged writeback and invalidation for direct IO
xfs: don't zero partial page cache pages during O_DIRECT writes
xfs: don't zero partial page cache pages during O_DIRECT writes
xfs: don't dirty buffers beyond EOF

Linus Torvalds
2014-09-07 03:13:17 +0800

05 Sep, 2014

9 commits

10096fb10 Export sync_filesystem() for modular ->remount_fs() use ... Browse Code »

This patch changes sync_filesystem() to be EXPORT_SYMBOL().

The reason this is needed is that starting with 3.15 kernel, due to
Theodore Ts'o's commit 02b9984d6408 ("fs: push sync_filesystem() down to
the file system's remount_fs()"), all file systems that have dirty data
to be written out need to call sync_filesystem() from their
->remount_fs() method when remounting read-only.

As this is now a generically required function rather than an internal
only function it should be EXPORT_SYMBOL() so that all file systems can
call it.

Signed-off-by: Anton Altaparmakov
Acked-by: Andrew Morton
Signed-off-by: Linus Torvalds

Anton Altaparmakov
2014-09-05 23:16:21 +0800
b7fece1be Merge git://git.kvack.org/~bcrl/aio-fixes ... Browse Code »

Pull aio bugfixes from Ben LaHaise:
"Two small fixes"

* git://git.kvack.org/~bcrl/aio-fixes:
aio: block exit_aio() until all context requests are completed
aio: add missing smp_rmb() in read_events_ring

Linus Torvalds
2014-09-05 07:08:55 +0800
6098b45b3 aio: block exit_aio() until all context requests are completed ... Browse Code »
5

It seems that exit_aio() also needs to wait for all iocbs to complete (like
io_destroy), but we missed the wait step in current implemention, so fix
it in the same way as we did in io_destroy.

Signed-off-by: Gu Zheng
Signed-off-by: Benjamin LaHaise
Cc: stable@vger.kernel.org

Gu Zheng
2014-09-05 04:54:47 +0800
0b93a92be udf: saner calling conventions for udf_new_inode() ... Browse Code »

Signed-off-by: Al Viro
Signed-off-by: Jan Kara

Al Viro
2014-09-05 03:37:41 +0800
b23150961 udf: fix the udf_iget() vs. udf_new_inode() races ... Browse Code »

Currently udf_iget() (triggered by NFS) can race with udf_new_inode()
leading to two inode structures with the same inode number:

nfsd: iget_locked() creates inode
nfsd: try to read from disk, block on that.
udf_new_inode(): allocate inode with that inumber
udf_new_inode(): insert it into icache, set it up and dirty
udf_write_inode(): write inode into buffer cache
nfsd: get CPU again, look into buffer cache, see nice and sane on-disk
inode, set the in-core inode from it

Fix the problem by putting inode into icache in locked state (I_NEW set)
and unlocking it only after it's fully set up.

Signed-off-by: Al Viro
Signed-off-by: Jan Kara

Al Viro
2014-09-05 03:37:41 +0800
d2be51cb3 udf: merge the pieces inserting a new non-directory object into directory ... Browse Code »

boilerplate code in udf_{create,mknod,symlink} taken to new helper

symlink case converted to unique id calculated by udf_new_inode() - no
point finding a new one.

Signed-off-by: Al Viro
Signed-off-by: Jan Kara

Al Viro
2014-09-05 03:37:40 +0800
470cca56c udf: Set i_generation field ... Browse Code »

Currently UDF doesn't initialize i_generation in any way and thus NFS
can easily get reallocated inodes from stale file handles. Luckily UDF
already has a unique object identifier associated with each inode -
i_unique. Use that for initialization of i_generation.

Signed-off-by: Jan Kara

Jan Kara
2014-09-05 03:37:40 +0800
4071b9136 udf: Properly detect stale inodes ... Browse Code »
26

NFS can easily ask for inodes that are already deleted. Currently UDF
happily returns such inodes which is a bug. Return -ESTALE if
udf_read_inode() is asked to read deleted inode.

Signed-off-by: Jan Kara

Jan Kara
2014-09-05 03:37:39 +0800
6d3d5e860 udf: Make udf_read_inode() and udf_iget() return error ... Browse Code »

Currently __udf_read_inode() wasn't returning anything and we found out
whether we succeeded reading inode by checking whether inode is bad or
not. udf_iget() returned NULL on failure and inode pointer otherwise.
Make these two functions properly propagate errors up the call stack and
use the return value in callers.

Signed-off-by: Jan Kara

Jan Kara
2014-09-05 03:36:35 +0800

04 Sep, 2014

4 commits

c03aa9f6e udf: Avoid infinite loop when processing indirect ICBs ... Browse Code »
5

We did not implement any bound on number of indirect ICBs we follow when
loading inode. Thus corrupted medium could cause kernel to go into an
infinite loop, possibly causing a stack overflow.

Fix the possible stack overflow by removing recursion from
__udf_read_inode() and limit number of indirect ICBs we follow to avoid
infinite loops.

Signed-off-by: Jan Kara

Jan Kara
2014-09-04 20:12:29 +0800
bb7720a0b udf: Fold udf_fill_inode() into __udf_read_inode() ... Browse Code »

There's no good reason to separate these since udf_fill_inode() is
called only from __udf_read_inode() and both do part of the same thing.

Signed-off-by: Jan Kara

Jan Kara
2014-09-04 19:32:50 +0800
8a70ee330 udf: Avoid dir link count to go negative ... Browse Code »

If we are writing back inode of unlinked directory, its link count ends
up being (u16)-1. Although the inode is deleted, udf_iget() can load the
inode when NFS uses stale file handle and get confused.

Signed-off-by: Jan Kara

Jan Kara
2014-09-04 17:47:51 +0800
70c8038dd Merge tag 'for-f2fs-3.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs ... Browse Code »

Pull f2fs bug fixes from Jaegeuk Kim:
"This series includes patches to:

- fix recovery routines
- fix bugs related to inline_data/xattr
- fix when casting the dentry names
- handle EIO or ENOMEM correctly
- fix memory leak
- fix lock coverage"

* tag 'for-f2fs-3.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (28 commits)
f2fs: reposition unlock_new_inode to prevent accessing invalid inode
f2fs: fix wrong casting for dentry name
f2fs: simplify by using a literal
f2fs: truncate stale block for inline_data
f2fs: use macro for code readability
f2fs: introduce need_do_checkpoint for readability
f2fs: fix incorrect calculation with total/free inode num
f2fs: remove rename and use rename2
f2fs: skip if inline_data was converted already
f2fs: remove rewrite_node_page
f2fs: avoid double lock in truncate_blocks
f2fs: prevent checkpoint during roll-forward
f2fs: add WARN_ON in f2fs_bug_on
f2fs: handle EIO not to break fs consistency
f2fs: check s_dirty under cp_mutex
f2fs: unlock_page when node page is redirtied out
f2fs: introduce f2fs_cp_error for readability
f2fs: give a chance to mount again when encountering errors
f2fs: trigger release_dirty_inode in f2fs_put_super
f2fs: don't skip checkpoint if there is no dirty node pages
...

Linus Torvalds
2014-09-04 01:10:28 +0800

03 Sep, 2014

2 commits

a9cfcd63e ext4: avoid trying to kfree an ERR_PTR pointer ... Browse Code »

Thanks to Dan Carpenter for extending smatch to find bugs like this.
(This was found using a development version of smatch.)

Fixes: 36de928641ee48b2078d3fe9514242aaa2f92013
Reported-by: Dan Carpenter
Cc: stable@vger.kernel.org

Theodore Ts'o
2014-09-03 21:37:30 +0800
2ff396be6 aio: add missing smp_rmb() in read_events_ring ... Browse Code »
6

We ran into a case on ppc64 running mariadb where io_getevents would
return zeroed out I/O events. After adding instrumentation, it became
clear that there was some missing synchronization between reading the
tail pointer and the events themselves. This small patch fixes the
problem in testing.

Thanks to Zach for helping to look into this, and suggesting the fix.

Signed-off-by: Jeff Moyer
Signed-off-by: Benjamin LaHaise
Cc: stable@vger.kernel.org

Jeff Moyer
2014-09-03 03:20:03 +0800

02 Sep, 2014

8 commits

b73e52824 f2fs: reposition unlock_new_inode to prevent accessing invalid inode ... Browse Code »

As the race condition on the inode cache, following scenario can appear:
[Thread a] [Thread b]
->f2fs_mkdir
->f2fs_add_link
->__f2fs_add_link
->init_inode_metadata failed here
->gc_thread_func
->f2fs_gc
->do_garbage_collect
->gc_data_segment
->f2fs_iget
->iget_locked
->wait_on_inode
->unlock_new_inode
->move_data_page
->make_bad_inode
->iput

When we fail in create/symlink/mkdir/mknod/tmpfile, the new allocated inode
should be set as bad to avoid being accessed by other thread. But in above
scenario, it allows f2fs to access the invalid inode before this inode was set
as bad.
This patch fix the potential problem, and this issue was found by code review.

change log from v1:
o Add condition judgment in gc_data_segment() suggested by Changman Lee.
o use iget_failed to simplify code.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

Chao Yu
2014-09-02 15:22:24 +0800
41b9d7263 xfs: trim eofblocks before collapse range ... Browse Code »

xfs_collapse_file_space() currently writes back the entire file
undergoing collapse range to settle things down for the extent shift
algorithm. While this prevents changes to the extent list during the
collapse operation, the writeback itself is not enough to prevent
unnecessary collapse failures.

The current shift algorithm uses the extent index to iterate the in-core
extent list. If a post-eof delalloc extent persists after the writeback
(e.g., a prior zero range op where the end of the range aligns with eof
can separate the post-eof blocks such that they are not written back and
converted), xfs_bmap_shift_extents() becomes confused over the encoded
br_startblock value and fails the collapse.

As with the full writeback, this is a temporary fix until the algorithm
is improved to cope with a volatile extent list and avoid attempts to
shift post-eof extents.

Signed-off-by: Brian Foster
Reviewed-by: Dave Chinner
Reviewed-by: Christoph Hellwig
Signed-off-by: Dave Chinner

Brian Foster
2014-09-02 10:12:53 +0800
1669a8ca2 xfs: xfs_file_collapse_range is delalloc challenged ... Browse Code »

If we have delalloc extents on a file before we run a collapse range
opertaion, we sync the range that we are going to collapse to
convert delalloc extents in that region to real extents to simplify
the shift operation.

However, the shift operation then assumes that the extent list is
not going to change as it iterates over the extent list moving
things about. Unfortunately, this isn't true because we can't hold
the ILOCK over all the operations. We can prevent new IO from
modifying the extent list by holding the IOLOCK, but that doesn't
prevent writeback from running....

And when writeback runs, it can convert delalloc extents is the
range of the file prior to the region being collapsed, and this
changes the indexes of all the extents in the file. That causes the
collapse range operation to Go Bad.

The right fix is to rewrite the extent shift operation not to be
dependent on the extent list not changing across the entire
operation, but this is a fairly significant piece of work to do.
Hence, as a short-term workaround for the problem, sync the entire
file before starting a collapse operation to remove all delalloc
ranges from the file and so avoid the problem of concurrent
writeback changing the extent list.

Diagnosed-and-Reported-by: Brian Foster
Signed-off-by: Dave Chinner
Reviewed-by: Brian Foster
Reviewed-by: Christoph Hellwig
Signed-off-by: Dave Chinner

Dave Chinner
2014-09-02 10:12:53 +0800
ca446d880 xfs: don't log inode unless extent shift makes extent modifications ... Browse Code »

The file collapse mechanism uses xfs_bmap_shift_extents() to collapse
all subsequent extents down into the specified, previously punched out,
region. This function performs some validation, such as whether a
sufficient hole exists in the target region of the collapse, then shifts
the remaining exents downward.

The exit path of the function currently logs the inode unconditionally.
While we must log the inode (and abort) if an error occurs and the
transaction is dirty, the initial validation paths can generate errors
before the transaction has been dirtied. This creates an unnecessary
filesystem shutdown scenario, as the caller will cancel a transaction
that has been marked dirty.

Modify xfs_bmap_shift_extents() to OR the logflags bits as modifications
are made to the inode bmap. Only log the inode in the exit path if
logflags has been set. This ensures we only have to cancel a dirty
transaction if modifications have been made and prevents an unnecessary
filesystem shutdown otherwise.

Signed-off-by: Brian Foster
Reviewed-by: Dave Chinner
Reviewed-by: Christoph Hellwig
Signed-off-by: Dave Chinner

Brian Foster
2014-09-02 10:12:53 +0800
7d4ea3ce6 xfs: use ranged writeback and invalidation for direct IO ... Browse Code »
5

Now we are not doing silly things with dirtying buffers beyond EOF
and using invalidation correctly, we can finally reduce the ranges of
writeback and invalidation used by direct IO to match that of the IO
being issued.

Bring the writeback and invalidation ranges back to match the
generic direct IO code - this will greatly reduce the perturbation
of cached data when direct IO and buffered IO are mixed, but still
provide the same buffered vs direct IO coherency behaviour we
currently have.

Signed-off-by: Dave Chinner
Reviewed-by: Brian Foster
Reviewed-by: Christoph Hellwig
Signed-off-by: Dave Chinner

Dave Chinner
2014-09-02 10:12:53 +0800
834ffca6f xfs: don't zero partial page cache pages during O_DIRECT writes ... Browse Code »
6

Similar to direct IO reads, direct IO writes are using
truncate_pagecache_range to invalidate the page cache. This is
incorrect due to the sub-block zeroing in the page cache that
truncate_pagecache_range() triggers.

This patch fixes things by using invalidate_inode_pages2_range
instead. It preserves the page cache invalidation, but won't zero
any pages.

cc: stable@vger.kernel.org
Signed-off-by: Dave Chinner
Reviewed-by: Brian Foster
Reviewed-by: Christoph Hellwig
Signed-off-by: Dave Chinner

Dave Chinner
2014-09-02 10:12:52 +0800
85e584da3 xfs: don't zero partial page cache pages during O_DIRECT writes ... Browse Code »
6

xfs is using truncate_pagecache_range to invalidate the page cache
during DIO reads. This is different from the other filesystems who
only invalidate pages during DIO writes.

truncate_pagecache_range is meant to be used when we are freeing the
underlying data structs from disk, so it will zero any partial
ranges in the page. This means a DIO read can zero out part of the
page cache page, and it is possible the page will stay in cache.

buffered reads will find an up to date page with zeros instead of
the data actually on disk.

This patch fixes things by using invalidate_inode_pages2_range
instead. It preserves the page cache invalidation, but won't zero
any pages.

[dchinner: catch error and warn if it fails. Comment.]

cc: stable@vger.kernel.org
Signed-off-by: Chris Mason
Reviewed-by: Dave Chinner
Reviewed-by: Brian Foster
Reviewed-by: Christoph Hellwig
Signed-off-by: Dave Chinner

Chris Mason
2014-09-02 10:12:52 +0800
22e757a49 xfs: don't dirty buffers beyond EOF ... Browse Code »
6

generic/263 is failing fsx at this point with a page spanning
EOF that cannot be invalidated. The operations are:

1190 mapwrite 0x52c00 thru 0x5e569 (0xb96a bytes)
1191 mapread 0x5c000 thru 0x5d636 (0x1637 bytes)
1192 write 0x5b600 thru 0x771ff (0x1bc00 bytes)

where 1190 extents EOF from 0x54000 to 0x5e569. When the direct IO
write attempts to invalidate the cached page over this range, it
fails with -EBUSY and so any attempt to do page invalidation fails.

The real question is this: Why can't that page be invalidated after
it has been written to disk and cleaned?

Well, there's data on the first two buffers in the page (1k block
size, 4k page), but the third buffer on the page (i.e. beyond EOF)
is failing drop_buffers because it's bh->b_state == 0x3, which is
BH_Uptodate | BH_Dirty. IOWs, there's dirty buffers beyond EOF. Say
what?

OK, set_buffer_dirty() is called on all buffers from
__set_page_buffers_dirty(), regardless of whether the buffer is
beyond EOF or not, which means that when we get to ->writepage,
we have buffers marked dirty beyond EOF that we need to clean.
So, we need to implement our own .set_page_dirty method that
doesn't dirty buffers beyond EOF.

This is messy because the buffer code is not meant to be shared
and it has interesting locking issues on the buffer dirty bits.
So just copy and paste it and then modify it to suit what we need.

Note: the solutions the other filesystems and generic block code use
of marking the buffers clean in ->writepage does not work for XFS.
It still leaves dirty buffers beyond EOF and invalidations still
fail. Hence rather than play whack-a-mole, this patch simply
prevents those buffers from being dirtied in the first place.

cc:
Signed-off-by: Dave Chinner
Reviewed-by: Brian Foster
Signed-off-by: Dave Chinner

Dave Chinner
2014-09-02 10:12:51 +0800

31 Aug, 2014

3 commits

35e274458 Merge tag 'locks-v3.17-3' of git://git.samba.org/jlayton/linux ... Browse Code »

Pull file locking bugfx from Jeff Layton:
"Just a bugfix for a bug that crept in to v3.15. It's in a rather rare
error path, and I'm not aware of anyone having hit it, but it's worth
fixing for v3.17"

* tag 'locks-v3.17-3' of git://git.samba.org/jlayton/linux:
locks: pass correct "before" pointer to locks_unlink_lock in generic_add_lease

Linus Torvalds
2014-08-31 12:04:37 +0800
81b6b0619 fix EBUSY on umount() from MNT_SHRINKABLE ... Browse Code »
5

We need the parents of victims alive until namespace_unlock() gets to
dput() of the (ex-)mountpoints. However, that screws up the "is it
busy" checks in case when we have shrinkable mounts that need to be
killed. Solution: go ahead and decrement refcounts of parents right
in umount_tree(), increment them again just before dropping rwsem in
namespace_unlock() (and let the loop in the end of namespace_unlock()
finally drop those references for good, as we do now). Parents can't
get freed until we drop rwsem - at least one reference is kept until
then, both in case when parent is among the victims and when it is
not. So they'll still be around when we get to namespace_unlock().

Cc: stable@vger.kernel.org # 3.12+
Signed-off-by: Al Viro

Al Viro
2014-08-31 06:32:05 +0800
88b368f27 get rid of propagate_umount() mistakenly treating slaves as busy. ... Browse Code »
5

The check in __propagate_umount() ("has somebody explicitly mounted
something on that slave?") is done *before* taking the already doomed
victims out of the child lists.

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro

Al Viro
2014-08-31 06:31:41 +0800

30 Aug, 2014

5 commits

10f3291a1 Merge branch 'akpm' (fixes from Andrew Morton) ... Browse Code »

Merge patches from Andrew Morton:
"22 fixes"

* emailed patches from Andrew Morton : (22 commits)
kexec: purgatory: add clean-up for purgatory directory
Documentation/kdump/kdump.txt: add ARM description
flush_icache_range: export symbol to fix build errors
tools: selftests: fix build issue with make kselftests target
ocfs2: quorum: add a log for node not fenced
ocfs2: o2net: set tcp user timeout to max value
ocfs2: o2net: don't shutdown connection when idle timeout
ocfs2: do not write error flag to user structure we cannot copy from/to
x86/purgatory: use approprate -m64/-32 build flag for arch/x86/purgatory
drivers/rtc/rtc-s5m.c: re-add support for devices without irq specified
xattr: fix check for simultaneous glibc header inclusion
kexec: remove CONFIG_KEXEC dependency on crypto
kexec: create a new config option CONFIG_KEXEC_FILE for new syscall
x86,mm: fix pte_special versus pte_numa
hugetlb_cgroup: use lockdep_assert_held rather than spin_is_locked
mm/zpool: use prefixed module loading
zram: fix incorrect stat with failed_reads
lib: turn CONFIG_STACKTRACE into an actual option.
mm: actually clear pmd_numa before invalidating
memblock, memhotplug: fix wrong type in memblock_find_in_range_node().
...

Linus Torvalds
2014-08-30 07:28:29 +0800
8c7b638ce ocfs2: quorum: add a log for node not fenced ... Browse Code »

For debug use, we can see from the log whether the fence decision is
made and why it is not fenced.

Signed-off-by: Junxiao Bi
Reviewed-by: Srinivas Eeda
Reviewed-by: Mark Fasheh
Cc: Joel Becker
Cc: Joseph Qi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Junxiao Bi
2014-08-30 07:28:17 +0800
8e9801dfe ocfs2: o2net: set tcp user timeout to max value ... Browse Code »

When tcp retransmit timeout(15mins), the connection will be closed.
Pending messages may be lost during this time. So we set tcp user
timeout to override the retransmit timeout to the max value. This is OK
for ocfs2 since we have disk heartbeat, if peer crash, the disk
heartbeat will timeout and it will be evicted, if disk heartbeat not
timeout and connection idle for a long time, then this means the cluster
enters split-brain state, since fence can't happen, we'd better keep the
connection and wait network recover.

Signed-off-by: Junxiao Bi
Reviewed-by: Srinivas Eeda
Reviewed-by: Mark Fasheh
Cc: Joel Becker
Cc: Joseph Qi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Junxiao Bi
2014-08-30 07:28:16 +0800
c43c363de ocfs2: o2net: don't shutdown connection when idle timeout ... Browse Code »

This patch series is to fix a possible message lost bug in ocfs2 when
network go bad. This bug will cause ocfs2 hung forever even network
become good again.

The messages may lost in this case. After the tcp connection is
established between two nodes, an idle timer will be set to check its
state periodically, if no messages are received during this time, idle
timer will timeout, it will shutdown the connection and try to
reconnect, so pending messages in tcp queues will be lost. This
messages may be from dlm. Dlm may get hung in this case. This may
cause the whole ocfs2 cluster hung.

This is very possible to happen when network state goes bad. Do the
reconnect is useless, it will fail if network state is still bad. Just
waiting there for network recovering may be a good idea, it will not
lost messages and some node will be fenced until cluster goes into
split-brain state, for this case, Tcp user timeout is used to override
the tcp retransmit timeout. It will timeout after 25 days, user should
have notice this through the provided log and fix the network, if they
don't, ocfs2 will fall back to original reconnect way.

This patch (of 3):

Some messages in the tcp queue maybe lost if we shutdown the connection
and reconnect when idle timeout. If packets lost and reconnect success,
then the ocfs2 cluster maybe hung.

To fix this, we can leave the connection there and do the fence decision
when idle timeout, if network recover before fence dicision is made, the
connection survive without lost any messages.

This bug can be saw when network state go bad. It may cause ocfs2 hung
forever if some packets lost. With this fix, ocfs2 will recover from
hung if network becomes good again.

Signed-off-by: Junxiao Bi
Reviewed-by: Srinivas Eeda
Reviewed-by: Mark Fasheh
Cc: Joel Becker
Cc: Joseph Qi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Junxiao Bi
2014-08-30 07:28:16 +0800
2b462638e ocfs2: do not write error flag to user structure we cannot copy from/to ... Browse Code »

If we failed to copy from the structure, writing back the flags leaks 31
bits of kernel memory (the rest of the ir_flags field).

In any case, if we cannot copy from/to the structure, why should we
expect putting just the flags to work?

Also make sure ocfs2_info_handle_freeinode() returns the right error
code if the copy_to_user() fails.

Fixes: ddee5cdb70e6 ('Ocfs2: Add new OCFS2_IOC_INFO ioctl for ocfs2 v8.')
Signed-off-by: Ben Hutchings
Cc: Joel Becker
Acked-by: Mark Fasheh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ben Hutchings
2014-08-30 07:28:16 +0800