Eric Lee / smarc-fsl-linux-kernel

16 Aug, 2013

1 commit

2b047252d Fix TLB gather virtual address range invalidation corner cases ... Browse Code »

Ben Tebulin reported:

"Since v3.7.2 on two independent machines a very specific Git
repository fails in 9/10 cases on git-fsck due to an SHA1/memory
failures. This only occurs on a very specific repository and can be
reproduced stably on two independent laptops. Git mailing list ran
out of ideas and for me this looks like some very exotic kernel issue"

and bisected the failure to the backport of commit 53a59fc67f97 ("mm:
limit mmu_gather batching to fix soft lockups on !CONFIG_PREEMPT").

That commit itself is not actually buggy, but what it does is to make it
much more likely to hit the partial TLB invalidation case, since it
introduces a new case in tlb_next_batch() that previously only ever
happened when running out of memory.

The real bug is that the TLB gather virtual memory range setup is subtly
buggered. It was introduced in commit 597e1c3580b7 ("mm/mmu_gather:
enable tlb flush range in generic mmu_gather"), and the range handling
was already fixed at least once in commit e6c495a96ce0 ("mm: fix the TLB
range flushed when __tlb_remove_page() runs out of slots"), but that fix
was not complete.

The problem with the TLB gather virtual address range is that it isn't
set up by the initial tlb_gather_mmu() initialization (which didn't get
the TLB range information), but it is set up ad-hoc later by the
functions that actually flush the TLB. And so any such case that forgot
to update the TLB range entries would potentially miss TLB invalidates.

Rather than try to figure out exactly which particular ad-hoc range
setup was missing (I personally suspect it's the hugetlb case in
zap_huge_pmd(), which didn't have the same logic as zap_pte_range()
did), this patch just gets rid of the problem at the source: make the
TLB range information available to tlb_gather_mmu(), and initialize it
when initializing all the other tlb gather fields.

This makes the patch larger, but conceptually much simpler. And the end
result is much more understandable; even if you want to play games with
partial ranges when invalidating the TLB contents in chunks, now the
range information is always there, and anybody who doesn't want to
bother with it won't introduce subtle bugs.

Ben verified that this fixes his problem.

Reported-bisected-and-tested-by: Ben Tebulin
Build-testing-by: Stephen Rothwell
Build-testing-by: Richard Weinberger
Reviewed-by: Michal Hocko
Acked-by: Peter Zijlstra
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds

Linus Torvalds
2013-08-16 23:52:46 +0800

14 Aug, 2013

7 commits

8c8296223 fs/proc/task_mmu.c: fix buffer overflow in add_page_map() ... Browse Code »

Recently we met quite a lot of random kernel panic issues after enabling
CONFIG_PROC_PAGE_MONITOR. After debuggind we found this has something
to do with following bug in pagemap:

In struct pagemapread:

struct pagemapread {
int pos, len;
pagemap_entry_t *buffer;
bool v2;
};

pos is number of PM_ENTRY_BYTES in buffer, but len is the size of
buffer, it is a mistake to compare pos and len in add_page_map() for
checking buffer is full or not, and this can lead to buffer overflow and
random kernel panic issue.

Correct len to be total number of PM_ENTRY_BYTES in buffer.

[akpm@linux-foundation.org: document pagemapread.pos and .len units, fix PM_ENTRY_BYTES definition]
Signed-off-by: Yonghua Zheng
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

yonghua zheng
2013-08-14 08:57:50 +0800
d6394b590 ocfs2: fix null pointer dereference in ocfs2_dir_foreach_blk_id() ... Browse Code »

Fix a NULL pointer deference while removing an empty directory, which
was introduced by commit 3704412bdbf3 ("[readdir] convert ocfs2").

BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [] (null)
PGD 6da85067 PUD 6da89067 PMD 0
Oops: 0010 [#1] SMP
CPU: 0 PID: 6564 Comm: rmdir Tainted: G O 3.11.0-rc1 #4
RIP: 0010:[] [< (null)>] (null)
Call Trace:
ocfs2_dir_foreach+0x49/0x50 [ocfs2]
ocfs2_empty_dir+0x12c/0x3e0 [ocfs2]
ocfs2_unlink+0x56e/0xc10 [ocfs2]
vfs_rmdir+0xd5/0x140
do_rmdir+0x1cb/0x1e0
SyS_rmdir+0x16/0x20
system_call_fastpath+0x16/0x1b
Code: Bad RIP value.
RIP [< (null)>] (null)
RSP
CR2: 0000000000000000

[dan.carpenter@oracle.com: fix pointer math]
Signed-off-by: Jie Liu
Reported-by: David Weber
Cc: Al Viro
Cc: Joel Becker
Cc: Mark Fasheh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jeff Liu
2013-08-14 08:57:49 +0800
c7dd3392a ocfs2: fix NULL pointer dereference in ocfs2_duplicate_clusters_by_page ... Browse Code »

Since ocfs2_cow_file_pos will invoke ocfs2_refcount_icow with a NULL as
the struct file pointer, it finally result in a null pointer dereference
in ocfs2_duplicate_clusters_by_page.

This patch replace file pointer with inode pointer in
cow_duplicate_clusters to fix this issue.

[jeff.liu@oracle.com: rebased patch against linux-next tree]
Signed-off-by: Tiger Yang
Signed-off-by: Jie Liu
Cc: Joel Becker
Cc: Mark Fasheh
Acked-by: Tao Ma
Tested-by: David Weber
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Tiger Yang
2013-08-14 08:57:49 +0800
6115ea288 ocfs2: Revert 40bd62e to avoid regression in extended allocation ... Browse Code »

Revert commit 40bd62eb7fb8 ("fs/ocfs2/journal.h: add bits_wanted while
calculating credits in ocfs2_calc_extend_credits").

Unfortunately this change broke fallocate even if there is insufficient
disk space for the preallocation, which is a serious problem.

# df -h
/dev/sda8 22G 1.2G 21G 6% /ocfs2
# fallocate -o 0 -l 200M /ocfs2/testfile
fallocate: /ocfs2/test: fallocate failed: No space left on device

and a kernel warning:

CPU: 3 PID: 3656 Comm: fallocate Tainted: G W O 3.11.0-rc3 #2
Call Trace:
dump_stack+0x77/0x9e
warn_slowpath_common+0xc4/0x110
warn_slowpath_null+0x2a/0x40
start_this_handle+0x6c/0x640 [jbd2]
jbd2__journal_start+0x138/0x300 [jbd2]
jbd2_journal_start+0x23/0x30 [jbd2]
ocfs2_start_trans+0x166/0x300 [ocfs2]
__ocfs2_extend_allocation+0x38f/0xdb0 [ocfs2]
ocfs2_allocate_unwritten_extents+0x3c9/0x520
__ocfs2_change_file_space+0x5e0/0xa60 [ocfs2]
ocfs2_fallocate+0xb1/0xe0 [ocfs2]
do_fallocate+0x1cb/0x220
SyS_fallocate+0x6f/0xb0
system_call_fastpath+0x16/0x1b
JBD2: fallocate wants too many credits (51216 > 4381)

Signed-off-by: Jie Liu
Cc: Goldwyn Rodrigues
Cc: Joel Becker
Cc: Mark Fasheh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jie Liu
2013-08-14 08:57:49 +0800
b610ded71 hugetlb: fix lockdep splat caused by pmd sharing ... Browse Code »

Dave has reported the following lockdep splat:

=================================
[ INFO: inconsistent lock state ]
3.11.0-rc1+ #9 Not tainted
---------------------------------
inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes:
(&mapping->i_mmap_mutex){+.+.?.}, at: [] page_referenced+0x87/0x5e3
{RECLAIM_FS-ON-W} state was registered at:
mark_held_locks+0x81/0xe7
lockdep_trace_alloc+0x5e/0xbc
__alloc_pages_nodemask+0x8b/0x9b6
__get_free_pages+0x20/0x31
get_zeroed_page+0x12/0x14
__pmd_alloc+0x1c/0x6b
huge_pmd_share+0x265/0x283
huge_pte_alloc+0x5d/0x71
hugetlb_fault+0x7c/0x64a
handle_mm_fault+0x255/0x299
__do_page_fault+0x142/0x55c
do_page_fault+0xd/0x16
error_code+0x6c/0x74
irq event stamp: 3136917
hardirqs last enabled at (3136917): _raw_spin_unlock_irq+0x27/0x50
hardirqs last disabled at (3136916): _raw_spin_lock_irq+0x15/0x78
softirqs last enabled at (3136180): __do_softirq+0x137/0x30f
softirqs last disabled at (3136175): irq_exit+0xa8/0xaa
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&mapping->i_mmap_mutex);

lock(&mapping->i_mmap_mutex);

*** DEADLOCK ***
no locks held by kswapd0/49.

stack backtrace:
CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9
Hardware name: Dell Inc. Precision WorkStation 490 /0DT031, BIOS A08 04/25/2008
Call Trace:
dump_stack+0x4b/0x79
print_usage_bug+0x1d9/0x1e3
mark_lock+0x1e0/0x261
__lock_acquire+0x623/0x17f2
lock_acquire+0x7d/0x195
mutex_lock_nested+0x6c/0x3a7
page_referenced+0x87/0x5e3
shrink_page_list+0x3d9/0x947
shrink_inactive_list+0x155/0x4cb
shrink_lruvec+0x300/0x5ce
shrink_zone+0x53/0x14e
kswapd+0x517/0xa75
kthread+0xa8/0xaa
ret_from_kernel_thread+0x1b/0x28

which is a false positive caused by hugetlb pmd sharing code which
allocates a new pmd from withing mapping->i_mmap_mutex. If this
allocation causes reclaim then the lockdep detector complains that we
might self-deadlock.

This is not correct though, because hugetlb pages are not reclaimable so
their mapping will be never touched from the reclaim path.

The patch tells lockup detector that hugetlb i_mmap_mutex is special by
assigning it a separate lockdep class so it won't report possible
deadlocks on unrelated mappings.

[peterz@infradead.org: comment for annotation]
Reported-by: Dave Jones
Signed-off-by: Michal Hocko
Cc: Peter Zijlstra
Reviewed-by: Minchan Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michal Hocko
2013-08-14 08:57:48 +0800
41bb3476b mm: save soft-dirty bits on file pages ... Browse Code »

Andy reported that if file page get reclaimed we lose the soft-dirty bit
if it was there, so save _PAGE_BIT_SOFT_DIRTY bit when page address get
encoded into pte entry. Thus when #pf happens on such non-present pte
we can restore it back.

Reported-by: Andy Lutomirski
Signed-off-by: Cyrill Gorcunov
Acked-by: Pavel Emelyanov
Cc: Matt Mackall
Cc: Xiao Guangrong
Cc: Marcelo Tosatti
Cc: KOSAKI Motohiro
Cc: Stephen Rothwell
Cc: Peter Zijlstra
Cc: "Aneesh Kumar K.V"
Cc: Minchan Kim
Cc: Wanpeng Li
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Cyrill Gorcunov
2013-08-14 08:57:48 +0800
179ef71cb mm: save soft-dirty bits on swapped pages ... Browse Code »

Andy Lutomirski reported that if a page with _PAGE_SOFT_DIRTY bit set
get swapped out, the bit is getting lost and no longer available when
pte read back.

To resolve this we introduce _PTE_SWP_SOFT_DIRTY bit which is saved in
pte entry for the page being swapped out. When such page is to be read
back from a swap cache we check for bit presence and if it's there we
clear it and restore the former _PAGE_SOFT_DIRTY bit back.

One of the problem was to find a place in pte entry where we can save
the _PTE_SWP_SOFT_DIRTY bit while page is in swap. The _PAGE_PSE was
chosen for that, it doesn't intersect with swap entry format stored in
pte.

Reported-by: Andy Lutomirski
Signed-off-by: Cyrill Gorcunov
Acked-by: Pavel Emelyanov
Cc: Matt Mackall
Cc: Xiao Guangrong
Cc: Marcelo Tosatti
Cc: KOSAKI Motohiro
Cc: Stephen Rothwell
Cc: Peter Zijlstra
Cc: "Aneesh Kumar K.V"
Reviewed-by: Minchan Kim
Reviewed-by: Wanpeng Li
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Cyrill Gorcunov
2013-08-14 08:57:47 +0800

13 Aug, 2013

2 commits

584d88b2c Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6 ... Browse Code »

Pull CIFS fixes from Steve French:
"A set of small cifs fixes, including 3 relating to symlink handling"

* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
cifs: don't instantiate new dentries in readdir for inodes that need to be revalidated immediately
cifs: set sb->s_d_op before calling d_make_root()
cifs: fix bad error handling in crypto code
cifs: file: initialize oparms.reconnect before using it
Do not attempt to do cifs operations reading symlinks with SMB2
cifs: extend the buffer length enought for sprintf() using

Linus Torvalds
2013-08-13 06:02:53 +0800
fd4f35d0f Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 ... Browse Code »

Pull more ext4 bugfixes from Ted Ts'o:
"A number of miscellaneous ext4 bugs fixes for v3.11, including a fix
so that if ext4 is built as a module, to allow it to be unloaded"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: flush the extent status cache during EXT4_IOC_SWAP_BOOT
ext4: fix mount/remount error messages for incompatible mount options
ext4: allow the mount options nodelalloc and data=journal

Linus Torvalds
2013-08-13 06:00:40 +0800

12 Aug, 2013

1 commit

cde2d7a79 ext4: flush the extent status cache during EXT4_IOC_SWAP_BOOT ... Browse Code »

Previously we weren't swapping only some of the extent_status LRU
fields during the processing of the EXT4_IOC_SWAP_BOOT ioctl. The
much safer thing to do is to just completely flush the extent status
tree when doing the swap.

Signed-off-by: "Theodore Ts'o"
Cc: Zheng Liu
Cc: stable@vger.kernel.org

Theodore Ts'o
2013-08-12 21:29:30 +0800

11 Aug, 2013

3 commits

d92581fca Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs ... Browse Code »

Pull btrfs fixes from Chris Mason:
"These are assorted fixes, mostly from Josef nailing down xfstests
runs. Zach also has a long standing fix for problems with readdir
wrapping f_pos (or ctx->pos)

These patches were spread out over different bases, so I rebased
things on top of rc4 and retested overnight"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: don't loop on large offsets in readdir
Btrfs: check to see if root_list is empty before adding it to dead roots
Btrfs: release both paths before logging dir/changed extents
Btrfs: allow splitting of hole em's when dropping extent cache
Btrfs: make sure the backref walker catches all refs to our extent
Btrfs: fix backref walking when we hit a compressed extent
Btrfs: do not offset physical if we're compressed
Btrfs: fix extent buffer leak after backref walking
Btrfs: fix a bug of snapshot-aware defrag to make it work on partial extents
btrfs: fix file truncation if FALLOC_FL_KEEP_SIZE is specified

Linus Torvalds
2013-08-11 06:21:47 +0800
b8ea0d06f Merge tag 'nfs-for-3.11-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs ... Browse Code »

Pull NFS client bugfixes from Trond Myklebust:

- Stable patch for lockd to fix Oopses due to inappropriate calls to
utsname()->nodename

- Stable patches for sunrpc to fix Oopses on shutdown when using
AF_LOCAL sockets with rpcbind

- Fix memory leak and error checking issues in nfs4_proc_lookup_mountpoint

- Fix a regression with the sync mount option failing to work for nfs4
mounts

- Fix a writeback performance issue when doing cache invalidation

- Remove an incorrect call to nfs_setsecurity in nfs_fhget

* tag 'nfs-for-3.11-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFSv4: Fix up nfs4_proc_lookup_mountpoint
NFS: Remove unnecessary call to nfs_setsecurity in nfs_fhget()
NFSv4: Fix the sync mount option for nfs4 mounts
NFS: Fix writeback performance issue on cache invalidation
SUNRPC: If the rpcbind channel is disconnected, fail the call to unregister
SUNRPC: Don't auto-disconnect from the local rpcbind socket
LOCKD: Don't call utsname()->nodename from nlmclnt_setlockargs

Linus Torvalds
2013-08-11 06:20:37 +0800
022e5d098 Merge branch 'for-3.11' of git://linux-nfs.org/~bfields/linux ... Browse Code »

Pull nfsd fixes from Bruce Fields:
"Some fixes for a 4.1 feature that in retrospect probably should have
waited for 3.12.... But it appears to be working now"

* 'for-3.11' of git://linux-nfs.org/~bfields/linux:
nfsd: Fix SP4_MACH_CRED negotiation in EXCHANGE_ID
nfsd4: Fix MACH_CRED NULL dereference

Linus Torvalds
2013-08-11 06:19:58 +0800

10 Aug, 2013

11 commits

db62efbbf btrfs: don't loop on large offsets in readdir ... Browse Code »

When btrfs readdir() hits the last entry it sets the readdir offset to a
huge value to stop buggy apps from breaking when the same name is
returned by readdir() with concurrent rename()s.

But unconditionally setting the offset to INT_MAX causes readdir() to
loop returning any entries with offsets past INT_MAX. It only takes a
few hours of constant file creation and removal to create entries past
INT_MAX.

So let's set the huge offset to LLONG_MAX if the last entry has already
overflowed 32bit loff_t. Without large offsets behaviour is identical.
With large offsets 64bit apps will work and 32bit apps will be no more
broken than they currently are if they see large offsets.

Signed-off-by: Zach Brown
Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Zach Brown
2013-08-10 07:34:56 +0800
cfad392b2 Btrfs: check to see if root_list is empty before adding it to dead roots ... Browse Code »

A user reported a panic when running with autodefrag and deleting snapshots.
This is because we could end up trying to add the root to the dead roots list
twice. To fix this check to see if we are empty before adding ourselves to the
dead roots list. Thanks,

Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Josef Bacik
2013-08-10 07:30:23 +0800
f3b15ccdb Btrfs: release both paths before logging dir/changed extents ... Browse Code »

The ceph guys tripped over this bug where we were still holding onto the
original path that we used to copy the inode with when logging. This is based
on Chris's fix which was reported to fix the problem. We need to drop the paths
in two cases anyway so just move the drop up so that we don't have duplicate
code. Thanks,

Cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Josef Bacik
2013-08-10 07:30:16 +0800
ee20a9831 Btrfs: allow splitting of hole em's when dropping extent cache ... Browse Code »

I noticed while running multi-threaded fsync tests that sometimes fsck would
complain about an improper gap. This happens because we fail to add a hole
extent to the file, which was happening when we'd split a hole EM because
btrfs_drop_extent_cache was just discarding the whole em instead of splitting
it. So this patch fixes this by allowing us to split a hole em properly, which
means that added holes actually get logged properly and we no longer see this
fsck error. Thankfully we're tolerant of these sort of problems so a user would
not see any adverse effects of this bug, other than fsck complaining. Thanks,

Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Josef Bacik
2013-08-10 07:30:09 +0800
ed8c4913d Btrfs: make sure the backref walker catches all refs to our extent ... Browse Code »

Because we don't mess with the offset into the extent for compressed we will
properly find both extents for this case

[extent a][extent b][rest of extent a]

but because we already added a ref for the front half we won't add the inode
information for the second half. This causes us to leak that memory and not
print out the other offset when we do logical-resolve. So fix this by calling
ulist_add_merge and then add our eie to the existing entry if there is one.
With this patch we get both offsets out of logical-resolve. With this and the
other 2 patches I've sent we now pass btrfs/276 on my vm with compress-force=lzo
set. Thanks,

Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Josef Bacik
2013-08-10 07:30:03 +0800
8ca15e05e Btrfs: fix backref walking when we hit a compressed extent ... Browse Code »

If you do btrfs inspect-internal logical-resolve on a compressed extent that has
been partly overwritten it won't find anything. This is because we try and
match the extent offset we've searched for based on the extent offset in the
data extent entry. However this doesn't work for compressed extents because the
offsets are for the uncompressed size, not the compressed size. So instead only
do this check if we are not compressed, that way we can get an actual entry for
the physical offset rather than nothing for compressed. Thanks,

Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Josef Bacik
2013-08-10 07:29:56 +0800
b76bb7013 Btrfs: do not offset physical if we're compressed ... Browse Code »

xfstest btrfs/276 was freaking out on slower boxes partly because fiemap was
offsetting the physical based on the extent offset. This is perfectly fine with
uncompressed extents, however the extent offset is into the uncompressed area,
not the compressed. So we can return a physical value that isn't at all within
the area we have allocated on disk. Fix this by returning the start of the
extent if it is compressed no matter what the offset. Thanks,

Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Josef Bacik
2013-08-10 07:29:50 +0800
b5b9b5b31 Btrfs: fix extent buffer leak after backref walking ... Browse Code »

commit 47fb091fb787420cd195e66f162737401cce023f(Btrfs: fix unlock after free on rewinded tree blocks)
takes an extra increment on the reference of allocated dummy extent buffer, so now we
cannot free this dummy one, and end up with extent buffer leak.

Signed-off-by: Liu Bo
Reviewed-by: Jan Schmidt
Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Liu Bo
2013-08-10 07:29:42 +0800
e68afa49a Btrfs: fix a bug of snapshot-aware defrag to make it work on partial extents ... Browse Code »

For partial extents, snapshot-aware defrag does not work as expected,
since
a) we use the wrong logical offset to search for parents, which should be
disk_bytenr + extent_offset, not just disk_bytenr,
b) 'offset' returned by the backref walking just refers to key.offset, not
the 'offset' stored in btrfs_extent_data_ref which is
(key.offset - extent_offset).

The reproducer:
$ mkfs.btrfs sda
$ mount sda /mnt
$ btrfs sub create /mnt/sub
$ for i in `seq 5 -1 1`; do dd if=/dev/zero of=/mnt/sub/foo bs=5k count=1 seek=$i conv=notrunc oflag=sync; done
$ btrfs sub snap /mnt/sub /mnt/snap1
$ btrfs sub snap /mnt/sub /mnt/snap2
$ sync; btrfs filesystem defrag /mnt/sub/foo;
$ umount /mnt
$ btrfs-debug-tree sda (Here we can check whether the defrag operation is snapshot-awared.

This addresses the above two problems.

Signed-off-by: Liu Bo
Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Liu Bo
2013-08-10 07:29:17 +0800
7cddc1939 btrfs: fix file truncation if FALLOC_FL_KEEP_SIZE is specified ... Browse Code »

Create a small file and fallocate it to a big size with
FALLOC_FL_KEEP_SIZE option, then truncate it back to the
small size again, the disk free space is not changed back
in this case. i.e,

total 4
-rw-r--r-- 1 root root 512 Jun 28 11:35 test

Filesystem Size Used Avail Use% Mounted on
....
/dev/sdb1 8.0G 56K 7.2G 1% /mnt

-rw-r--r-- 1 root root 512 Jun 28 11:35 /mnt/test

Filesystem Size Used Avail Use% Mounted on
....
/dev/sdb1 8.0G 5.1G 2.2G 70% /mnt

Filesystem Size Used Avail Use% Mounted on
....
/dev/sdb1 8.0G 5.1G 2.2G 70% /mnt

With this fix, the truncated up space is back as:
Filesystem Size Used Avail Use% Mounted on
....
/dev/sdb1 8.0G 56K 7.2G 1% /mnt

Signed-off-by: Jie Liu
Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Jie Liu
2013-08-10 07:28:56 +0800
201d3dfa4 dlm: kill the unnecessary and wrong device_close()->recalc_sigpending() ... Browse Code »

device_close()->recalc_sigpending() is not needed, sigprocmask() takes
care of TIF_SIGPENDING correctly.

And without ->siglock it is racy and wrong, it can wrongly clear
TIF_SIGPENDING and miss a signal.

But even with this patch device_close() is still buggy:

1. sigprocmask() should not be used, we have set_task_blocked(),
but this is minor.

2. We should never block SIGKILL or SIGSTOP, and this is what
the code tries to do.

3. This can't protect against SIGKILL or SIGSTOP anyway. Another
thread can do signal_wake_up(), say, do_signal_stop() or
complete_signal() or debugger.

4. sigprocmask(SIG_BLOCK, allsigs) doesn't necessarily clears
TIF_SIGPENDING, say, freezing() or ->jobctl.

5. device_write() looks equally wrong by the same reason.

Looks like, this tries to protect some wait_event_interruptible() logic
from signals, it should be turned into uninterruptible wait. Or we need
to implement something like signals_stop/start for such a use-case.

Signed-off-by: Oleg Nesterov
Signed-off-by: Linus Torvalds

Oleg Nesterov
2013-08-10 01:48:20 +0800

09 Aug, 2013

3 commits

6ae6514b3 ext4: fix mount/remount error messages for incompatible mount options ... Browse Code »

Commit 5688978 ("ext4: improve handling of conflicting mount options")
introduced incorrect messages shown while choosing wrong mount options.

First of all, both cases of incorrect mount options,
"data=journal,delalloc" and "data=journal,dioread_nolock" result in
the same error message.

Secondly, the problem above isn't solved for remount option: the
mismatched parameter is simply ignored. Moreover, ext4_msg states
that remount with options "data=journal,delalloc" succeeded, which is
not true.

To fix it up, I added a simple check after parse_options() call to
ensure that data=journal and delalloc/dioread_nolock parameters are
not present at the same time.

Signed-off-by: Piotr Sarna
Acked-by: Bartlomiej Zolnierkiewicz
Signed-off-by: Kyungmin Park
Signed-off-by: "Theodore Ts'o"
Cc: stable@vger.kernel.org

Piotr Sarna
2013-08-09 11:02:24 +0800
59d9fa5c2 ext4: allow the mount options nodelalloc and data=journal ... Browse Code »

Commit 26092bf ("ext4: use a table-driven handler for mount options")
wrongly disallows the specifying the mount options nodelalloc and
data=journal simultaneously. This is incorrect; it should have only
disallowed the combination of delalloc and data=journal
simultaneously.

Reported-by: Piotr Sarna
Signed-off-by: "Theodore Ts'o"
Cc: stable@vger.kernel.org

Theodore Ts'o
2013-08-09 11:01:24 +0800
55f5bfd4c Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 ... Browse Code »

Pull ext4 bugfixes from Ted Ts'o.

Misc ext4 fixes, delayed by Ted moving mail servers and email getting
marked as spam due to bad spf records.

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: add WARN_ON to check the length of allocated blocks
ext4: fix retry handling in ext4_ext_truncate()
ext4: destroy ext4_es_cachep on module unload
ext4: make sure group number is bumped after a inode allocation race

Linus Torvalds
2013-08-09 00:38:19 +0800

08 Aug, 2013

7 commits

b72888cb0 NFSv4: Fix up nfs4_proc_lookup_mountpoint ... Browse Code »

Currently, we do not check the return value of client = rpc_clone_client(),
nor do we shut down the resulting cloned rpc_clnt in the case where a
NFS4ERR_WRONGSEC has caused nfs4_proc_lookup_common() to replace the
original value of 'client' (causing a memory leak).

Fix both issues and simplify the code by moving the call to
rpc_clone_client() until after nfs4_proc_lookup_common() has
done its business.

Reported-by: Andy Adamson
Signed-off-by: Trond Myklebust

Trond Myklebust
2013-08-08 08:47:26 +0800
eddffa408 NFS: Remove unnecessary call to nfs_setsecurity in nfs_fhget() ... Browse Code »

We only need to call it on the creation of the inode.

Reported-by: Julia Lawall
Cc: Steve Dickson
Cc: Dave Quigley
Signed-off-by: Trond Myklebust

Trond Myklebust
2013-08-08 05:07:41 +0800
e890db010 NFSv4: Fix the sync mount option for nfs4 mounts ... Browse Code »

The sync mount option stopped working for NFSv4 mounts after commit
c02d7adf8c5429727a98bad1d039bccad4c61c50 (NFSv4: Replace nfs4_path_walk() with
FS path lookup in a private namespace). If MS_SYNCHRONOUS is set in the
super_block that we're cloning from, then it should be set in the new
super_block as well.

Signed-off-by: Scott Mayhew
Signed-off-by: Trond Myklebust

Scott Mayhew
2013-08-08 05:07:41 +0800
f8806c843 NFS: Fix writeback performance issue on cache invalidation ... Browse Code »

If a cache invalidation is triggered, and we happen to have a lot of
writebacks cached at the time, then the call to invalidate_inode_pages2()
will end up calling ->launder_page() on each and every dirty page in order
to sync its contents to disk, thus defeating write coalescing.
The following patch ensures that we try to sync the inode to disk before
calling invalidate_inode_pages2() so that we do the writeback as efficiently
as possible.

Reported-by: William Dauchy
Reported-by: Pascal Bouchareine
Signed-off-by: Trond Myklebust
Tested-by: William Dauchy
Reviewed-by: Jeff Layton

Trond Myklebust
2013-08-08 05:07:40 +0800
b7bc9e7d8 Merge tag 'trace-fixes-3.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/gi… ... Browse Code »

…t/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
"Oleg Nesterov has been working hard in closing all the holes that can
lead to race conditions between deleting an event and accessing an
event debugfs file. This included a fix to the debugfs system (acked
by Greg Kroah-Hartman). We think that all the holes have been patched
and hopefully we don't find more. I haven't marked all of them for
stable because I need to examine them more to figure out how far back
some of the changes need to go.

Along the way, some other fixes have been made. Alexander Z Lam fixed
some logic where the wrong buffer was being modifed.

Andrew Vagin found a possible corruption for machines that actually
allocate cpumask, as a reference to one was being zeroed out by
mistake.

Dhaval Giani found a bad prototype when tracing is not configured.

And I not only had some changes to help Oleg, but also finally fixed a
long standing bug that Dave Jones and others have been hitting, where
a module unload and reload can cause the function tracing accounting
to get screwed up"

* tag 'trace-fixes-3.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix reset of time stamps during trace_clock changes
tracing: Make TRACE_ITER_STOP_ON_FREE stop the correct buffer
tracing: Fix trace_dump_stack() proto when CONFIG_TRACING is not set
tracing: Fix fields of struct trace_iterator that are zeroed by mistake
tracing/uprobes: Fail to unregister if probe event files are in use
tracing/kprobes: Fail to unregister if probe event files are in use
tracing: Add comment to describe special break case in probe_remove_event_call()
tracing: trace_remove_event_call() should fail if call/file is in use
debugfs: debugfs_remove_recursive() must not rely on list_empty(d_subdirs)
ftrace: Check module functions being traced on reload
ftrace: Consolidate some duplicate code for updating ftrace ops
tracing: Change remove_event_file_dir() to clear "d_subdirs"->i_private
tracing: Introduce remove_event_file_dir()
tracing: Change f_start() to take event_mutex and verify i_private != NULL
tracing: Change event_filter_read/write to verify i_private != NULL
tracing: Change event_enable/disable_read() to verify i_private != NULL
tracing: Turn event/id->i_private into call->event.type

Linus Torvalds
2013-08-08 04:01:30 +0800
58cd57bfd nfsd: Fix SP4_MACH_CRED negotiation in EXCHANGE_ID ... Browse Code »

- don't BUG_ON() when not SP4_NONE
- calculate recv and send reserve sizes correctly

Signed-off-by: Weston Andros Adamson
Signed-off-by: J. Bruce Fields

Weston Andros Adamson
2013-08-08 00:06:07 +0800
c47205914 nfsd4: Fix MACH_CRED NULL dereference ... Browse Code »

Fixes a NULL-dereference on attempts to use MACH_CRED protection over
auth_sys.

Signed-off-by: J. Bruce Fields

J. Bruce Fields
2013-08-08 00:05:51 +0800

07 Aug, 2013

1 commit

757c4f626 cifs: don't instantiate new dentries in readdir for inodes that need to be revalidated immediately ... Browse Code »

David reported that commit c2b93e06 (cifs: only set ops for inodes in
I_NEW state) caused a regression with mfsymlinks. Prior to that patch,
if a mfsymlink dentry was instantiated at readdir time, the inode would
get a new set of ops when it was revalidated. After that patch, this
did not occur.

This patch addresses this by simply skipping instantiating dentries in
the readdir codepath when we know that they will need to be immediately
revalidated. The next attempt to use that dentry will cause a new lookup
to occur (which is basically what we want to happen anyway).

Cc:
Cc: "Stefan (metze) Metzmacher"
Cc: Sachin Prabhu
Reported-and-Tested-by: David McBride
Signed-off-by: Jeff Layton
Signed-off-by: Steve French

Jeff Layton
2013-08-07 23:57:06 +0800

06 Aug, 2013

1 commit

9a1b6bf81 LOCKD: Don't call utsname()->nodename from nlmclnt_setlockargs ... Browse Code »

Firstly, nlmclnt_setlockargs can be called from a reclaimer thread, in
which case we're in entirely the wrong namespace.

Secondly, commit 8aac62706adaaf0fab02c4327761561c8bda9448 (move
exit_task_namespaces() outside of exit_notify()) now means that
exit_task_work() is called after exit_task_namespaces(), which
triggers an Oops when we're freeing up the locks.

Fix this by ensuring that we initialise the nlm_host's rpc_client at mount
time, so that the cl_nodename field is initialised to the value of
utsname()->nodename that the net namespace uses. Then replace the
lockd callers of utsname()->nodename.

Signed-off-by: Trond Myklebust
Cc: Toralf Förster
Cc: Oleg Nesterov
Cc: Nix
Cc: Jeff Layton
Cc: stable@vger.kernel.org # 3.10.x

Trond Myklebust
2013-08-06 03:03:46 +0800

05 Aug, 2013

3 commits

3d62c45b3 vfs: add missing check for __O_TMPFILE in fcntl_init() ... Browse Code »

As comment in include/uapi/asm-generic/fcntl.h described, when
introducing new O_* bits, we need to check its uniqueness in
fcntl_init(). But __O_TMPFILE bit is missing. So fix it.

Signed-off-by: Zheng Liu
Signed-off-by: Al Viro

Zheng Liu
2013-08-05 22:25:32 +0800
bb2314b47 fs: Allow unprivileged linkat(..., AT_EMPTY_PATH) aka flink ... Browse Code »

Every now and then someone proposes a new flink syscall, and this spawns
a long discussion of whether it would be a security problem. I think
that this is missing the point: flink is *already* allowed without
privilege as long as /proc is mounted -- it's called AT_SYMLINK_FOLLOW.

Now that O_TMPFILE is here, the ability to create a file with O_TMPFILE,
write it, and link it in is very convenient. The only problem is that
it requires that /proc be mounted so that you can do:

linkat(AT_FDCWD, "/proc/self/fd/", dfd, path, AT_SYMLINK_NOFOLLOW)

This sucks -- it's much nicer to do:

linkat(tmpfd, "", dfd, path, AT_EMPTY_PATH)

Let's allow it.

If this turns out to be excessively scary, it we could instead require
that the inode in question be I_LINKABLE, but this seems pointless given
the /proc situation

Signed-off-by: Andy Lutomirski
Signed-off-by: Al Viro

Andy Lutomirski
2013-08-05 22:24:11 +0800
e305f48bc fs: Fix file mode for O_TMPFILE ... Browse Code »

O_TMPFILE, like O_CREAT, should respect the requested mode and should
create regular files.

This fixes two bugs: O_TMPFILE required privilege (because the mode
ended up as 000) and it produced bogus inodes with no type.

Signed-off-by: Andy Lutomirski
Signed-off-by: Al Viro

Andy Lutomirski
2013-08-05 22:24:10 +0800