Eric Lee / smarc-fsl-linux-kernel

03 Apr, 2012

20 commits

8633c0f77 NFSv4: Fix two infinite loops in the mount code ... Browse Code »

commit 05e9cfb408b24debb3a85fd98edbfd09dd148881 upstream.

We can currently loop forever in nfs4_lookup_root() and in
nfs41_proc_secinfo_no_name(), if the first iteration returns a
NFS4ERR_DELAY or something else that causes exception.retry to get
set.

Reported-by: Weston Andros Adamson
Signed-off-by: Trond Myklebust
Signed-off-by: Greg Kroah-Hartman

Trond Myklebust
2012-04-03 01:32:22 +0800
1335e5bef xfs: Fix oops on IO error during xlog_recover_process_iunlinks() ... Browse Code »

commit d97d32edcd732110758799ae60af725e5110b3dc upstream.

When an IO error happens during inode deletion run from
xlog_recover_process_iunlinks() filesystem gets shutdown. Thus any subsequent
attempt to read buffers fails. Code in xlog_recover_process_iunlinks() does not
count with the fact that read of a buffer which was read a while ago can
really fail which results in the oops on
agi = XFS_BUF_TO_AGI(agibp);

Fix the problem by cleaning up the buffer handling in
xlog_recover_process_iunlinks() as suggested by Dave Chinner. We release buffer
lock but keep buffer reference to AG buffer. That is enough for buffer to stay
pinned in memory and we don't have to call xfs_read_agi() all the time.

Signed-off-by: Jan Kara
Reviewed-by: Dave Chinner
Signed-off-by: Ben Myers
Signed-off-by: Greg Kroah-Hartman

Jan Kara
2012-04-03 01:32:21 +0800
403ce368a vfs: fix d_ancestor() case in d_materialize_unique ... Browse Code »

commit b18dafc86bb879d2f38a1743985d7ceb283c2f4d upstream.

In d_materialise_unique() there are 3 subcases to the 'aliased dentry'
case; in two subcases the inode i_lock is properly released but this
does not occur in the -ELOOP subcase.

This seems to have been introduced by commit 1836750115f2 ("fix loop
checks in d_materialise_unique()").

Signed-off-by: Michel Lespinasse
[ Added a comment, and moved the unlock to where we generate the -ELOOP,
which seems to be more natural.

You probably can't actually trigger this without a buggy network file
server - d_materialize_unique() is for finding aliases on non-local
filesystems, and the d_ancestor() case is for a hardlinked directory
loop.

But we should be robust in the case of such buggy servers anyway. ]
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman

Michel Lespinasse
2012-04-03 01:32:20 +0800
786be162a ext4: check for zero length extent ... Browse Code »

commit 31d4f3a2f3c73f279ff96a7135d7202ef6833f12 upstream.

Explicitly test for an extent whose length is zero, and flag that as a
corrupted extent.

This avoids a kernel BUG_ON assertion failure.

Tested: Without this patch, the file system image found in
tests/f_ext_zero_len/image.gz in the latest e2fsprogs sources causes a
kernel panic. With this patch, an ext4 file system error is noted
instead, and the file system is marked as being corrupted.

https://bugzilla.kernel.org/show_bug.cgi?id=42859

Signed-off-by: "Theodore Ts'o"
Signed-off-by: Greg Kroah-Hartman

Theodore Ts'o
2012-04-03 01:32:20 +0800
2cb2d68b8 ext4: fix race between sync and completed io work ... Browse Code »

commit 491caa43639abcffaa645fbab372a7ef4ce2975c upstream.

The following command line will leave the aio-stress process unkillable
on an ext4 file system (in my case, mounted on /mnt/test):

aio-stress -t 20 -s 10 -O -S -o 2 -I 1000 /mnt/test/aiostress.3561.4 /mnt/test/aiostress.3561.4.20 /mnt/test/aiostress.3561.4.19 /mnt/test/aiostress.3561.4.18 /mnt/test/aiostress.3561.4.17 /mnt/test/aiostress.3561.4.16 /mnt/test/aiostress.3561.4.15 /mnt/test/aiostress.3561.4.14 /mnt/test/aiostress.3561.4.13 /mnt/test/aiostress.3561.4.12 /mnt/test/aiostress.3561.4.11 /mnt/test/aiostress.3561.4.10 /mnt/test/aiostress.3561.4.9 /mnt/test/aiostress.3561.4.8 /mnt/test/aiostress.3561.4.7 /mnt/test/aiostress.3561.4.6 /mnt/test/aiostress.3561.4.5 /mnt/test/aiostress.3561.4.4 /mnt/test/aiostress.3561.4.3 /mnt/test/aiostress.3561.4.2

This is using the aio-stress program from the xfstests test suite.
That particular command line tells aio-stress to do random writes to
20 files from 20 threads (one thread per file). The files are NOT
preallocated, so you will get writes to random offsets within the
file, thus creating holes and extending i_size. It also opens the
file with O_DIRECT and O_SYNC.

On to the problem. When an I/O requires unwritten extent conversion,
it is queued onto the completed_io_list for the ext4 inode. Two code
paths will pull work items from this list. The first is the
ext4_end_io_work routine, and the second is ext4_flush_completed_IO,
which is called via the fsync path (and O_SYNC handling, as well).
There are two issues I've found in these code paths. First, if the
fsync path beats the work routine to a particular I/O, the work
routine will free the io_end structure! It does not take into account
the fact that the io_end may still be in use by the fsync path. I've
fixed this issue by adding yet another IO_END flag, indicating that
the io_end is being processed by the fsync path.

The second problem is that the work routine will make an assignment to
io->flag outside of the lock. I have witnessed this result in a hang
at umount. Moving the flag setting inside the lock resolved that
problem.

The problem was introduced by commit b82e384c7b ("ext4: optimize
locking for end_io extent conversion"), which first appeared in 3.2.
As such, the fix should be backported to that release (probably along
with the unwritten extent conversion race fix).

Signed-off-by: Jeff Moyer
Signed-off-by: "Theodore Ts'o"
Signed-off-by: Greg Kroah-Hartman

Jeff Moyer
2012-04-03 01:32:20 +0800
2a45b2e1c ext4: fix race between unwritten extent conversion and truncate ... Browse Code »

commit 266991b13890049ee1a6bb95b9817f06339ee3d7 upstream.

The following comment in ext4_end_io_dio caught my attention:

/* XXX: probably should move into the real I/O completion handler */
inode_dio_done(inode);

The truncate code takes i_mutex, then calls inode_dio_wait. Because the
ext4 code path above will end up dropping the mutex before it is
reacquired by the worker thread that does the extent conversion, it
seems to me that the truncate can happen out of order. Jan Kara
mentioned that this might result in error messages in the system logs,
but that should be the extent of the "damage."

The fix is pretty straight-forward: don't call inode_dio_done until the
extent conversion is complete.

Reviewed-by: Jan Kara
Signed-off-by: Jeff Moyer
Signed-off-by: "Theodore Ts'o"
Signed-off-by: Greg Kroah-Hartman

Jeff Moyer
2012-04-03 01:32:20 +0800
9dacb0fb0 ext4: ignore EXT4_INODE_JOURNAL_DATA flag with delalloc ... Browse Code »

commit 3d2b158262826e8b75bbbfb7b97010838dd92ac7 upstream.

Ext4 does not support data journalling with delayed allocation enabled.
We even do not allow to mount the file system with delayed allocation
and data journalling enabled, however it can be set via FS_IOC_SETFLAGS
so we can hit the inode with EXT4_INODE_JOURNAL_DATA set even on file
system mounted with delayed allocation (default) and that's where
problem arises. The easies way to reproduce this problem is with the
following set of commands:

mkfs.ext4 /dev/sdd
mount /dev/sdd /mnt/test1
dd if=/dev/zero of=/mnt/test1/file bs=1M count=4
chattr +j /mnt/test1/file
dd if=/dev/zero of=/mnt/test1/file bs=1M count=4 conv=notrunc
chattr -j /mnt/test1/file

Additionally it can be reproduced quite reliably with xfstests 272 and
269. In fact the above reproducer is a part of test 272.

To fix this we should ignore the EXT4_INODE_JOURNAL_DATA inode flag if
the file system is mounted with delayed allocation. This can be easily
done by fixing ext4_should_*_data() functions do ignore data journal
flag when delalloc is set (suggested by Ted). We also have to set the
appropriate address space operations for the inode (again, ignoring data
journal flag if delalloc enabled).

Additionally this commit introduces ext4_inode_journal_mode() function
because ext4_should_*_data() has already had a lot of common code and
this change is putting it all into one function so it is easier to
read.

Successfully tested with xfstests in following configurations:

delalloc + data=ordered
delalloc + data=writeback
data=journal
nodelalloc + data=ordered
nodelalloc + data=writeback
nodelalloc + data=journal

Signed-off-by: Lukas Czerner
Signed-off-by: "Theodore Ts'o"
Signed-off-by: Greg Kroah-Hartman

Lukas Czerner
2012-04-03 01:32:13 +0800
7da65e2c1 jbd2: clear BH_Delay & BH_Unwritten in journal_unmap_buffer ... Browse Code »

commit 15291164b22a357cb211b618adfef4fa82fc0de3 upstream.

journal_unmap_buffer()'s zap_buffer: code clears a lot of buffer head
state ala discard_buffer(), but does not touch _Delay or _Unwritten as
discard_buffer() does.

This can be problematic in some areas of the ext4 code which assume
that if they have found a buffer marked unwritten or delay, then it's
a live one. Perhaps those spots should check whether it is mapped
as well, but if jbd2 is going to tear down a buffer, let's really
tear it down completely.

Without this I get some fsx failures on sub-page-block filesystems
up until v3.2, at which point 4e96b2dbbf1d7e81f22047a50f862555a6cb87cb
and 189e868fa8fdca702eb9db9d8afc46b5cb9144c9 make the failures go
away, because buried within that large change is some more flag
clearing. I still think it's worth doing in jbd2, since
->invalidatepage leads here directly, and it's the right place
to clear away these flags.

Signed-off-by: Eric Sandeen
Signed-off-by: "Theodore Ts'o"
Signed-off-by: Greg Kroah-Hartman

Eric Sandeen
2012-04-03 01:32:11 +0800
f8fbfe880 NFSv4: Rate limit the state manager warning messages ... Browse Code »

commit 9a3ba432330e504ac61ff0043dbdaba7cea0e35a upstream.

Prevent the state manager from filling up system logs when recovery
fails on the server.

Signed-off-by: Trond Myklebust
Signed-off-by: Greg Kroah-Hartman

Trond Myklebust
2012-04-03 01:32:10 +0800
dc2497862 sysctl: protect poll() in entries that may go away ... Browse Code »

commit 4e474a00d7ff746ed177ddae14fa8b2d4bad7a00 upstream.

Protect code accessing ctl_table by grabbing the header with grab_header()
and after releasing with sysctl_head_finish(). This is needed if poll()
is called in entries created by modules: currently only hostname and
domainname support poll(), but this bug may be triggered when/if modules
use it and if user called poll() in a file that doesn't support it.

Dave Jones reported the following when using a syscall fuzzer while
hibernating/resuming:

RIP: 0010:[] [] proc_sys_poll+0x4e/0x90
RAX: 0000000000000145 RBX: ffff88020cab6940 RCX: 0000000000000000
RDX: ffffffff81233df0 RSI: 6b6b6b6b6b6b6b6b RDI: ffff88020cab6940
[ ... ]
Code: 00 48 89 fb 48 89 f1 48 8b 40 30 4c 8b 60 e8 b8 45 01 00 00 49 83
7c 24 28 00 74 2e 49 8b 74 24 30 48 85 f6 74 24 48 85 c9 75 32 16
b8 45 01 00 00 48 63 d2 49 39 d5 74 10 8b 06 48 98 48 89

If an entry goes away while we are polling() it, ctl_table may not exist
anymore.

Reported-by: Dave Jones
Signed-off-by: Lucas De Marchi
Cc: Al Viro
Cc: Linus Torvalds
Cc: Alexey Dobriyan
Signed-off-by: Andrew Morton
Signed-off-by: Eric W. Biederman
Signed-off-by: Greg Kroah-Hartman

Lucas De Marchi
2012-04-03 01:32:09 +0800
812e7c226 proc-ns: use d_set_d_op() API to set dentry ops in proc_ns_instantiate(). ... Browse Code »

commit 1b26c9b334044cff6d1d2698f2be41bc7d9a0864 upstream.

The namespace cleanup path leaks a dentry which holds a reference count
on a network namespace. Keeping that network namespace from being freed
when the last user goes away. Leaving things like vlan devices in the
leaked network namespace.

If you use ip netns add for much real work this problem becomes apparent
pretty quickly. It light testing the problem hides because frequently
you simply don't notice the leak.

Use d_set_d_op() so that DCACHE_OP_* flags are set correctly.

This issue exists back to 3.0.

Acked-by: "Eric W. Biederman"
Reported-by: Justin Pettit
Signed-off-by: Pravin B Shelar
Signed-off-by: Jesse Gross
Cc: David Miller
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman

Pravin B Shelar
2012-04-03 01:32:08 +0800
a9f67ccd8 CIFS: Fix a spurious error in cifs_push_posix_locks ... Browse Code »

commit ce85852b90a214cf577fc1b4f49d99fd7e98784a upstream.

Signed-off-by: Pavel Shilovsky
Reviewed-by: Jeff Layton
Reported-by: Ben Hutchings
Signed-off-by: Steve French
Signed-off-by: Greg Kroah-Hartman

Pavel Shilovsky
2012-04-03 01:32:07 +0800
f743834c4 cifs: fix issue mounting of DFS ROOT when redirecting from one domain controller to the next ... Browse Code »

commit 1daaae8fa4afe3df78ca34e724ed7e8187e4eb32 upstream.

This patch fixes an issue when cifs_mount receives a
STATUS_BAD_NETWORK_NAME error during cifs_get_tcon but is able to
continue after an DFS ROOT referral. In this case, the return code
variable is not reset prior to trying to mount from the system referred
to. Thus, is_path_accessible is not executed and the final DFS referral
is not performed causing a mount error.

Use case: In DNS, example.com resolves to the secondary AD server
ad2.example.com Our primary domain controller is ad1.example.com and has
a DFS redirection set up from \\ad1\share\Users to \\files\share\Users.
Mounting \\example.com\share\Users fails.

Regression introduced by commit 724d9f1.

Reviewed-by: Pavel Shilovsky
Signed-off-by: Jeff Layton
Signed-off-by: Steve French
Signed-off-by: Greg Kroah-Hartman

Jeff Layton
2012-04-03 01:32:07 +0800
ffb07758c CIFS: Respect negotiated MaxMpxCount ... Browse Code »

commit 10b9b98e41ba248a899f6175ce96ee91431b6194 upstream.

Some servers sets this value less than 50 that was hardcoded and
we lost the connection if when we exceed this limit. Fix this by
respecting this value - not sending more than the server allows.

Reviewed-by: Jeff Layton
Signed-off-by: Pavel Shilovsky
Signed-off-by: Steve French
Signed-off-by: Greg Kroah-Hartman

Pavel Shilovsky
2012-04-03 01:32:07 +0800
16223d0d6 xfs: fix inode lookup race ... Browse Code »

commit f30d500f809eca67a21704347ab14bb35877b5ee upstream.

When we get concurrent lookups of the same inode that is not in the
per-AG inode cache, there is a race condition that triggers warnings
in unlock_new_inode() indicating that we are initialising an inode
that isn't in a the correct state for a new inode.

When we do an inode lookup via a file handle or a bulkstat, we don't
serialise lookups at a higher level through the dentry cache (i.e.
pathless lookup), and so we can get concurrent lookups of the same
inode.

The race condition is between the insertion of the inode into the
cache in the case of a cache miss and a concurrently lookup:

Thread 1 Thread 2
xfs_iget()
xfs_iget_cache_miss()
xfs_iread()
lock radix tree
radix_tree_insert()
rcu_read_lock
radix_tree_lookup
lock inode flags
XFS_INEW not set
igrab()
unlock inode flags
rcu_read_unlock
use uninitialised inode
.....
lock inode flags
set XFS_INEW
unlock inode flags
unlock radix tree
xfs_setup_inode()
inode flags = I_NEW
unlock_new_inode()
WARNING as inode flags != I_NEW

This can lead to inode corruption, inode list corruption, etc, and
is generally a bad thing to occur.

Fix this by setting XFS_INEW before inserting the inode into the
radix tree. This will ensure any concurrent lookup will find the new
inode with XFS_INEW set and that forces the lookup to wait until the
XFS_INEW flag is removed before allowing the lookup to succeed.

Signed-off-by: Dave Chinner
Reviewed-by: Christoph Hellwig
Signed-off-by: Ben Myers
Signed-off-by: Greg Kroah-Hartman

Dave Chinner
2012-04-03 01:32:07 +0800
db72f8257 NFSv4: Return the delegation if the server returns NFS4ERR_OPENMODE ... Browse Code »

commit 3114ea7a24d3264c090556a2444fc6d2c06176d4 upstream.

If a setattr() fails because of an NFS4ERR_OPENMODE error, it is
probably due to us holding a read delegation. Ensure that the
recovery routines return that delegation in this case.

Reported-by: Miklos Szeredi
Signed-off-by: Trond Myklebust
Signed-off-by: Greg Kroah-Hartman

Trond Myklebust
2012-04-03 01:32:07 +0800
87ff507d3 NFS: Properly handle the case where the delegation is revoked ... Browse Code »

commit a1d0b5eebc4fd6e0edb02688b35f17f67f42aea5 upstream.

If we know that the delegation stateid is bad or revoked, we need to
remove that delegation as soon as possible, and then mark all the
stateids that relied on that delegation for recovery. We cannot use
the delegation as part of the recovery process.

Also note that NFSv4.1 uses a different error code (NFS4ERR_DELEG_REVOKED)
to indicate that the delegation was revoked.

Finally, ensure that setlk() and setattr() can both recover safely from
a revoked delegation.

Signed-off-by: Trond Myklebust
Signed-off-by: Greg Kroah-Hartman

Trond Myklebust
2012-04-03 01:32:07 +0800
a3d921898 hugetlbfs: avoid taking i_mutex from hugetlbfs_read() ... Browse Code »

commit a05b0855fd15504972dba2358e5faa172a1e50ba upstream.

Taking i_mutex in hugetlbfs_read() can result in deadlock with mmap as
explained below

Thread A:
read() on hugetlbfs
hugetlbfs_read() called
i_mutex grabbed
hugetlbfs_read_actor() called
__copy_to_user() called
page fault is triggered
Thread B, sharing address space with A:
mmap() the same file
->mmap_sem is grabbed on task_B->mm->mmap_sem
hugetlbfs_file_mmap() is called
attempt to grab ->i_mutex and block waiting for A to give it up
Thread A:
pagefault handled blocked on attempt to grab task_A->mm->mmap_sem,
which happens to be the same thing as task_B->mm->mmap_sem. Block waiting
for B to give it up.

AFAIU the i_mutex locking was added to hugetlbfs_read() as per
http://lkml.indiana.edu/hypermail/linux/kernel/0707.2/3066.html to take
care of the race between truncate and read. This patch fixes this by
looking at page->mapping under lock_page() (find_lock_page()) to ensure
that the inode didn't get truncated in the range during a parallel read.

Ideally we can extend the patch to make sure we don't increase i_size in
mmap. But that will break userspace, because applications will now have
to use truncate(2) to increase i_size in hugetlbfs.

Based on the original patch from Hillf Danton.

Signed-off-by: Aneesh Kumar K.V
Cc: Hillf Danton
Cc: KAMEZAWA Hiroyuki
Cc: Al Viro
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman

Aneesh Kumar K.V
2012-04-03 01:31:54 +0800
4a1d70419 mm: thp: fix pmd_bad() triggering in code paths holding mmap_sem read mode ... Browse Code »

commit 1a5a9906d4e8d1976b701f889d8f35d54b928f25 upstream.

In some cases it may happen that pmd_none_or_clear_bad() is called with
the mmap_sem hold in read mode. In those cases the huge page faults can
allocate hugepmds under pmd_none_or_clear_bad() and that can trigger a
false positive from pmd_bad() that will not like to see a pmd
materializing as trans huge.

It's not khugepaged causing the problem, khugepaged holds the mmap_sem
in write mode (and all those sites must hold the mmap_sem in read mode
to prevent pagetables to go away from under them, during code review it
seems vm86 mode on 32bit kernels requires that too unless it's
restricted to 1 thread per process or UP builds). The race is only with
the huge pagefaults that can convert a pmd_none() into a
pmd_trans_huge().

Effectively all these pmd_none_or_clear_bad() sites running with
mmap_sem in read mode are somewhat speculative with the page faults, and
the result is always undefined when they run simultaneously. This is
probably why it wasn't common to run into this. For example if the
madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page
fault, the hugepage will not be zapped, if the page fault runs first it
will be zapped.

Altering pmd_bad() not to error out if it finds hugepmds won't be enough
to fix this, because zap_pmd_range would then proceed to call
zap_pte_range (which would be incorrect if the pmd become a
pmd_trans_huge()).

The simplest way to fix this is to read the pmd in the local stack
(regardless of what we read, no need of actual CPU barriers, only
compiler barrier needed), and be sure it is not changing under the code
that computes its value. Even if the real pmd is changing under the
value we hold on the stack, we don't care. If we actually end up in
zap_pte_range it means the pmd was not none already and it was not huge,
and it can't become huge from under us (khugepaged locking explained
above).

All we need is to enforce that there is no way anymore that in a code
path like below, pmd_trans_huge can be false, but pmd_none_or_clear_bad
can run into a hugepmd. The overhead of a barrier() is just a compiler
tweak and should not be measurable (I only added it for THP builds). I
don't exclude different compiler versions may have prevented the race
too by caching the value of *pmd on the stack (that hasn't been
verified, but it wouldn't be impossible considering
pmd_none_or_clear_bad, pmd_bad, pmd_trans_huge, pmd_none are all inlines
and there's no external function called in between pmd_trans_huge and
pmd_none_or_clear_bad).

if (pmd_trans_huge(*pmd)) {
if (next-addr != HPAGE_PMD_SIZE) {
VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
split_huge_page_pmd(vma->vm_mm, pmd);
} else if (zap_huge_pmd(tlb, vma, pmd, addr))
continue;
/* fall through */
}
if (pmd_none_or_clear_bad(pmd))

Because this race condition could be exercised without special
privileges this was reported in CVE-2012-1179.

The race was identified and fully explained by Ulrich who debugged it.
I'm quoting his accurate explanation below, for reference.

====== start quote =======
mapcount 0 page_mapcount 1
kernel BUG at mm/huge_memory.c:1384!

At some point prior to the panic, a "bad pmd ..." message similar to the
following is logged on the console:

mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7).

The "bad pmd ..." message is logged by pmd_clear_bad() before it clears
the page's PMD table entry.

143 void pmd_clear_bad(pmd_t *pmd)
144 {
-> 145 pmd_ERROR(*pmd);
146 pmd_clear(pmd);
147 }

After the PMD table entry has been cleared, there is an inconsistency
between the actual number of PMD table entries that are mapping the page
and the page's map count (_mapcount field in struct page). When the page
is subsequently reclaimed, __split_huge_page() detects this inconsistency.

1381 if (mapcount != page_mapcount(page))
1382 printk(KERN_ERR "mapcount %d page_mapcount %d\n",
1383 mapcount, page_mapcount(page));
-> 1384 BUG_ON(mapcount != page_mapcount(page));

The root cause of the problem is a race of two threads in a multithreaded
process. Thread B incurs a page fault on a virtual address that has never
been accessed (PMD entry is zero) while Thread A is executing an madvise()
system call on a virtual address within the same 2 MB (huge page) range.

virtual address space
.---------------------.
| |
| |
.-|---------------------|
| | |
| | |< |/////////////////////| > A(range)
page | |/////////////////////|-'
| | |
| | |
'-|---------------------|
| |
| |
'---------------------'

- Thread A is executing an madvise(..., MADV_DONTNEED) system call
on the virtual address range "A(range)" shown in the picture.

sys_madvise
// Acquire the semaphore in shared mode.
down_read(¤t->mm->mmap_sem)
...
madvise_vma
switch (behavior)
case MADV_DONTNEED:
madvise_dontneed
zap_page_range
unmap_vmas
unmap_page_range
zap_pud_range
zap_pmd_range
//
// Assume that this huge page has never been accessed.
// I.e. content of the PMD entry is zero (not mapped).
//
if (pmd_trans_huge(*pmd)) {
// We don't get here due to the above assumption.
}
//
// Assume that Thread B incurred a page fault and
.---------> // sneaks in here as shown below.
| //
| if (pmd_none_or_clear_bad(pmd))
| {
| if (unlikely(pmd_bad(*pmd)))
| pmd_clear_bad
| {
| pmd_ERROR
| // Log "bad pmd ..." message here.
| pmd_clear
| // Clear the page's PMD entry.
| // Thread B incremented the map count
| // in page_add_new_anon_rmap(), but
| // now the page is no longer mapped
| // by a PMD entry (-> inconsistency).
| }
| }
|
v
- Thread B is handling a page fault on virtual address "B(fault)" shown
in the picture.

...
do_page_fault
__do_page_fault
// Acquire the semaphore in shared mode.
down_read_trylock(&mm->mmap_sem)
...
handle_mm_fault
if (pmd_none(*pmd) && transparent_hugepage_enabled(vma))
// We get here due to the above assumption (PMD entry is zero).
do_huge_pmd_anonymous_page
alloc_hugepage_vma
// Allocate a new transparent huge page here.
...
__do_huge_pmd_anonymous_page
...
spin_lock(&mm->page_table_lock)
...
page_add_new_anon_rmap
// Here we increment the page's map count (starts at -1).
atomic_set(&page->_mapcount, 0)
set_pmd_at
// Here we set the page's PMD entry which will be cleared
// when Thread A calls pmd_clear_bad().
...
spin_unlock(&mm->page_table_lock)

The mmap_sem does not prevent the race because both threads are acquiring
it in shared mode (down_read). Thread B holds the page_table_lock while
the page's map count and PMD table entry are updated. However, Thread A
does not synchronize on that lock.

====== end quote =======

[akpm@linux-foundation.org: checkpatch fixes]
Reported-by: Ulrich Obergfell
Signed-off-by: Andrea Arcangeli
Acked-by: Johannes Weiner
Cc: Mel Gorman
Cc: Hugh Dickins
Cc: Dave Jones
Acked-by: Larry Woodman
Acked-by: Rik van Riel
Cc: Mark Salter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman

Andrea Arcangeli
2012-04-03 01:31:53 +0800
296dd3428 sysfs: Fix memory leak in sysfs_sd_setsecdata(). ... Browse Code »

commit 93518dd2ebafcc761a8637b2877008cfd748c202 upstream.

This patch fixies follwing two memory leak patterns that reported by kmemleak.
sysfs_sd_setsecdata() is called during sys_lsetxattr() operation.
It checks sd->s_iattr is NULL or not. Then if it is NULL, it calls
sysfs_init_inode_attrs() to allocate memory.
That code is this.

iattrs = sd->s_iattr;
if (!iattrs)
iattrs = sysfs_init_inode_attrs(sd);

The iattrs recieves sysfs_init_inode_attrs()'s result, but sd->s_iattr
doesn't know the address. so it needs to set correct address to
sd->s_iattr to free memory in other function.

unreferenced object 0xffff880250b73e60 (size 32):
comm "systemd", pid 1, jiffies 4294683888 (age 94.553s)
hex dump (first 32 bytes):
73 79 73 74 65 6d 5f 75 3a 6f 62 6a 65 63 74 5f system_u:object_
72 3a 73 79 73 66 73 5f 74 3a 73 30 00 00 00 00 r:sysfs_t:s0....
backtrace:
[] kmemleak_alloc+0x73/0x98
[] __kmalloc+0x100/0x12c
[] context_struct_to_string+0x106/0x210
[] security_sid_to_context_core+0x10b/0x129
[] security_sid_to_context+0x10/0x12
[] selinux_inode_getsecurity+0x7d/0xa8
[] selinux_inode_getsecctx+0x22/0x2e
[] security_inode_getsecctx+0x16/0x18
[] sysfs_setxattr+0x96/0x117
[] __vfs_setxattr_noperm+0x73/0xd9
[] vfs_setxattr+0x83/0xa1
[] setxattr+0xcf/0x101
[] sys_lsetxattr+0x6a/0x8f
[] system_call_fastpath+0x16/0x1b
[] 0xffffffffffffffff
unreferenced object 0xffff88024163c5a0 (size 96):
comm "systemd", pid 1, jiffies 4294683888 (age 94.553s)
hex dump (first 32 bytes):
00 00 00 00 ed 41 00 00 00 00 00 00 00 00 00 00 .....A..........
00 00 00 00 00 00 00 00 0c 64 42 4f 00 00 00 00 .........dBO....
backtrace:
[] kmemleak_alloc+0x73/0x98
[] kmem_cache_alloc_trace+0xc4/0xee
[] sysfs_init_inode_attrs+0x2a/0x83
[] sysfs_setxattr+0xbf/0x117
[] __vfs_setxattr_noperm+0x73/0xd9
[] vfs_setxattr+0x83/0xa1
[] setxattr+0xcf/0x101
[] sys_lsetxattr+0x6a/0x8f
[] system_call_fastpath+0x16/0x1b
[] 0xffffffffffffffff
`

Signed-off-by: Masami Ichikawa
Signed-off-by: Greg Kroah-Hartman

Masami Ichikawa
2012-04-03 01:31:34 +0800

19 Mar, 2012

1 commit

93dc6107a Don't limit non-nested epoll paths ... Browse Code »
1

Commit 28d82dc1c4ed ("epoll: limit paths") that I did to limit the
number of possible wakeup paths in epoll is causing a few applications
to longer work (dovecot for one).

The original patch is really about limiting the amount of epoll nesting
(since epoll fds can be attached to other fds). Thus, we probably can
allow an unlimited number of paths of depth 1. My current patch limits
it at 1000. And enforce the limits on paths that have a greater depth.

This is captured in: https://bugzilla.redhat.com/show_bug.cgi?id=681578

Signed-off-by: Jason Baron
Cc: Andrew Morton
Signed-off-by: Linus Torvalds

Jason Baron
2012-03-19 03:25:04 +0800

17 Mar, 2012

5 commits

cb1ecf25a Merge branch 'akpm' (more patches from Andrew) ... Browse Code »

Merge some more email patches from Andrew Morton:
"A couple of nilfs fixes"

* emailed from Andrew Morton :
nilfs2: fix NULL pointer dereference in nilfs_load_super_block()
nilfs2: clamp ns_r_segments_percentage to [1, 99]

Linus Torvalds
2012-03-17 08:14:55 +0800
d7178c79d nilfs2: fix NULL pointer dereference in nilfs_load_super_block() ... Browse Code »
1

According to the report from Slicky Devil, nilfs caused kernel oops at
nilfs_load_super_block function during mount after he shrank the
partition without resizing the filesystem:

BUG: unable to handle kernel NULL pointer dereference at 00000048
IP: [] nilfs_load_super_block+0x17e/0x280 [nilfs2]
*pde = 00000000
Oops: 0000 [#1] PREEMPT SMP
...
Call Trace:
[] init_nilfs+0x4b/0x2e0 [nilfs2]
[] nilfs_mount+0x447/0x5b0 [nilfs2]
[] mount_fs+0x36/0x180
[] vfs_kern_mount+0x51/0xa0
[] do_kern_mount+0x3e/0xe0
[] do_mount+0x169/0x700
[] sys_mount+0x6b/0xa0
[] sysenter_do_call+0x12/0x28
Code: 53 18 8b 43 20 89 4b 18 8b 4b 24 89 53 1c 89 43 24 89 4b 20 8b 43
20 c7 43 2c 00 00 00 00 23 75 e8 8b 50 68 89 53 28 8b 54 b3 20 72
48 8b 7a 4c 8b 55 08 89 b3 84 00 00 00 89 bb 88 00 00 00
EIP: [] nilfs_load_super_block+0x17e/0x280 [nilfs2] SS:ESP 0068:ca9bbdcc
CR2: 0000000000000048

This turned out due to a defect in an error path which runs if the
calculated location of the secondary super block was invalid.

This patch fixes it and eliminates the reported oops.

Reported-by: Slicky Devil
Signed-off-by: Ryusuke Konishi
Tested-by: Slicky Devil
Cc: [2.6.30+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ryusuke Konishi
2012-03-17 08:14:44 +0800
3d777a640 nilfs2: clamp ns_r_segments_percentage to [1, 99] ... Browse Code »

ns_r_segments_percentage is read from the disk. Bogus or malicious
value could cause integer overflow and malfunction due to meaningless
disk usage calculation. This patch reports error when mounting such
bogus volumes.

Signed-off-by: Haogang Chen
Signed-off-by: Ryusuke Konishi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Haogang Chen
2012-03-17 08:14:44 +0800
c01738635 afs: Remote abort can cause BUG in rxrpc code ... Browse Code »
1

When writing files to afs I sometimes hit a BUG:

kernel BUG at fs/afs/rxrpc.c:179!

With a backtrace of:

afs_free_call
afs_make_call
afs_fs_store_data
afs_vnode_store_data
afs_write_back_from_locked_page
afs_writepages_region
afs_writepages

The cause is:

ASSERT(skb_queue_empty(&call->rx_queue));

Looking at a tcpdump of the session the abort happens because we
are exceeding our disk quota:

rx abort fs reply store-data error diskquota exceeded (32)

So the abort error is valid. We hit the BUG because we haven't
freed all the resources for the call.

By freeing any skbs in call->rx_queue before calling afs_free_call
we avoid hitting leaking memory and avoid hitting the BUG.

Signed-off-by: Anton Blanchard
Signed-off-by: David Howells
Cc:
Signed-off-by: Linus Torvalds

Anton Blanchard
2012-03-17 08:01:41 +0800
2c724fb92 afs: Read of file returns EBADMSG ... Browse Code »
1

A read of a large file on an afs mount failed:

# cat junk.file > /dev/null
cat: junk.file: Bad message

Looking at the trace, call->offset wrapped since it is only an
unsigned short. In afs_extract_data:

_enter("{%u},{%zu},%d,,%zu", call->offset, len, last, count);
...

if (call->offset < count) {
if (last) {
_leave(" = -EBADMSG [%d < %zu]", call->offset, count);
return -EBADMSG;
}

Which matches the trace:

[cat ] ==> afs_extract_data({65132},{524},1,,65536)
[cat ] < 65536]

call->offset went from 65132 to 0. Fix this by making call->offset an
unsigned int.

Signed-off-by: Anton Blanchard
Signed-off-by: David Howells
Cc:
Signed-off-by: Linus Torvalds

Anton Blanchard
2012-03-17 08:01:41 +0800

15 Mar, 2012

1 commit

f1cbd03f5 Merge branch 'for-linus' of git://git.kernel.dk/linux-block ... Browse Code »

Pull block fixes from Jens Axboe:
"Been sitting on this for a while, but lets get this out the door.
This fixes various important bugs for 3.3 final, along with a few more
trivial ones. Please pull!"

* 'for-linus' of git://git.kernel.dk/linux-block:
block: fix ioc leak in put_io_context
block, sx8: fix pointer math issue getting fw version
Block: use a freezable workqueue for disk-event polling
drivers/block/DAC960: fix -Wuninitialized warning
drivers/block/DAC960: fix DAC960_V2_IOCTL_Opcode_T -Wenum-compare warning
block: fix __blkdev_get and add_disk race condition
block: Fix setting bio flags in drivers (sd_dif/floppy)
block: Fix NULL pointer dereference in sd_revalidate_disk
block: exit_io_context() should call elevator_exit_icq_fn()
block: simplify ioc_release_fn()
block: replace icq->changed with icq->flags

Linus Torvalds
2012-03-15 08:16:45 +0800

14 Mar, 2012

1 commit

8e8bb96d2 Merge git://git.samba.org/sfrench/cifs-2.6 ... Browse Code »

Pull CIFS fixes from Steve French.

* git://git.samba.org/sfrench/cifs-2.6:
CIFS: Do not kmalloc under the flocks spinlock
cifs: possible memory leak in xattr.

Linus Torvalds
2012-03-14 08:03:53 +0800

11 Mar, 2012

5 commits

310fa7a36 restore smp_mb() in unlock_new_inode() ... Browse Code »

wait_on_inode() doesn't have ->i_lock

Signed-off-by: Al Viro

Al Viro
2012-03-11 06:07:28 +0800
7f6c7e62f vfs: fix return value from do_last() ... Browse Code »
1

complete_walk() returns either ECHILD or ESTALE. do_last() turns this into
ECHILD unconditionally. If not in RCU mode, this error will reach userspace
which is complete nonsense.

Signed-off-by: Miklos Szeredi
CC: stable@vger.kernel.org
Signed-off-by: Al Viro

Miklos Szeredi
2012-03-11 06:05:30 +0800
097b180ca vfs: fix double put after complete_walk() ... Browse Code »
1

complete_walk() already puts nd->path, no need to do it again at cleanup time.

This would result in Oopses if triggered, apparently the codepath is not too
well exercised.

Signed-off-by: Miklos Szeredi
CC: stable@vger.kernel.org
Signed-off-by: Al Viro

Miklos Szeredi
2012-03-11 06:05:30 +0800
f6940fe90 udf: Fix deadlock in udf_release_file() ... Browse Code »

udf_release_file() can be called from munmap() path with mmap_sem held. Thus
we cannot take i_mutex there because that ranks above mmap_sem. Luckily,
i_mutex is not needed in udf_release_file() anymore since protection by
i_data_sem is enough to protect from races with write and truncate.

Reported-by: Al Viro
Reviewed-by: Namjae Jeon
Signed-off-by: Jan Kara
Signed-off-by: Al Viro

Jan Kara
2012-03-11 05:05:38 +0800
978d6d8c4 vfs: Correctly set the dir i_mutex lockdep class ... Browse Code »

9a7aa12f3911853a introduced additional logic around setting the i_mutex
lockdep class for directory inodes. The idea was that some filesystems
may want their own special lockdep class for different directory
inodes and calling unlock_new_inode() should not clobber one of
those special classes.

I believe that the added conditional, around the *negated* return value
of lockdep_match_class(), caused directory inodes to be placed in the
wrong lockdep class.

inode_init_always() sets the i_mutex lockdep class with i_mutex_key for
all inodes. If the filesystem did not change the class during inode
initialization, then the conditional mentioned above was false and the
directory inode was incorrectly left in the non-directory lockdep class.
If the filesystem did set a special lockdep class, then the conditional
mentioned above was true and that class was clobbered with
i_mutex_dir_key.

This patch removes the negation from the conditional so that the i_mutex
lockdep class is properly set for directory inodes. Special classes are
preserved and directory inodes with unmodified classes are set with
i_mutex_dir_key.

Signed-off-by: Tyler Hicks
Reviewed-by: Jan Kara
Signed-off-by: Al Viro

Tyler Hicks
2012-03-11 05:05:38 +0800

10 Mar, 2012

3 commits

c7b285550 aio: fix the "too late munmap()" race ... Browse Code »
1

Current code has put_ioctx() called asynchronously from aio_fput_routine();
that's done *after* we have killed the request that used to pin ioctx,
so there's nothing to stop io_destroy() waiting in wait_for_all_aios()
from progressing. As the result, we can end up with async call of
put_ioctx() being the last one and possibly happening during exit_mmap()
or elf_core_dump(), neither of which expects stray munmap() being done
to them...

We do need to prevent _freeing_ ioctx until aio_fput_routine() is done
with that, but that's all we care about - neither io_destroy() nor
exit_aio() will progress past wait_for_all_aios() until aio_fput_routine()
does really_put_req(), so the ioctx teardown won't be done until then
and we don't care about the contents of ioctx past that point.

Since actual freeing of these suckers is RCU-delayed, we don't need to
bump ioctx refcount when request goes into list for async removal.
All we need is rcu_read_lock held just over the ->ctx_lock-protected
area in aio_fput_routine().

Signed-off-by: Al Viro
Reviewed-by: Jeff Moyer
Acked-by: Benjamin LaHaise
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds

Al Viro
2012-03-10 10:59:59 +0800
86b62a2cb aio: fix io_setup/io_destroy race ... Browse Code »
1

Have ioctx_alloc() return an extra reference, so that caller would drop it
on success and not bother with re-grabbing it on failure exit. The current
code is obviously broken - io_destroy() from another thread that managed
to guess the address io_setup() would've returned would free ioctx right
under us; gets especially interesting if aio_context_t * we pass to
io_setup() points to PROT_READ mapping, so put_user() fails and we end
up doing io_destroy() on kioctx another thread has just got freed...

Signed-off-by: Al Viro
Acked-by: Benjamin LaHaise
Reviewed-by: Jeff Moyer
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds

Al Viro
2012-03-10 10:59:59 +0800
86e060083 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs ... Browse Code »

Pull btrfs updates from Chris Mason:
"I have two additional and btrfs fixes in my for-linus branch. One is
a casting error that leads to memory corruption on i386 during scrub,
and the other fixes a corner case in the backref walking code (also
triggered by scrub)."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix casting error in scrub reada code
btrfs: fix locking issues in find_parent_nodes()

Linus Torvalds
2012-03-10 10:09:18 +0800

07 Mar, 2012

3 commits

d5751469f CIFS: Do not kmalloc under the flocks spinlock ... Browse Code »

Reorganize the code to make the memory already allocated before
spinlock'ed loop.

Cc: stable@vger.kernel.org
Reviewed-by: Jeff Layton
Signed-off-by: Pavel Shilovsky
Signed-off-by: Steve French

Pavel Shilovsky
2012-03-07 11:50:15 +0800
b0f8ef202 cifs: possible memory leak in xattr. ... Browse Code »

Memory is allocated irrespective of whether CIFS_ACL is configured
or not. But free is happenning only if CIFS_ACL is set. This is a
possible memory leak scenario.

Fix is:
Allocate and free memory only if CIFS_ACL is configured.

Signed-off-by: Santosh Nayak
Reviewed-by: Shirish Pargaonkar
Signed-off-by: Steve French

Santosh Nayak
2012-03-07 11:46:53 +0800
71fece951 Merge git://git.samba.org/sfrench/cifs-2.6 ... Browse Code »

Pull CIFS fixes from Steve French

* git://git.samba.org/sfrench/cifs-2.6:
cifs: fix dentry refcount leak when opening a FIFO on lookup
CIFS: Fix mkdir/rmdir bug for the non-POSIX case

Linus Torvalds
2012-03-07 08:55:50 +0800

06 Mar, 2012

1 commit

3e85fb9cd Merge branch 'akpm' (Andrew's patch bomb) ... Browse Code »

Merge the emailed seties of 19 patches from Andrew Morton

* akpm:
rapidio/tsi721: fix queue wrapping bug in inbound doorbell handler
memcg: fix mapcount check in move charge code for anonymous page
mm: thp: fix BUG on mm->nr_ptes
alpha: fix 32/64-bit bug in futex support
memcg: fix GPF when cgroup removal races with last exit
debugobjects: Fix selftest for static warnings
floppy/scsi: fix setting of BIO flags
memcg: fix deadlock by inverting lrucare nesting
drivers/rtc/rtc-r9701.c: fix crash in r9701_remove()
c2port: class_create() returns an ERR_PTR
pps: class_create() returns an ERR_PTR, not NULL
hung_task: fix the broken rcu_lock_break() logic
vfork: kill PF_STARTING
coredump_wait: don't call complete_vfork_done()
vfork: make it killable
vfork: introduce complete_vfork_done()
aio: wake up waiters when freeing unused kiocbs
kprobes: return proper error code from register_kprobe()
kmsg_dump: don't run on non-error paths by default

Linus Torvalds
2012-03-06 07:50:25 +0800