Doug / smarc-fsl-linux-kernel | Embedian Git Server

14 Jan, 2011

3 commits

c585a2678 mm: remove likely() from grab_cache_page_write_begin() ... Browse Code »

Running the annotated branch profiler on a box doing average work
(firefox, evolution, xchat, distcc farm), the likely() used in
grab_cache_page_write_begin() was incorrect most of the time:

correct incorrect % Function File Line
------- --------- - -------- ---- ----
1924262 71332401 97 grab_cache_page_write_begin filemap.c 2206

Adding a trace_printk() and running the function tracer limited to
just this function I can see:

gconfd-2-2696 [000] 4467.268935: grab_cache_page_write_begin: page= (null) mapping=ffff8800676a9460 index=7
gconfd-2-2696 [000] 4467.268946: grab_cache_page_write_begin
Acked-by: Nick Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Steven Rostedt
2011-01-14 09:32:36 +0800
212260aa0 mm: clear PageError bit in msync & fsync ... Browse Code »

Temporary IO failures, eg. due to loss of both multipath paths, can
permanently leave the PageError bit set on a page, resulting in msync or
fsync returning -EIO over and over again, even if IO is now getting to the
disk correctly.

We already clear the AS_ENOSPC and AS_IO bits in mapping->flags in the
filemap_fdatawait_range function. Also clearing the PageError bit on the
page allows subsequent msync or fsync calls on this file to return without
an error, if the subsequent IO succeeds.

Unfortunately data written out in the msync or fsync call that returned
-EIO can still get lost, because the page dirty bit appears to not get
restored on IO error. However, the alternative could be potentially all
of memory filling up with uncleanable dirty pages, hanging the system, so
there is no nice choice here...

Signed-off-by: Rik van Riel
Acked-by: Valerie Aurora
Acked-by: Jeff Layton
Cc: Theodore Ts'o
Acked-by: Jan Kara
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Rik van Riel
2011-01-14 09:32:35 +0800
9cbb4cb21 mm: find_get_pages_contig fixlet ... Browse Code »

Testing ->mapping and ->index without a ref is not stable as the page
may have been reused at this point.

Signed-off-by: Nick Piggin
Reviewed-by: Wu Fengguang
Reviewed-by: Minchan Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nick Piggin
2011-01-14 09:32:32 +0800

07 Jan, 2011

1 commit

b5c84bf6f fs: dcache remove dcache_lock ... Browse Code »

dcache_lock no longer protects anything. remove it.

Signed-off-by: Nick Piggin

Nick Piggin
2011-01-07 14:50:23 +0800

02 Dec, 2010

1 commit

6072d13c4 Call the filesystem back whenever a page is removed from the page cache ... Browse Code »

NFS needs to be able to release objects that are stored in the page
cache once the page itself is no longer visible from the page cache.

This patch adds a callback to the address space operations that allows
filesystems to perform page cleanups once the page has been removed
from the page cache.

Original patch by: Linus Torvalds
[trondmy: cover the cases of invalidate_inode_pages2() and
truncate_inode_pages()]
Signed-off-by: Trond Myklebust

Linus Torvalds
2010-12-02 22:55:21 +0800

12 Nov, 2010

2 commits

27d20fddc radix-tree: fix RCU bug ... Browse Code »

Salman Qazi describes the following radix-tree bug:

In the following case, we get can get a deadlock:

0. The radix tree contains two items, one has the index 0.
1. The reader (in this case find_get_pages) takes the rcu_read_lock.
2. The reader acquires slot(s) for item(s) including the index 0 item.
3. The non-zero index item is deleted, and as a consequence the other item is
moved to the root of the tree. The place where it used to be is queued for
deletion after the readers finish.
3b. The zero item is deleted, removing it from the direct slot, it remains in
the rcu-delayed indirect node.
4. The reader looks at the index 0 slot, and finds that the page has 0 ref
count
5. The reader looks at it again, hoping that the item will either be freed or
the ref count will increase. This never happens, as the slot it is looking
at will never be updated. Also, this slot can never be reclaimed because
the reader is holding rcu_read_lock and is in an infinite loop.

The fix is to re-use the same "indirect" pointer case that requires a slot
lookup retry into a general "retry the lookup" bit.

Signed-off-by: Nick Piggin
Reported-by: Salman Qazi
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nick Piggin
2010-11-12 23:55:32 +0800
8d056cb96 mm/vfs: revalidate page->mapping in do_generic_file_read() ... Browse Code »

70 hours into some stress tests of a 2.6.32-based enterprise kernel, we
ran into a NULL dereference in here:

int block_is_partially_uptodate(struct page *page, read_descriptor_t *desc,
unsigned long from)
{
----> struct inode *inode = page->mapping->host;

It looks like page->mapping was the culprit. (xmon trace is below).
After closer examination, I realized that do_generic_file_read() does a
find_get_page(), and eventually locks the page before calling
block_is_partially_uptodate(). However, it doesn't revalidate the
page->mapping after the page is locked. So, there's a small window
between the find_get_page() and ->is_partially_uptodate() where the page
could get truncated and page->mapping cleared.

We _have_ a reference, so it can't get reclaimed, but it certainly
can be truncated.

I think the correct thing is to check page->mapping after the
trylock_page(), and jump out if it got truncated. This patch has been
running in the test environment for a month or so now, and we have not
seen this bug pop up again.

xmon info:

1f:mon> e
cpu 0x1f: Vector: 300 (Data Access) at [c0000002ae36f770]
pc: c0000000001e7a6c: .block_is_partially_uptodate+0xc/0x100
lr: c000000000142944: .generic_file_aio_read+0x1e4/0x770
sp: c0000002ae36f9f0
msr: 8000000000009032
dar: 0
dsisr: 40000000
current = 0xc000000378f99e30
paca = 0xc000000000f66300
pid = 21946, comm = bash
1f:mon> r
R00 = 0025c0500000006d R16 = 0000000000000000
R01 = c0000002ae36f9f0 R17 = c000000362cd3af0
R02 = c000000000e8cd80 R18 = ffffffffffffffff
R03 = c0000000031d0f88 R19 = 0000000000000001
R04 = c0000002ae36fa68 R20 = c0000003bb97b8a0
R05 = 0000000000000000 R21 = c0000002ae36fa68
R06 = 0000000000000000 R22 = 0000000000000000
R07 = 0000000000000001 R23 = c0000002ae36fbb0
R08 = 0000000000000002 R24 = 0000000000000000
R09 = 0000000000000000 R25 = c000000362cd3a80
R10 = 0000000000000000 R26 = 0000000000000002
R11 = c0000000001e7b60 R27 = 0000000000000000
R12 = 0000000042000484 R28 = 0000000000000001
R13 = c000000000f66300 R29 = c0000003bb97b9b8
R14 = 0000000000000001 R30 = c000000000e28a08
R15 = 000000000000ffff R31 = c0000000031d0f88
pc = c0000000001e7a6c .block_is_partially_uptodate+0xc/0x100
lr = c000000000142944 .generic_file_aio_read+0x1e4/0x770
msr = 8000000000009032 cr = 22000488
ctr = c0000000001e7a60 xer = 0000000020000000 trap = 300
dar = 0000000000000000 dsisr = 40000000
1f:mon> t
[link register ] c000000000142944 .generic_file_aio_read+0x1e4/0x770
[c0000002ae36f9f0] c000000000142a14 .generic_file_aio_read+0x2b4/0x770 (unreliable)
[c0000002ae36fb40] c0000000001b03e4 .do_sync_read+0xd4/0x160
[c0000002ae36fce0] c0000000001b153c .vfs_read+0xec/0x1f0
[c0000002ae36fd80] c0000000001b1768 .SyS_read+0x58/0xb0
[c0000002ae36fe30] c00000000000852c syscall_exit+0x0/0x40
--- Exception: c00 (System Call) at 00000080a840bc54
SP (fffca15df30) is in userspace
1f:mon> di c0000000001e7a6c
c0000000001e7a6c e9290000 ld r9,0(r9)
c0000000001e7a70 418200c0 beq c0000000001e7b30 # .block_is_partially_uptodate+0xd0/0x100
c0000000001e7a74 e9440008 ld r10,8(r4)
c0000000001e7a78 78a80020 clrldi r8,r5,32
c0000000001e7a7c 3c000001 lis r0,1
c0000000001e7a80 812900a8 lwz r9,168(r9)
c0000000001e7a84 39600001 li r11,1
c0000000001e7a88 7c080050 subf r0,r8,r0
c0000000001e7a8c 7f805040 cmplw cr7,r0,r10
c0000000001e7a90 7d6b4830 slw r11,r11,r9
c0000000001e7a94 796b0020 clrldi r11,r11,32
c0000000001e7a98 419d00a8 bgt cr7,c0000000001e7b40 # .block_is_partially_uptodate+0xe0/0x100
c0000000001e7a9c 7fa55840 cmpld cr7,r5,r11
c0000000001e7aa0 7d004214 add r8,r0,r8
c0000000001e7aa4 79080020 clrldi r8,r8,32
c0000000001e7aa8 419c0078 blt cr7,c0000000001e7b20 # .block_is_partially_uptodate+0xc0/0x100

Signed-off-by: Dave Hansen
Reviewed-by: Minchan Kim
Reviewed-by: Johannes Weiner
Acked-by: Rik van Riel
Cc:
Cc:
Cc: Christoph Hellwig
Cc: Al Viro
Cc: Minchan Kim
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Dave Hansen
2010-11-12 23:55:31 +0800

03 Nov, 2010

1 commit

d88c0922f Release page reference during page fault retry ... Browse Code »

This slipped by when unifying the filemap and swap versions of
lock_page_or_retry()...

Signed-off-by: Michel Lespinasse
Acked-by: Rik van Riel
Signed-off-by: Linus Torvalds

Michel Lespinasse
2010-11-03 05:02:31 +0800

27 Oct, 2010

3 commits

0116651c8 mm: remove temporary variable on generic_file_direct_write() ... Browse Code »

'end' shadows earlier one and is not necessary at all. Remove it and use
'pos' instead. This removes following sparse warnings:

mm/filemap.c:2180:24: warning: symbol 'end' shadows an earlier one
mm/filemap.c:2132:25: originally declared here

Signed-off-by: Namhyung Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Namhyung Kim
2010-10-27 07:52:09 +0800
d065bd810 mm: retry page fault when blocking on disk transfer ... Browse Code »

This change reduces mmap_sem hold times that are caused by waiting for
disk transfers when accessing file mapped VMAs.

It introduces the VM_FAULT_ALLOW_RETRY flag, which indicates that the call
site wants mmap_sem to be released if blocking on a pending disk transfer.
In that case, filemap_fault() returns the VM_FAULT_RETRY status bit and
do_page_fault() will then re-acquire mmap_sem and retry the page fault.

It is expected that the retry will hit the same page which will now be
cached, and thus it will complete with a low mmap_sem hold time.

Tests:

- microbenchmark: thread A mmaps a large file and does random read accesses
to the mmaped area - achieves about 55 iterations/s. Thread B does
mmap/munmap in a loop at a separate location - achieves 55 iterations/s
before, 15000 iterations/s after.

- We are seeing related effects in some applications in house, which show
significant performance regressions when running without this change.

[akpm@linux-foundation.org: fix warning & crash]
Signed-off-by: Michel Lespinasse
Acked-by: Rik van Riel
Acked-by: Linus Torvalds
Cc: Nick Piggin
Reviewed-by: Wu Fengguang
Cc: Ying Han
Cc: Peter Zijlstra
Cc: Ingo Molnar
Cc: Thomas Gleixner
Acked-by: "H. Peter Anvin"
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michel Lespinasse
2010-10-27 07:52:09 +0800
b522c94da mm: filemap_fault: unique path for locking page ... Browse Code »

Introduce a single location where filemap_fault() locks the desired page.
There used to be two such places, depending if the initial find_get_page()
was successful or not.

Signed-off-by: Michel Lespinasse
Acked-by: Rik van Riel
Acked-by: Linus Torvalds
Cc: Nick Piggin
Reviewed-by: Wu Fengguang
Cc: Ying Han
Cc: Peter Zijlstra
Cc: Ingo Molnar
Cc: Thomas Gleixner
Cc: "H. Peter Anvin"
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michel Lespinasse
2010-10-27 07:52:09 +0800

10 Aug, 2010

1 commit

4e60c86bd gcc-4.6: mm: fix unused but set warnings ... Browse Code »

No real bugs, just some dead code and some fixups.

Signed-off-by: Andi Kleen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andi Kleen
2010-08-10 11:44:58 +0800

31 May, 2010

1 commit

003386fff Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse ... Browse Code »

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
mm: export generic_pipe_buf_*() to modules
fuse: support splice() reading from fuse device
fuse: allow splice to move pages
mm: export remove_from_page_cache() to modules
mm: export lru_cache_add_*() to modules
fuse: support splice() writing to fuse device
fuse: get page reference for readpages
fuse: use get_user_pages_fast()
fuse: remove unneeded variable

Linus Torvalds
2010-05-31 00:16:14 +0800

28 May, 2010

1 commit

105a048a4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable ... Browse Code »

* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (27 commits)
Btrfs: add more error checking to btrfs_dirty_inode
Btrfs: allow unaligned DIO
Btrfs: drop verbose enospc printk
Btrfs: Fix block generation verification race
Btrfs: fix preallocation and nodatacow checks in O_DIRECT
Btrfs: avoid ENOSPC errors in btrfs_dirty_inode
Btrfs: move O_DIRECT space reservation to btrfs_direct_IO
Btrfs: rework O_DIRECT enospc handling
Btrfs: use async helpers for DIO write checksumming
Btrfs: don't walk around with task->state != TASK_RUNNING
Btrfs: do aio_write instead of write
Btrfs: add basic DIO read/write support
direct-io: do not merge logically non-contiguous requests
direct-io: add a hook for the fs to provide its own submit_bio function
fs: allow short direct-io reads to be completed via buffered IO
Btrfs: Metadata ENOSPC handling for balance
Btrfs: Pre-allocate space for data relocation
Btrfs: Metadata ENOSPC handling for tree log
Btrfs: Metadata reservation for orphan inodes
Btrfs: Introduce global metadata reservation
...

Linus Torvalds
2010-05-28 01:43:44 +0800

27 May, 2010

1 commit

91803b499 do_generic_file_read: clear page errors when issuing a fresh read of the page ... Browse Code »

I/O errors can happen due to temporary failures, like multipath
errors or losing network contact with the iSCSI server. Because
of that, the VM will retry readpage on the page.

However, do_generic_file_read does not clear PG_error. This
causes the system to be unable to actually use the data in the
page cache page, even if the subsequent readpage completes
successfully!

The function filemap_fault has had a ClearPageError before
readpage forever. This patch simply adds the same to
do_generic_file_read.

Signed-off-by: Jeff Moyer
Signed-off-by: Rik van Riel
Acked-by: Larry Woodman
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds

Jeff Moyer
2010-05-27 01:20:27 +0800

25 May, 2010

4 commits

c0ff7453b cpuset,mm: fix no node to alloc memory when changing cpuset's mems ... Browse Code »

Before applying this patch, cpuset updates task->mems_allowed and
mempolicy by setting all new bits in the nodemask first, and clearing all
old unallowed bits later. But in the way, the allocator may find that
there is no node to alloc memory.

The reason is that cpuset rebinds the task's mempolicy, it cleans the
nodes which the allocater can alloc pages on, for example:

(mpol: mempolicy)
task1 task1's mpol task2
alloc page 1
alloc on node0? NO 1
1 change mems from 1 to 0
1 rebind task1's mpol
0-1 set new bits
0 clear disallowed bits
alloc on node1? NO 0
...
can't alloc page
goto oom

This patch fixes this problem by expanding the nodes range first(set newly
allowed bits) and shrink it lazily(clear newly disallowed bits). So we
use a variable to tell the write-side task that read-side task is reading
nodemask, and the write-side task clears newly disallowed nodes after
read-side task ends the current memory allocation.

[akpm@linux-foundation.org: fix spello]
Signed-off-by: Miao Xie
Cc: David Rientjes
Cc: Nick Piggin
Cc: Paul Menage
Cc: Lee Schermerhorn
Cc: Hugh Dickins
Cc: Ravikiran Thirumalai
Cc: KOSAKI Motohiro
Cc: Christoph Lameter
Cc: Andi Kleen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Miao Xie
2010-05-25 23:06:57 +0800
e9d6c1573 tmpfs: insert tmpfs cache pages to inactive list at first ... Browse Code »

Shaohua Li reported parallel file copy on tmpfs can lead to OOM killer.
This is regression of caused by commit 9ff473b9a7 ("vmscan: evict
streaming IO first"). Wow, It is 2 years old patch!

Currently, tmpfs file cache is inserted active list at first. This means
that the insertion doesn't only increase numbers of pages in anon LRU, but
it also reduces anon scanning ratio. Therefore, vmscan will get totally
confused. It scans almost only file LRU even though the system has plenty
unused tmpfs pages.

Historically, lru_cache_add_active_anon() was used for two reasons.
1) Intend to priotize shmem page rather than regular file cache.
2) Intend to avoid reclaim priority inversion of used once pages.

But we've lost both motivation because (1) Now we have separate anon and
file LRU list. then, to insert active list doesn't help such priotize.
(2) In past, one pte access bit will cause page activation. then to
insert inactive list with pte access bit mean higher priority than to
insert active list. Its priority inversion may lead to uninteded lru
chun. but it was already solved by commit 645747462 (vmscan: detect
mapped file pages used only once). (Thanks Hannes, you are great!)

Thus, now we can use lru_cache_add_anon() instead.

Signed-off-by: KOSAKI Motohiro
Reported-by: Shaohua Li
Reviewed-by: Wu Fengguang
Reviewed-by: Johannes Weiner
Reviewed-by: Rik van Riel
Reviewed-by: Minchan Kim
Acked-by: Hugh Dickins
Cc: Henrique de Moraes Holschuh
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KOSAKI Motohiro
2010-05-25 23:06:56 +0800
66f998f61 fs: allow short direct-io reads to be completed via buffered IO ... Browse Code »

This is similar to what already happens in the write case. If we have a short
read while doing O_DIRECT, instead of just returning, fallthrough and try to
read the rest via buffered IO. BTRFS needs this because if we encounter a
compressed or inline extent during DIO, we need to fallback on buffered. If the
extent is compressed we need to read the entire thing into memory and
de-compress it into the users pages. I have tested this with fsx and everything
works great. Thanks,

Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Josef Bacik
2010-05-25 22:34:55 +0800
a52116aba mm: export remove_from_page_cache() to modules ... Browse Code »

This is needed to enable moving pages into the page cache in fuse with
splice(..., SPLICE_F_MOVE).

Signed-off-by: Miklos Szeredi

Miklos Szeredi
2010-05-25 21:06:06 +0800

30 Mar, 2010

1 commit

5a0e3ad6a include cleanup: Update gfp.h and slab.h includes to prepare for breaking implic… ... Browse Code »

…it slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.

2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).

* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

Tejun Heo
2010-03-30 21:02:32 +0800

07 Mar, 2010

1 commit

59e99e5b9 mm: use rlimit helpers ... Browse Code »

Make sure compiler won't do weird things with limits. E.g. fetching them
twice may return 2 different values after writable limits are implemented.

I.e. either use rlimit helpers added in
3e10e716abf3c71bdb5d86b8f507f9e72236c9cd ("resource: add helpers for
fetching rlimits") or ACCESS_ONCE if not applicable.

Signed-off-by: Jiri Slaby
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jiri Slaby
2010-03-07 03:26:24 +0800

04 Mar, 2010

1 commit

2ecdc82ef kill unused invalidate_inode_pages helper ... Browse Code »

No one is calling this anymore as everyone has switched to
invalidate_mapping_pages long time ago. Also update a few
references to it in comments. nfs has two more, but I can't
easily figure what they are actually referring to, so I left
them as-is.

Signed-off-by: Christoph Hellwig
Signed-off-by: Al Viro

Christoph Hellwig
2010-03-04 03:07:55 +0800

03 Feb, 2010

1 commit

931e80e4b mm: flush dcache before writing into page to avoid alias ... Browse Code »

The cache alias problem will happen if the changes of user shared mapping
is not flushed before copying, then user and kernel mapping may be mapped
into two different cache line, it is impossible to guarantee the coherence
after iov_iter_copy_from_user_atomic. So the right steps should be:

flush_dcache_page(page);
kmap_atomic(page);
write to page;
kunmap_atomic(page);
flush_dcache_page(page);

More precisely, we might create two new APIs flush_dcache_user_page and
flush_dcache_kern_page to replace the two flush_dcache_page accordingly.

Here is a snippet tested on omap2430 with VIPT cache, and I think it is
not ARM-specific:

int val = 0x11111111;
fd = open("abc", O_RDWR);
addr = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
*(addr+0) = 0x44444444;
tmp = *(addr+0);
*(addr+1) = 0x77777777;
write(fd, &val, sizeof(int));
close(fd);

The results are not always 0x11111111 0x77777777 at the beginning as expected. Sometimes we see 0x44444444 0x77777777.

Signed-off-by: Anfei
Cc: Russell King
Cc: Miklos Szeredi
Cc: Nick Piggin
Cc:
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

anfei zhou
2010-02-03 10:11:21 +0800

28 Jan, 2010

1 commit

0531b2aac mm: add new 'read_cache_page_gfp()' helper function ... Browse Code »

It's a simplified 'read_cache_page()' which takes a page allocation
flag, so that different paths can control how aggressive the memory
allocations are that populate a address space.

In particular, the intel GPU object mapping code wants to be able to do
a certain amount of own internal memory management by automatically
shrinking the address space when memory starts getting tight. This
allows it to dynamically use different memory allocation policies on a
per-allocation basis, rather than depend on the (static) address space
gfp policy.

The actual new function is a one-liner, but re-organizing the helper
functions to the point where you can do this with a single line of code
is what most of the patch is all about.

Tested-by: Chris Wilson
Signed-off-by: Linus Torvalds

Linus Torvalds
2010-01-28 01:20:03 +0800

17 Dec, 2009

1 commit

c05c4edd8 direct I/O fallback sync simplification ... Browse Code »

In the case of direct I/O falling back to buffered I/O we sync data
twice currently: once at the end of generic_file_buffered_write using
filemap_write_and_wait_range and once a little later in
__generic_file_aio_write using do_sync_mapping_range with all flags set.

The wait before write of the do_sync_mapping_range call does not make
any sense, so just keep the filemap_write_and_wait_range call and move
it to the right spot.

Signed-off-by: Christoph Hellwig
Signed-off-by: Al Viro

Christoph Hellwig
2009-12-17 01:16:50 +0800

10 Dec, 2009

1 commit

94004ed72 kill wait_on_page_writeback_range ... Browse Code »

All callers really want the more logical filemap_fdatawait_range interface,
so convert them to use it and merge wait_on_page_writeback_range into
filemap_fdatawait_range.

Signed-off-by: Christoph Hellwig
Signed-off-by: Jan Kara

Christoph Hellwig
2009-12-10 22:02:50 +0800

04 Dec, 2009

1 commit

af901ca18 tree-wide: fix assorted typos all over the place ... Browse Code »

That is "success", "unknown", "through", "performance", "[re|un]mapping"
, "access", "default", "reasonable", "[con]currently", "temperature"
, "channel", "[un]used", "application", "example","hierarchy", "therefore"
, "[over|under]flow", "contiguous", "threshold", "enough" and others.

Signed-off-by: André Goddard Rosa
Signed-off-by: Jiri Kosina

André Goddard Rosa
2009-12-04 22:39:55 +0800

28 Sep, 2009

1 commit

f0f37e2f7 const: mark struct vm_struct_operations ... Browse Code »

* mark struct vm_area_struct::vm_ops as const
* mark vm_ops in AGP code

But leave TTM code alone, something is fishy there with global vm_ops
being used.

Signed-off-by: Alexey Dobriyan
Signed-off-by: Linus Torvalds

Alexey Dobriyan
2009-09-28 02:39:25 +0800

24 Sep, 2009

3 commits

6c5daf012 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 ... Browse Code »

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
truncate: use new helpers
truncate: new helpers
fs: fix overflow in sys_mount() for in-kernel calls
fs: Make unload_nls() NULL pointer safe
freeze_bdev: grab active reference to frozen superblocks
freeze_bdev: kill bd_mount_sem
exofs: remove BKL from super operations
fs/romfs: correct error-handling code
vfs: seq_file: add helpers for data filling
vfs: remove redundant position check in do_sendfile
vfs: change sb->s_maxbytes to a loff_t
vfs: explicitly cast s_maxbytes in fiemap_check_ranges
libfs: return error code on failed attr set
seq_file: return a negative error code when seq_path_root() fails.
vfs: optimize touch_time() too
vfs: optimization for touch_atime()
vfs: split generic_forget_inode() so that hugetlbfs does not have to copy it
fs/inode.c: add dev-id and inode number for debugging in init_special_inode()
libfs: make simple_read_from_buffer conventional

Linus Torvalds
2009-09-24 23:32:11 +0800
db1682636 Merge branch 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6 ... Browse Code »

* 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
HWPOISON: Enable error_remove_page on btrfs
HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
HWPOISON: Add madvise() based injector for hardware poisoned pages v4
HWPOISON: Enable error_remove_page for NFS
HWPOISON: Enable .remove_error_page for migration aware file systems
HWPOISON: The high level memory error handler in the VM v7
HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
HWPOISON: shmem: call set_page_dirty() with locked page
HWPOISON: Define a new error_remove_page address space op for async truncation
HWPOISON: Add invalidate_inode_page
HWPOISON: Refactor truncate to allow direct truncating of page v2
HWPOISON: check and isolate corrupted free pages v2
HWPOISON: Handle hardware poisoned pages in try_to_unmap
HWPOISON: Use bitmask/action code for try_to_unmap behaviour
HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
HWPOISON: Add poison check to page fault handling
HWPOISON: Add basic support for poisoned pages in fault handler v3
HWPOISON: Add new SIGBUS error codes for hardware poison signals
HWPOISON: Add support for poison swap entries v2
HWPOISON: Export some rmap vma locking to outside world
...

Linus Torvalds
2009-09-24 22:53:22 +0800
25d9e2d15 truncate: new helpers ... Browse Code »

Introduce new truncate helpers truncate_pagecache and inode_newsize_ok.
vmtruncate is also consolidated from mm/memory.c and mm/nommu.c and
into mm/truncate.c.

Reviewed-by: Christoph Hellwig
Signed-off-by: Nick Piggin
Signed-off-by: Al Viro

npiggin@suse.de
2009-09-24 20:41:47 +0800

22 Sep, 2009

1 commit

4b02108ac mm: oom analysis: add shmem vmstat ... Browse Code »

Recently we encountered OOM problems due to memory use of the GEM cache.
Generally a large amuont of Shmem/Tmpfs pages tend to create a memory
shortage problem.

We often use the following calculation to determine the amount of shmem
pages:

shmem = NR_ACTIVE_ANON + NR_INACTIVE_ANON - NR_ANON_PAGES

however the expression does not consider isolated and mlocked pages.

This patch adds explicit accounting for pages used by shmem and tmpfs.

Signed-off-by: KOSAKI Motohiro
Acked-by: Rik van Riel
Reviewed-by: Christoph Lameter
Acked-by: Wu Fengguang
Cc: David Rientjes
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KOSAKI Motohiro
2009-09-22 22:17:27 +0800

16 Sep, 2009

1 commit

6a46079cf HWPOISON: The high level memory error handler in the VM v7 ... Browse Code »

Add the high level memory handler that poisons pages
that got corrupted by hardware (typically by a two bit flip in a DIMM
or a cache) on the Linux level. The goal is to prevent everyone
from accessing these pages in the future.

This done at the VM level by marking a page hwpoisoned
and doing the appropriate action based on the type of page
it is.

The code that does this is portable and lives in mm/memory-failure.c

To quote the overview comment:

High level machine check handler. Handles pages reported by the
hardware as being corrupted usually due to a 2bit ECC memory or cache
failure.

This focuses on pages detected as corrupted in the background.
When the current CPU tries to consume corruption the currently
running process can just be killed directly instead. This implies
that if the error cannot be handled for some reason it's safe to
just ignore it because no corruption has been consumed yet. Instead
when that happens another machine check will happen.

Handles page cache pages in various states. The tricky part
here is that we can access any page asynchronous to other VM
users, because memory failures could happen anytime and anywhere,
possibly violating some of their assumptions. This is why this code
has to be extremely careful. Generally it tries to use normal locking
rules, as in get the standard locks, even if that means the
error handling takes potentially a long time.

Some of the operations here are somewhat inefficient and have non
linear algorithmic complexity, because the data structures have not
been optimized for this case. This is in particular the case
for the mapping from a vma to a process. Since this case is expected
to be rare we hope we can get away with this.

There are in principle two strategies to kill processes on poison:
- just unmap the data and wait for an actual reference before
killing
- kill as soon as corruption is detected.
Both have advantages and disadvantages and should be used
in different situations. Right now both are implemented and can
be switched with a new sysctl vm.memory_failure_early_kill
The default is early kill.

The patch does some rmap data structure walking on its own to collect
processes to kill. This is unusual because normally all rmap data structure
knowledge is in rmap.c only. I put it here for now to keep
everything together and rmap knowledge has been seeping out anyways

Includes contributions from Johannes Weiner, Chris Mason, Fengguang Wu,
Nick Piggin (who did a lot of great work) and others.

Cc: npiggin@suse.de
Cc: riel@redhat.com
Signed-off-by: Andi Kleen
Acked-by: Rik van Riel
Reviewed-by: Hidehiro Kawai

Andi Kleen
2009-09-16 17:50:15 +0800

14 Sep, 2009

6 commits

18f2ee705 vfs: Remove generic_osync_inode() and sync_page_range{_nolock}() ... Browse Code »

Remove these three functions since nobody uses them anymore.

Signed-off-by: Jan Kara

Jan Kara
2009-09-14 23:08:17 +0800
148f948ba vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode ... Browse Code »

Introduce new function for generic inode syncing (vfs_fsync_range) and use
it from fsync() path. Introduce also new helper for syncing after a sync
write (generic_write_sync) using the generic function.

Use these new helpers for syncing from generic VFS functions. This makes
O_SYNC writes to block devices acquire i_mutex for syncing. If we really
care about this, we can make block_fsync() drop the i_mutex and reacquire
it before it returns.

CC: Evgeniy Polyakov
CC: ocfs2-devel@oss.oracle.com
CC: Joel Becker
CC: Felix Blyakher
CC: xfs@oss.sgi.com
CC: Anton Altaparmakov
CC: linux-ntfs-dev@lists.sourceforge.net
CC: OGAWA Hirofumi
CC: linux-ext4@vger.kernel.org
CC: tytso@mit.edu
Acked-by: Christoph Hellwig
Signed-off-by: Jan Kara

Jan Kara
2009-09-14 23:08:15 +0800
eef993806 vfs: Rename generic_file_aio_write_nolock ... Browse Code »

generic_file_aio_write_nolock() is now used only by block devices and raw
character device. Filesystems should use __generic_file_aio_write() in case
generic_file_aio_write() doesn't suit them. So rename the function to
blkdev_aio_write() and move it to fs/blockdev.c.

Signed-off-by: Christoph Hellwig
Signed-off-by: Jan Kara

Christoph Hellwig
2009-09-14 23:08:15 +0800
c7b50db21 vfs: Remove syncing from generic_file_direct_write() and generic_file_buffered_write() ... Browse Code »

generic_file_direct_write() and generic_file_buffered_write() called
generic_osync_inode() if it was called on O_SYNC file or IS_SYNC inode. But
this is superfluous since generic_file_aio_write() does the syncing as well.
Also XFS and OCFS2 which call these functions directly handle syncing
themselves. So let's have a single place where syncing happens:
generic_file_aio_write().

We slightly change the behavior by syncing only the range of file to which the
write happened for buffered writes but that should be all that is required.

CC: ocfs2-devel@oss.oracle.com
CC: Joel Becker
CC: Felix Blyakher
CC: xfs@oss.sgi.com
Signed-off-by: Jan Kara

Jan Kara
2009-09-14 23:08:15 +0800
e4dd9de3c vfs: Export __generic_file_aio_write() and add some comments ... Browse Code »

Rename __generic_file_aio_write_nolock() to __generic_file_aio_write(), add
comments to write helpers explaining how they should be used and export
__generic_file_aio_write() since it will be used by some filesystems.

CC: ocfs2-devel@oss.oracle.com
CC: Joel Becker
Acked-by: Evgeniy Polyakov
Reviewed-by: Christoph Hellwig
Signed-off-by: Jan Kara

Jan Kara
2009-09-14 23:08:14 +0800
d3bccb6f4 vfs: Introduce filemap_fdatawait_range ... Browse Code »

This simple helper saves some filesystems conversion from byte offset
to page numbers and also makes the fdata* interface more complete.

Reviewed-by: Christoph Hellwig
Signed-off-by: Jan Kara

Jan Kara
2009-09-14 23:08:14 +0800

07 Jul, 2009

1 commit

c8236db9c mm: mark page accessed before we write_end() ... Browse Code »

In testing a backport of the write_begin/write_end AOPs, a 10% re-read
regression was noticed when running iozone. This regression was
introduced because the old AOPs would always do a mark_page_accessed(page)
after the commit_write, but when the new AOPs where introduced, the only
place this was kept was in pagecache_write_end().

This patch does the same thing in the generic case as what is done in
pagecache_write_end(), which is just to mark the page accessed before we
do write_end().

Signed-off-by: Josef Bacik
Acked-by: Nick Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Josef Bacik
2009-07-07 04:57:03 +0800