09 Oct, 2012

2 commits

  • .fault can now retry, and the retry can break .fault's state machine.
    In filemap_fault, if the page is missed, ra->mmap_miss is increased;
    on the second try, since the page is now in the page cache,
    ra->mmap_miss is decreased. Both happen within a single fault, so we
    can't detect random mmap file access.

    Add a new flag to indicate that .fault has already been tried once.
    On the second try, skip the ra->mmap_miss decrease; the filemap_fault
    state machine is fine with that.

    I only tested x86, not other archs, but the change for the other
    archs looks obvious, though who knows :)
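
    In filemap_fault terms the fix is roughly this (a sketch, assuming
    the new bit is named FAULT_FLAG_TRIED in vmf->flags; not the literal
    patch):

        /* only count a cache hit against mmap_miss on the first try */
        if (!(vmf->flags & FAULT_FLAG_TRIED) && ra->mmap_miss > 0)
                ra->mmap_miss--;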

    Signed-off-by: Shaohua Li
    Cc: Rik van Riel
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • Move actual pte filling for non-linear file mappings into the new special
    vma operation: ->remap_pages().

    Filesystems must implement this method to get non-linear mapping
    support; a filesystem that uses filemap_fault() can simply use
    generic_file_remap_pages().

    Device drivers can now implement this method too and obtain
    non-linear vma support.
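
    For a filemap_fault()-based filesystem the wiring looks roughly like
    this (a sketch of the pattern, not any particular filesystem):

        static const struct vm_operations_struct example_file_vm_ops = {
                .fault        = filemap_fault,
                .page_mkwrite = filemap_page_mkwrite,
                .remap_pages  = generic_file_remap_pages, /* non-linear */
        };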

    Signed-off-by: Konstantin Khlebnikov
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Chris Metcalf #arch/tile
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: H. Peter Anvin
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Matt Helsley
    Cc: Nick Piggin
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Suresh Siddha
    Cc: Tetsuo Handa
    Cc: Venkatesh Pallipadi
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

09 Aug, 2012

2 commits

  • Move unplugging for direct I/O from around ->direct_IO() down to
    do_blockdev_direct_IO(). This implicitly adds plugging for direct
    writes.
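
    The plugging around the submission is the stock block-layer pattern,
    roughly (sketch):

        struct blk_plug plug;

        blk_start_plug(&plug);
        /* ... build and submit the direct-I/O bios ... */
        blk_finish_plug(&plug); /* flush the batched requests */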

    CC: Li Shaohua
    Acked-by: Jeff Moyer
    Signed-off-by: Wu Fengguang
    Signed-off-by: Jens Axboe

    Fengguang Wu
     
  • Buffered write(2) is not directly tied to I/O, so
    generic_file_aio_write() is not a suitable place to handle plugging.

    Note that plugging for O_SYNC writes is also removed. The user may pass
    arbitrary @size arguments, which may be much larger than the preferable
    I/O size, or may cross extent/device boundaries. Let the lower layers
    handle the plugging. The plugging code here actually turns them into
    no-ops.

    CC: Li Shaohua
    Signed-off-by: Wu Fengguang
    Signed-off-by: Jens Axboe

    Fengguang Wu
     

31 Jul, 2012

2 commits

  • There are several entry points which dirty pages in a filesystem:
    mmap (handled by block_page_mkwrite()), buffered write (handled by
    __generic_file_aio_write()), splice write (generic_file_splice_write()),
    and truncate and fallocate (which can dirty the last partial page;
    handled inside each filesystem separately). Protect these places with
    sb_start_write() and sb_end_write().

    ->page_mkwrite() calls are particularly complex since they are called with
    mmap_sem held and thus we cannot use standard sb_start_write() due to lock
    ordering constraints. We solve the problem by using a special freeze protection
    sb_start_pagefault() which ranks below mmap_sem.
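
    A ->page_mkwrite handler under this scheme ends up shaped roughly
    like the following (a sketch; the truncate-race check and error
    handling are elided):

        int filemap_page_mkwrite(struct vm_area_struct *vma,
                                 struct vm_fault *vmf)
        {
                struct inode *inode = vma->vm_file->f_mapping->host;
                int ret = VM_FAULT_LOCKED;

                /* freeze protection that ranks below mmap_sem */
                sb_start_pagefault(inode->i_sb);
                file_update_time(vma->vm_file);
                lock_page(vmf->page);
                /* ... verify the page is still attached to this mapping ... */
                sb_end_pagefault(inode->i_sb);
                return ret;
        }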

    BugLink: https://bugs.launchpad.net/bugs/897421
    Tested-by: Kamal Mostafa
    Tested-by: Peter M. Petrakis
    Tested-by: Dann Frazier
    Tested-by: Massimo Morana
    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     
  • Make the default vm_ops provide a ->page_mkwrite handler. Currently
    it only updates the file's modification time and returns a locked
    page, but later it will also handle filesystem freezing.

    BugLink: https://bugs.launchpad.net/bugs/897421
    Tested-by: Kamal Mostafa
    Tested-by: Peter M. Petrakis
    Tested-by: Dann Frazier
    Tested-by: Massimo Morana
    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     

02 Jun, 2012

2 commits

  • Pull vfs changes from Al Viro.
    "A lot of misc stuff. The obvious groups:
    * Miklos' atomic_open series; kills the damn abuse of
    ->d_revalidate() by NFS, which was the major stumbling block for
    all work in that area.
    * ripping security_file_mmap() and dealing with deadlocks in the
    area; sanitizing the neighborhood of vm_mmap()/vm_munmap() in
    general.
    * ->encode_fh() switched to saner API; insane fake dentry in
    mm/cleancache.c gone.
    * assorted annotations in fs (endianness, __user)
    * parts of Artem's ->s_dirty work (jffs2 and reiserfs parts)
    * ->update_time() work from Josef.
    * other bits and pieces all over the place.

    Normally it would've been in two or three pull requests, but
    signal.git stuff had eaten a lot of time during this cycle ;-/"

    Fix up trivial conflicts in Documentation/filesystems/vfs.txt (the
    'truncate_range' inode method was removed by the VM changes, the VFS
    update adds an 'update_time()' method), and in fs/btrfs/ulist.[ch] (due
    to sparse fix added twice, with other changes nearby).

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (95 commits)
    nfs: don't open in ->d_revalidate
    vfs: retry last component if opening stale dentry
    vfs: nameidata_to_filp(): don't throw away file on error
    vfs: nameidata_to_filp(): inline __dentry_open()
    vfs: do_dentry_open(): don't put filp
    vfs: split __dentry_open()
    vfs: do_last() common post lookup
    vfs: do_last(): add audit_inode before open
    vfs: do_last(): only return EISDIR for O_CREAT
    vfs: do_last(): check LOOKUP_DIRECTORY
    vfs: do_last(): make ENOENT exit RCU safe
    vfs: make follow_link check RCU safe
    vfs: do_last(): use inode variable
    vfs: do_last(): inline walk_component()
    vfs: do_last(): make exit RCU safe
    vfs: split do_lookup()
    Btrfs: move over to use ->update_time
    fs: introduce inode operation ->update_time
    reiserfs: get rid of resierfs_sync_super
    reiserfs: mark the superblock as dirty a bit later
    ...

    Linus Torvalds
     
  • Btrfs has to make sure we have space to allocate new blocks in order
    to modify the inode, so updating the time can fail. We've gotten
    around this by having our own file_update_time, but this is kind of a
    pain, and Christoph has indicated he would like to make xfs do
    something different with atime updates. So introduce ->update_time,
    where we will deal with i_version and a/m/c time updates and indicate
    which changes need to be made. The normal version just does what it
    has always done (updates the time and marks the inode dirty), and
    filesystems can then choose to do something different.

    I've gone through all of the users of file_update_time and made them
    check for errors, with the exception of the fault code, since it's
    complicated and I wasn't quite sure what to do there. Also, Jan is
    going to be pushing the file time updates into page_mkwrite for those
    who have it, so that should satisfy btrfs and make it not a big deal
    to check the file_update_time() return code in the generic fault
    path. Thanks,
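
    The shape of the new operation is roughly (a sketch; the S_* flags
    are the ones file_update_time already computes):

        /* in struct inode_operations */
        int (*update_time)(struct inode *inode, struct timespec *time,
                           int flags);

        /* flags is a mask of S_ATIME | S_MTIME | S_CTIME | S_VERSION,
         * telling the filesystem which fields need updating; the
         * default behaviour sets the times and marks the inode dirty,
         * while e.g. btrfs may return an error if it cannot allocate
         * space for the update. */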

    Signed-off-by: Josef Bacik

    Josef Bacik
     

29 Mar, 2012

1 commit

  • Replace radix_tree_gang_lookup_slot() and
    radix_tree_gang_lookup_tag_slot() in the page-cache lookup functions
    with the brand-new direct radix-tree iterating. This avoids the
    double scanning and pointer copying.

    The iterator doesn't stop after nr_pages page-get failures in a row;
    it continues the lookup until the end of the radix tree. Thus we can
    safely remove those restart conditions.

    Unfortunately, the old implementation didn't forbid nr_pages == 0.
    This corner case does not fit into the new code, so the patch adds an
    extra check at the beginning.
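
    The new lookup loop is roughly this (a sketch of the iterator
    pattern, not the full find_get_pages()):

        struct radix_tree_iter iter;
        void **slot;
        unsigned ret = 0;

        if (unlikely(!nr_pages)) /* the new corner-case check */
                return 0;

        rcu_read_lock();
        radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
                struct page *page = radix_tree_deref_slot(slot);
                if (unlikely(!page))
                        continue;
                /* ... handle deref-retry and take a page reference ... */
                pages[ret] = page;
                if (++ret == nr_pages)
                        break;
        }
        rcu_read_unlock();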

    Signed-off-by: Konstantin Khlebnikov
    Tested-by: Hugh Dickins
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

23 Mar, 2012

1 commit

  • Pull cleancache changes from Konrad Rzeszutek Wilk:
    "This has some patches for the cleancache API that should have been
    submitted a _long_ time ago. They are basically cleanups:

    - rename of flush to invalidate

    - moving reporting of statistics into debugfs

    - use __read_mostly as necessary.

    Oh, and also the MAINTAINERS file change. The files (except the
    MAINTAINERS file) have been in #linux-next for months now. The late
    addition of MAINTAINERS file is a brain-fart on my side - didn't
    realize I needed that just until I was typing this up - and I based
    that patch on v3.3 - so the tree is on top of v3.3."

    * tag 'stable/for-linus-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/mm:
    MAINTAINERS: Adding cleancache API to the list.
    mm: cleancache: Use __read_mostly as appropiate.
    mm: cleancache: report statistics via debugfs instead of sysfs.
    mm: zcache/tmem/cleancache: s/flush/invalidate/
    mm: cleancache: s/flush/invalidate/

    Linus Torvalds
     

22 Mar, 2012

3 commits

  • Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when
    changing cpuset's mems") wins a super prize for the largest number of
    memory barriers entered into fast paths for one commit.

    [get|put]_mems_allowed is incredibly heavy, with pairs of full memory
    barriers inserted into a number of hot paths. This was detected while
    investigating a large page allocator slowdown introduced some time
    after 2.6.32. The largest portion of this overhead was shown by
    oprofile to be at an mfence introduced by this commit into the page
    allocator hot path.

    For extra style points, the commit introduced the use of yield() in an
    implementation of what looks like a spinning mutex.

    This patch replaces the full memory barriers on both read and write
    sides with a sequence counter with just read barriers on the fast path
    side. This is much cheaper on some architectures, including x86. The
    main bulk of the patch is the retry logic if the nodemask changes in a
    manner that can cause a false failure.

    While updating the nodemask, a check is made to see if a false failure
    is a risk. If it is, the sequence number gets bumped and parallel
    allocators will briefly stall while the nodemask update takes place.
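
    The allocator fast path then becomes a read-side retry loop, roughly
    (a sketch using the get/put_mems_allowed pair described above):

        unsigned int cpuset_mems_cookie;
        struct page *page;

        do {
                cpuset_mems_cookie = get_mems_allowed(); /* seqcount begin */
                page = __alloc_pages_nodemask(gfp_mask, order,
                                              zonelist, nodemask);
                /* retry only if the allocation failed and the nodemask
                 * changed underneath us (seqcount retry) */
        } while (unlikely(!page && !put_mems_allowed(cpuset_mems_cookie)));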

    In a page fault test microbenchmark, oprofile samples from
    __alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The
    actual results were

                               3.3.0-rc3          3.3.0-rc3
                             rc3-vanilla     nobarrier-v2r1
    Clients 1 UserTime          0.07 (  0.00%)     0.08 (-14.19%)
    Clients 2 UserTime          0.07 (  0.00%)     0.07 (  2.72%)
    Clients 4 UserTime          0.08 (  0.00%)     0.07 (  3.29%)
    Clients 1 SysTime           0.70 (  0.00%)     0.65 (  6.65%)
    Clients 2 SysTime           0.85 (  0.00%)     0.82 (  3.65%)
    Clients 4 SysTime           1.41 (  0.00%)     1.41 (  0.32%)
    Clients 1 WallTime          0.77 (  0.00%)     0.74 (  4.19%)
    Clients 2 WallTime          0.47 (  0.00%)     0.45 (  3.73%)
    Clients 4 WallTime          0.38 (  0.00%)     0.37 (  1.58%)
    Clients 1 Flt/sec/cpu  497620.28 (  0.00%)  520294.53 (  4.56%)
    Clients 2 Flt/sec/cpu  414639.05 (  0.00%)  429882.01 (  3.68%)
    Clients 4 Flt/sec/cpu  257959.16 (  0.00%)  258761.48 (  0.31%)
    Clients 1 Flt/sec      495161.39 (  0.00%)  517292.87 (  4.47%)
    Clients 2 Flt/sec      820325.95 (  0.00%)  850289.77 (  3.65%)
    Clients 4 Flt/sec     1020068.93 (  0.00%) 1022674.06 (  0.26%)

    MMTests Statistics: duration
    Sys Time Running Test (seconds)        135.68    132.17
    User+Sys Time Running Test (seconds)   164.2     160.13
    Total Elapsed Time (seconds)           123.46    120.87

    The overall improvement is small but the System CPU time is much
    improved and roughly in correlation to what oprofile reported (these
    performance figures are without profiling so skew is expected). The
    actual number of page faults is noticeably improved.

    For benchmarks like kernel builds, the overall benefit is marginal but
    the system CPU time is slightly reduced.

    To test the actual bug the commit fixed I opened two terminals. The
    first ran within a cpuset and continually ran a small program that
    faulted 100M of anonymous data. In a second window, the nodemask of the
    cpuset was continually randomised in a loop.

    Without the commit, the program would fail every so often (usually
    within 10 seconds); with the commit, everything obviously worked
    fine. With this patch applied it also worked fine, so the fix should
    be functionally equivalent.

    Signed-off-by: Mel Gorman
    Cc: Miao Xie
    Cc: David Rientjes
    Cc: Peter Zijlstra
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When i_mmap_lock changed to a mutex the locking order in memory failure
    was changed to take the sleeping lock first. But the big fat mm lock
    ordering comment (BFMLO) wasn't updated. Do this here.

    Pointed out by Andrew.

    Signed-off-by: Andi Kleen
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • There is not much point in skipping zones during allocation based on
    dirty usage to which they'll never contribute. And we'd like to avoid
    page-reclaim waits when writing to ramfs/sysfs etc.

    Signed-off-by: Fengguang Wu
    Acked-by: Johannes Weiner
    Cc: Jan Kara
    Cc: Greg Thelen
    Cc: Ying Han
    Cc: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Cc: Mel Gorman
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     

04 Feb, 2012

1 commit

  • Herbert Poetzl reported a performance regression since 2.6.39. The
    test is a simple dd read, but with a big block size. The reason:

    T1: ra (A, A+128k), (A+128k, A+256k)
    T2: lock_page for page A, submit the 256k
    T3: hit page A+128k, ra (A+256k, A+384k). The range isn't submitted
        because of the plug, and there isn't any lock_page until we hit
        page A+256k, because all pages from A to A+256k are in memory.
    T4: hit page A+256k, ra (A+384k, A+512k). Because of the plug, the
        range isn't submitted again.
    T5: lock_page A+256k, so (A+256k, A+512k) will be submitted. The task
        is waiting for (A+256k, A+512k) to finish.

    There is no request to disk in T3 and T4, so the readahead pipeline
    breaks.

    We really don't need the block plug in generic_file_aio_read() for
    buffered I/O. Readahead already has a plug and fine-grained control
    over when I/O should be submitted. Deleting the plug for buffered I/O
    fixes the regression.

    One side effect is that with the plug the request size is 256k, while
    without it it is 128k. That is because the default ra size is 128k,
    and is not a reason we need the plug here.

    Vivek said:

    : We submit some readahead IO to the device request queue, but
    : because of the nested plug, the queue never gets unplugged. When
    : the read logic reaches a page which is not in the page cache, it
    : waits for the page to be read from disk (lock_page_killable()) and
    : at that time we flush the plug list.
    :
    : So effectively the readahead logic is kind of broken in parts
    : because of the nested plugging. Removing the top-level plug
    : (generic_file_aio_read()) for buffered reads will allow unplugging
    : the queue earlier for readahead.

    Signed-off-by: Shaohua Li
    Signed-off-by: Wu Fengguang
    Reported-by: Herbert Poetzl
    Tested-by: Eric Dumazet
    Cc: Christoph Hellwig
    Cc: Jens Axboe
    Cc: Vivek Goyal
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     

24 Jan, 2012

1 commit

  • Per akpm's suggestion, alter the use of the term "flush" to
    "invalidate". The next patch will do this across all of MM.

    This change is completely cosmetic.

    [v9: akpm@linux-foundation.org: change "flush" to "invalidate", part 3]

    Signed-off-by: Dan Magenheimer
    Cc: Kamezawa Hiroyuki
    Cc: Jan Beulich
    Reviewed-by: Seth Jennings
    Cc: Jeremy Fitzhardinge
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Nitin Gupta
    Cc: Matthew Wilcox
    Cc: Chris Mason
    Cc: Rik Riel
    Cc: Andrew Morton
    [v10: Fixed fs: move code out of buffer.c conflict change]
    Signed-off-by: Konrad Rzeszutek Wilk

    Dan Magenheimer
     

13 Jan, 2012

1 commit

  • Commit ef6a3c6311 ("mm: add replace_page_cache_page() function")
    added a function replace_page_cache_page(), which replaces a page in
    the radix-tree with a new page. When doing this, the memory cgroup
    needs to fix up the accounting information: memcg needs to check the
    PCG_USED bit etc.

    In some (many?) cases, 'newpage' is on the LRU before
    replace_page_cache() is called. So memcg's LRU accounting information
    should be fixed too.

    This patch adds mem_cgroup_replace_page_cache() and removes the old
    hooks. In that function, the old page is unaccounted without touching
    res_counter, and the new page is accounted to the memcg (of the old
    page). When overwriting pc->mem_cgroup of the new page, take
    zone->lru_lock to avoid races with LRU handling.

    Background:
    replace_page_cache_page() is called by the FUSE code in its splice()
    handling. Here, 'newpage' replaces 'oldpage', but the new page is not
    newly allocated and may be on the LRU. LRU mis-accounting is critical
    for the memory cgroup, because rmdir() checks that the whole LRU is
    empty and there is no account leak. If a page is on a different LRU
    than it should be, rmdir() will fail.

    This bug was added in March 2011, but there has been no bug report
    yet. I guess there are not many people who use memcg and FUSE at the
    same time with upstream kernels.

    The result of this bug is that an admin cannot destroy a memcg
    because of the account leak. So: no panic, no deadlock. And even if
    an active cgroup exists, umount can succeed, so there is no problem
    at shutdown.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Miklos Szeredi
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

11 Jan, 2012

1 commit

  • Tell the page allocator that pages allocated through
    grab_cache_page_write_begin() are expected to become dirty soon.
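
    Concretely, the hint is a gfp flag on the page-cache allocation,
    roughly (sketch):

        /* in grab_cache_page_write_begin(), roughly */
        gfp_t gfp_mask = mapping_gfp_mask(mapping) | __GFP_WRITE;

        page = __page_cache_alloc(gfp_mask); /* allocator may prefer
                                                zones below their dirty
                                                limits */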

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Acked-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Hellwig
    Cc: Wu Fengguang
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Shaohua Li
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

02 Dec, 2011

1 commit

  • Currently write(2) to a file is not interruptible by any signal.
    Sometimes this is desirable, e.g. when you want to quickly kill a
    process hogging your disk. Also, with commit 499d05ecf990 ("mm: Make
    task in balance_dirty_pages() killable"), it's necessary to abort the
    current write accordingly, to avoid it quickly dirtying lots more
    pages at an unthrottled rate.

    This patch makes write interruptible by SIGKILL. We do not allow
    write to be interruptible by any other signal, because that has a
    larger potential of breaking some badly written applications.
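
    The check sits in the buffered write loop, roughly (a sketch of the
    pattern):

        /* once per iteration of generic_perform_write()'s copy loop */
        if (fatal_signal_pending(current)) {
                status = -EINTR; /* only SIGKILL gets through */
                break;
        }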

    Reported-by: Kazuya Mio
    Tested-by: Kazuya Mio
    Acked-by: Matthew Wilcox
    Signed-off-by: Jan Kara
    Signed-off-by: Wu Fengguang

    Jan Kara
     

28 Oct, 2011

1 commit

  • Currently, when you call iov_iter_advance, the pointer to the iovec
    array can be incremented, but the nr_segs value in the iov_iter
    struct is not decremented. The result is an iov_iter struct with an
    nr_segs value that goes beyond the end of the array.

    While I'm not aware of anything that's specifically broken by this, it
    seems odd and a bit dangerous not to decrement that value. If someone
    were to trust the nr_segs value to be correct, then they could end up
    walking off the end of the array.

    Changing this might also provide some micro-optimization when dealing
    with the last iovec in an array. Many of the other routines that deal
    with iov_iter have optimized codepaths when nr_segs == 1.
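
    The fix amounts to keeping nr_segs in step when the iovec pointer is
    advanced, roughly (a sketch; 'iov' is the advanced pointer and 'base'
    the offset into the new first segment):

        i->nr_segs -= iov - i->iov; /* drop fully-consumed segments */
        i->iov = iov;
        i->iov_offset = base;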

    Cc: Nick Piggin
    Signed-off-by: Jeff Layton
    Signed-off-by: Christoph Hellwig

    Jeff Layton
     

15 Sep, 2011

1 commit

  • The entries found by find_get_pages() could all be swap entries. In
    that case we skip the entries, but make sure the skipped entries are
    accounted, so we don't keep looping.

    Use nr_found > nr_skip to simplify the code, as suggested by Eric.

    Reported-and-tested-by: Eric Dumazet
    Signed-off-by: Shaohua Li
    Signed-off-by: Linus Torvalds

    Shaohua Li
     

04 Aug, 2011

4 commits

  • Make the radix_tree exceptional cases, mostly in filemap.c, clearer.

    It's hard to devise a suitable snappy name that illuminates the use by
    shmem/tmpfs for swap, while keeping filemap/pagecache/radix_tree
    generality. And akpm points out that /* radix_tree_deref_retry(page) */
    comments look like calls that have been commented out for unknown
    reason.

    Skirt the naming difficulty by rearranging these blocks to handle the
    transient radix_tree_deref_retry(page) case first; then just explain the
    remaining shmem/tmpfs swap case in a comment.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Remove PageSwapBacked (!page_is_file_cache) cases from
    add_to_page_cache_locked() and add_to_page_cache_lru(): those pages now
    go through shmem_add_to_page_cache().

    Remove a comment on maximum tmpfs size from fsstack_copy_inode_size(),
    and add a comment on swap entries to invalidate_mapping_pages().

    And mincore_page() uses find_get_page() on what might be shmem or a
    tmpfs file: allow for a radix_tree_exceptional_entry(), and proceed to
    find_get_page() on swapper_space if so (oh, swapper_space needs #ifdef).

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • If swap entries are to be stored along with struct page pointers in a
    radix tree, they need to be distinguished as exceptional entries.

    Most of the handling of swap entries in radix tree will be contained in
    shmem.c, but a few functions in filemap.c's common code need to check
    for their appearance: find_get_page(), find_lock_page(),
    find_get_pages() and find_get_pages_contig().

    So as not to slow their fast paths, tuck those checks inside the
    existing checks for unlikely radix_tree_deref_slot(); except for
    find_lock_page(), where it is an added test. And make it a BUG in
    find_get_pages_tag(), which is not applied to tmpfs files.

    A part of the reason for eliminating shmem_readpage() earlier was to
    minimize the places where common code would need to allow for swap
    entries.

    The swp_entry_t known to swapfile.c must be massaged into a slightly
    different form when stored in the radix tree, just as it gets massaged
    into a pte_t when stored in page tables.

    In an i386 kernel this limits its information (type and page offset) to
    30 bits: given 32 "types" of swapfile and 4kB pagesize, that's a maximum
    swapfile size of 128GB. Which is less than the 512GB we previously
    allowed with X86_PAE (where the swap entry can occupy the entire upper
    32 bits of a pte_t), but not a new limitation on 32-bit without PAE; and
    there's not a new limitation on 64-bit (where swap filesize is already
    limited to 16TB by a 32-bit page offset). Thirty areas of 128GB is
    probably still enough swap for a 64GB 32-bit machine.

    Provide swp_to_radix_entry() and radix_to_swp_entry() conversions, and
    enforce filesize limit in read_swap_header(), just as for ptes.
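
    The conversions are plain shifts against the exceptional-entry tag
    bits, roughly (a sketch; constant names from the radix-tree
    exceptional-entry patch):

        static inline void *swp_to_radix_entry(swp_entry_t entry)
        {
                unsigned long value;

                value = entry.val << RADIX_TREE_EXCEPTIONAL_SHIFT;
                return (void *)(value | RADIX_TREE_EXCEPTIONAL_ENTRY);
        }

        static inline swp_entry_t radix_to_swp_entry(void *arg)
        {
                swp_entry_t entry;

                entry.val = (unsigned long)arg >> RADIX_TREE_EXCEPTIONAL_SHIFT;
                return entry;
        }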

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • A patchset to extend tmpfs to MAX_LFS_FILESIZE by abandoning its
    peculiar swap vector, instead keeping a file's swap entries in the same
    radix tree as its struct page pointers: thus saving memory, and
    simplifying its code and locking.

    This patch:

    The radix_tree is used by several subsystems for different purposes. A
    major use is to store the struct page pointers of a file's pagecache for
    memory management. But what if mm wanted to store something other than
    page pointers there too?

    The low bit of a radix_tree entry is already used to denote an indirect
    pointer, for internal use, and the unlikely radix_tree_deref_retry()
    case.

    Define the next bit as denoting an exceptional entry, and supply inline
    functions radix_tree_exception() to return non-0 in either unlikely
    case, and radix_tree_exceptional_entry() to return non-0 in the second
    case.

    If a subsystem already uses radix_tree with that bit set, no problem: it
    does not affect internal workings at all, but is defined for the
    convenience of those storing well-aligned pointers in the radix_tree.
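
    A sketch of the two predicates (assuming the low two bits are the
    indirect-pointer and exceptional-entry tags respectively):

        /* non-0 in either unlikely case: indirect or exceptional */
        static inline int radix_tree_exception(void *arg)
        {
                return unlikely((unsigned long)arg &
                        (RADIX_TREE_INDIRECT_PTR |
                         RADIX_TREE_EXCEPTIONAL_ENTRY));
        }

        /* non-0 only for an exceptional entry */
        static inline int radix_tree_exceptional_entry(void *arg)
        {
                return (unsigned long)arg & RADIX_TREE_EXCEPTIONAL_ENTRY;
        }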

    The radix_tree_gang_lookups have an implicit assumption that the caller
    can deduce the offset of each entry returned e.g. by the page->index of
    a struct page. But that may not be feasible for some kinds of item to
    be stored there.

    radix_tree_gang_lookup_slot() now allows for an optional indices
    argument, an output array in which to return those offsets. The same
    could be added to the other radix_tree_gang_lookups, but for now keep
    it to the only one for which we need it.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

27 Jul, 2011

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback: (27 commits)
    mm: properly reflect task dirty limits in dirty_exceeded logic
    writeback: don't busy retry writeback on new/freeing inodes
    writeback: scale IO chunk size up to half device bandwidth
    writeback: trace global_dirty_state
    writeback: introduce max-pause and pass-good dirty limits
    writeback: introduce smoothed global dirty limit
    writeback: consolidate variable names in balance_dirty_pages()
    writeback: show bdi write bandwidth in debugfs
    writeback: bdi write bandwidth estimation
    writeback: account per-bdi accumulated written pages
    writeback: make writeback_control.nr_to_write straight
    writeback: skip tmpfs early in balance_dirty_pages_ratelimited_nr()
    writeback: trace event writeback_queue_io
    writeback: trace event writeback_single_inode
    writeback: remove .nonblocking and .encountered_congestion
    writeback: remove writeback_control.more_io
    writeback: skip balance_dirty_pages() for in-memory fs
    writeback: add bdi_dirty_limit() kernel-doc
    writeback: avoid extra sync work at enqueue time
    writeback: elevate queue_io() into wb_writeback()
    ...

    Fix up trivial conflicts in fs/fs-writeback.c and mm/filemap.c

    Linus Torvalds
     

26 Jul, 2011

2 commits

  • Make the pagevec_lookup loops in truncate_inode_pages_range(),
    invalidate_mapping_pages() and invalidate_inode_pages2_range() more
    consistent with each other.

    They were relying upon page->index of an unlocked page, but apologizing
    for it: accept it, embrace it, add comments and WARN_ONs, and simplify the
    index handling.

    invalidate_inode_pages2_range() had special handling for a wrapped
    page->index + 1 = 0 case; but MAX_LFS_FILESIZE doesn't let us anywhere
    near there, and a corrupt page->index in the radix_tree could cause more
    trouble than that would catch. Remove that wrapped handling.

    invalidate_inode_pages2_range() uses min() to limit the pagevec_lookup
    when near the end of the range: copy that into the other two, although
    it's less useful than you might think (it limits the use of the buffer,
    rather than the indices looked up).

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The often-NULL data arg to the read_cache_page() and
    read_mapping_page() functions is misdescribed as "destination for
    read data": no, it's the first arg to the filler function, often a
    struct file * for ->readpage().

    Satisfy checkpatch.pl on those filler prototypes, and tidy up the
    declarations in linux/pagemap.h.
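
    For reference, the prototypes being tidied are roughly:

        /* the data arg is handed straight to the filler,
         * e.g. a struct file * for ->readpage() */
        typedef int filler_t(void *data, struct page *page);

        struct page *read_cache_page(struct address_space *mapping,
                                     pgoff_t index, filler_t *filler,
                                     void *data);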

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

21 Jul, 2011

1 commit

  • i_alloc_sem is a rather special rw_semaphore. It's the last one that
    may be released by a non-owner, and its write side is always mirrored
    by real exclusion. Its intended use is to wait for all pending direct
    I/O requests to finish before starting a truncate.

    Replace it with a hand-grown construct:

    - exclusion for truncates is already guaranteed by i_mutex, so it can
      simply fall away
    - the reader side is replaced by an i_dio_count member in struct
      inode that counts the number of pending direct I/O requests;
      truncate can't proceed as long as it's non-zero
    - when i_dio_count reaches zero we wake up a pending truncate using
      wake_up_bit on a new bit in i_flags
    - new references to i_dio_count can't appear while we are waiting for
      it to read zero, because the direct I/O count always needs i_mutex
      (or an equivalent like XFS's i_iolock) to start a new operation

    This scheme is much simpler, and saves the space of a spinlock_t and
    a struct list_head in struct inode (typically 160 bits on a non-debug
    64-bit system).
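
    A minimal sketch of the counting scheme (helper names assumed):

        /* direct I/O side */
        atomic_inc(&inode->i_dio_count);  /* request starts */
        /* ... */
        inode_dio_done(inode);  /* dec; wakes a waiting truncate at zero */

        /* truncate side */
        inode_dio_wait(inode);  /* sleeps until i_dio_count reads zero */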

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

08 Jun, 2011

1 commit

  • Split the global inode_wb_list_lock into a per-bdi_writeback list_lock,
    as it's currently the most contended lock in the system for metadata
    heavy workloads. It won't help for single-filesystem workloads for
    which we'll need the I/O-less balance_dirty_pages, but at least we
    can dedicate a cpu to spinning on each bdi now for larger systems.

    Based on earlier patches from Nick Piggin and Dave Chinner.

    It reduces lock contention to 1/4 in this test case:
    10 HDD JBOD, 100 dd on each disk, XFS, 6GB ram

    lock_stat version 0.3
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    class name con-bounces contentions waittime-min waittime-max waittime-total acq-bounces acquisitions holdtime-min holdtime-max holdtime-total
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    vanilla 2.6.39-rc3:
    inode_wb_list_lock: 42590 44433 0.12 147.74 144127.35 252274 886792 0.08 121.34 917211.23
    ------------------
    inode_wb_list_lock 2 [] bdev_inode_switch_bdi+0x29/0x85
    inode_wb_list_lock 34 [] inode_wb_list_del+0x22/0x49
    inode_wb_list_lock 12893 [] __mark_inode_dirty+0x170/0x1d0
    inode_wb_list_lock 10702 [] writeback_single_inode+0x16d/0x20a
    ------------------
    inode_wb_list_lock 2 [] bdev_inode_switch_bdi+0x29/0x85
    inode_wb_list_lock 19 [] inode_wb_list_del+0x22/0x49
    inode_wb_list_lock 5550 [] __mark_inode_dirty+0x170/0x1d0
    inode_wb_list_lock 8511 [] writeback_sb_inodes+0x10f/0x157

    2.6.39-rc3 + patch:
    &(&wb->list_lock)->rlock: 11383 11657 0.14 151.69 40429.51 90825 527918 0.11 145.90 556843.37
    ------------------------
    &(&wb->list_lock)->rlock 10 [] inode_wb_list_del+0x5f/0x86
    &(&wb->list_lock)->rlock 1493 [] writeback_inodes_wb+0x3d/0x150
    &(&wb->list_lock)->rlock 3652 [] writeback_sb_inodes+0x123/0x16f
    &(&wb->list_lock)->rlock 1412 [] writeback_single_inode+0x17f/0x223
    ------------------------
    &(&wb->list_lock)->rlock 3 [] bdi_lock_two+0x46/0x4b
    &(&wb->list_lock)->rlock 6 [] inode_wb_list_del+0x5f/0x86
    &(&wb->list_lock)->rlock 2061 [] __mark_inode_dirty+0x173/0x1cf
    &(&wb->list_lock)->rlock 2629 [] writeback_sb_inodes+0x123/0x16f

    [hughd@google.com: fix recursive lock when bdi_lock_two() is called
     with new the same as old]
    [akpm@linux-foundation.org: cleanup bdev_inode_switch_bdi() comment]

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Wu Fengguang

    Christoph Hellwig
     

04 Jun, 2011

1 commit

  • Caching "we have already removed suid/caps" was overenthusiastic as
    merged. On network filesystems we might have had suid/caps set on
    another client, silently picked up by this client on revalidate, all
    of that *without* clearing the S_NOSEC flag.

    AFAICS, the only reasonably sane way to deal with that is:
    * a new superblock flag; unless it is set, S_NOSEC is not going to be
      set.
    * local block filesystems set it in their ->mount() (more accurately,
      mount_bdev() does, as does btrfs's ->mount(); users of mount_bdev()
      other than local block filesystems clear it)
    * if any network filesystem (or a cluster one) wants to use S_NOSEC,
      it'll need to set MS_NOSEC in sb->s_flags *AND* take care to clear
      S_NOSEC when inode attribute changes are picked up from other
      clients.

    It's not an earth-shattering hole (anybody who can set suid on
    another client will almost certainly be able to write to the file
    before doing that anyway), but it's a bug that needs fixing.
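
    A sketch of the opt-in on the local-block side:

        /* in mount_bdev() (sketch): local block filesystems are safe */
        s->s_flags |= MS_NOSEC;

    A network or cluster filesystem would instead set MS_NOSEC
    explicitly, and must clear S_NOSEC whenever inode attributes are
    revalidated from another client.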

    Signed-off-by: Al Viro

    Al Viro
     

29 May, 2011

1 commit

  • Some recent benchmarking on btrfs showed that a major scaling
    bottleneck on large systems on btrfs is currently the xattr lookup on
    every write.

    Why an xattr lookup on every write, I hear you ask?

    write wants to drop suid and security-related xattrs that could set
    capabilities for executables. To do that it currently looks up
    security.capability on EVERY write (even for non-executables) to
    decide whether to drop it or not.

    In btrfs this causes an additional tree walk, hitting some
    per-filesystem locks and quite bad scalability. In a simple read
    workload on an 8S system I saw over 90% CPU time in spinlocks related
    to that.

    Chris Mason tells me this is also a problem in ext4, where it hits
    the global mbcache lock.

    This patch adds a simple per-inode flag to avoid this problem. We
    only do the lookup once per file, and then, if there is no xattr,
    cache the decision. All xattr changes clear the flag.

    I also used the same flag to avoid the suid check, although that one
    is pretty cheap.

    A file system can also set this flag when it creates the inode, if it
    has a cheap way to do so. This is done for some common file systems
    in follow-on patches.

    With this patch a major part of the lock contention disappears for
    btrfs. Some testing on smaller systems didn't show significant
    performance changes, but at least it helps the larger systems and is
    generally more efficient.
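
    The write-path check then looks roughly like this (a sketch; details
    elided):

        int file_remove_suid(struct file *file)
        {
                struct inode *inode = file->f_path.dentry->d_inode;

                if (IS_NOSEC(inode))  /* cached: nothing to drop */
                        return 0;

                /* ... do the suid/security.capability checks once ... */

                inode->i_flags |= S_NOSEC;  /* cache "nothing to drop" */
                return 0;
        }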

    v2: Rename is_sgid. add file system helper.
    Cc: chris.mason@oracle.com
    Cc: josef@redhat.com
    Cc: viro@zeniv.linux.org.uk
    Cc: agruen@linbit.com
    Cc: Serge E. Hallyn
    Signed-off-by: Andi Kleen
    Signed-off-by: Al Viro

    Andi Kleen
     

27 May, 2011

1 commit

  • Two new stats in per-memcg memory.stat track the number of page
    faults and the number of major page faults:

    "pgfault"
    "pgmajfault"

    They are different from the "pgpgin"/"pgpgout" stats, which count the
    number of pages charged/discharged to the cgroup and carry no meaning
    of reading/writing pages to disk.

    It is valuable to track the two stats both for measuring an
    application's performance and for the efficiency of the kernel page
    reclaim path. Counting page faults per process is useful, but we also
    need the aggregated value, since processes are monitored and
    controlled on a per-cgroup basis in memcg.
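
    The hook sits next to the global vm-event counting in the fault path,
    roughly (sketch):

        /* in handle_mm_fault(), roughly */
        count_vm_event(PGFAULT);
        mem_cgroup_count_vm_event(mm, PGFAULT);  /* per-memcg "pgfault" */

        /* PGMAJFAULT is bumped analogously where a major fault
         * (a real disk read) is detected, e.g. in filemap_fault() */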

    Functional test: check the total number of pgfault/pgmajfault of all
    memcgs and compare with global vmstat value:

    $ cat /proc/vmstat | grep fault
    pgfault 1070751
    pgmajfault 553

    $ cat /dev/cgroup/memory.stat | grep fault
    pgfault 1071138
    pgmajfault 553
    total_pgfault 1071142
    total_pgmajfault 553

    $ cat /dev/cgroup/A/memory.stat | grep fault
    pgfault 199
    pgmajfault 0
    total_pgfault 199
    total_pgmajfault 0

    Performance test: run the page fault test (pft) with 16 threads,
    faulting in 15G of anon pages in a 16G container. No regression was
    noticed in "flt/cpu/s".

    Sample output from pft:

    TAG pft:anon-sys-default:
      Gb  Thr  CLine  User    System    Wall     flt/cpu/s   fault/wsec
      15  16   1      0.67s   233.41s   14.76s   16798.546   266356.260

    +-------------------------------------------------------------------------+
        N          Min          Max       Median          Avg     Stddev
    x  10    16682.962    17344.027    16913.524    16928.812   166.5362
    +  10    16695.568    16923.896    16820.604    16824.652  84.816568
    No difference proven at 95.0% confidence

    [akpm@linux-foundation.org: fix build]
    [hughd@google.com: shmem fix]
    Signed-off-by: Ying Han
    Acked-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Cc: Daisuke Nishimura
    Acked-by: Balbir Singh
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han