27 Aug, 2016
1 commit
-
For DAX inodes we need to be careful to never have page cache pages in
the mapping->page_tree. This radix tree should be composed only of DAX
exceptional entries and zero pages.

ltp's readahead02 test was triggering a warning because we were trying
to insert a DAX exceptional entry but found that a page cache page had
already been inserted into the tree. This page was being inserted into
the radix tree in response to a readahead(2) call.

Readahead doesn't make sense for DAX inodes, but we don't want it to
report a failure either.  Instead, we just return success and don't do
any work.
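A sketch of the shape of the fix (dax_mapping() is the existing test for
a DAX mapping; treat the exact placement as an assumption):

    /* in do_readahead(), before any page cache pages can be inserted */
    if (dax_mapping(mapping))
        return 0;    /* report success, but do no readahead work */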
Link: http://lkml.kernel.org/r/20160824221429.21158-1-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler
Reported-by: Jeff Moyer
Cc: Dan Williams
Cc: Dave Chinner
Cc: Dave Hansen
Cc: Jan Kara
Cc: [4.5+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
27 Jul, 2016
1 commit
-
Vladimir has noticed that we might declare memcg oom even during
readahead because read_pages only uses GFP_KERNEL (with mapping_gfp
restriction) while __do_page_cache_readahead uses
page_cache_alloc_readahead which adds __GFP_NORETRY to prevent OOMs.
This gfp mask discrepancy is really unfortunate and easily
fixable. Drop page_cache_alloc_readahead() which only has one user and
outsource the gfp_mask logic into readahead_gfp_mask and propagate this
mask from __do_page_cache_readahead down to read_pages.

This alone would have only very limited impact as most filesystems are
implementing ->readpages and the common implementation mpage_readpages
does GFP_KERNEL (with mapping_gfp restriction) again. We can tell it to
use readahead_gfp_mask instead as this function is called only during
readahead as well.  The same applies to read_cache_pages.

ext4 has its own ext4_mpage_readpages, but the path which has pages !=
NULL can use the same gfp mask.  Btrfs, cifs, f2fs and orangefs are
doing a very similar pattern to mpage_readpages, so the same can be
applied to them as well.
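For reference, a sketch of the readahead_gfp_mask helper this patch
introduces (modeled on the page_cache_alloc_readahead() mask it
replaces; treat the exact flag set as an assumption):

    /* include/linux/pagemap.h */
    #define readahead_gfp_mask(x) \
        (mapping_gfp_mask(x) | __GFP_COLD | __GFP_NORETRY | __GFP_NOWARN)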
[akpm@linux-foundation.org: coding-style fixes]
[mhocko@suse.com: restrict gfp mask in mpage_alloc]
Link: http://lkml.kernel.org/r/20160610074223.GC32285@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/1465301556-26431-1-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko
Cc: Vladimir Davydov
Cc: Chris Mason
Cc: Steve French
Cc: Theodore Ts'o
Cc: Jan Kara
Cc: Mike Marshall
Cc: Jaegeuk Kim
Cc: Changman Lee
Cc: Chao Yu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
05 Apr, 2016
1 commit
-
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
ago with the promise that one day it would be possible to implement the
page cache with bigger chunks than PAGE_SIZE.

This promise never materialized, and it is unlikely it ever will.

We have many places where PAGE_CACHE_SIZE is assumed to be equal to
PAGE_SIZE, and it's a constant source of confusion whether a
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.

Let's stop pretending that pages in the page cache are special.  They
are not.

The changes are pretty straightforward:
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
the script below.  For some reason, coccinelle doesn't patch header
files; I've called spatch for them manually.

The only adjustment after coccinelle is a revert of the changes to the
PAGE_CACHE_ALIGN definition: we are going to drop it later.

There are a few places in the code that coccinelle didn't reach.  I'll
fix them manually in a separate patch.  Comments and documentation will
also be addressed in a separate patch.

virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov
Acked-by: Michal Hocko
Signed-off-by: Linus Torvalds
15 Jan, 2016
2 commits
-
Move lru_to_page() from internal.h to mm_inline.h.
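For reference, the macro being moved (as defined at the time):

    #define lru_to_page(head) (list_entry((head)->prev, struct page, lru))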
Signed-off-by: Geliang Tang
Acked-by: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
list_to_page() in readahead.c is the same as lru_to_page() in vmscan.c.
So I move lru_to_page to internal.h and drop list_to_page().

Signed-off-by: Geliang Tang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
07 Nov, 2015
1 commit
-
There are many places which use mapping_gfp_mask to restrict a more
generic gfp mask which would be used for allocations that are not
directly related to the page cache but are performed in the same
context.

Let's introduce a helper function which makes the restriction explicit
and easier to track.  This patch doesn't introduce any functional
changes.
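A sketch of the helper, mapping_gfp_constraint(), matching its
description (treat the exact form as an assumption):

    static inline gfp_t mapping_gfp_constraint(struct address_space *mapping,
                                               gfp_t gfp_mask)
    {
        return mapping_gfp_mask(mapping) & gfp_mask;
    }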
[akpm@linux-foundation.org: coding-style fixes]

Signed-off-by: Michal Hocko
Suggested-by: Andrew Morton
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
06 Nov, 2015
1 commit
-
Maximal readahead size is limited now by two values:
1) by the global 2Mb constant (MAX_READAHEAD in max_sane_readahead())
2) by the configurable per-device value* (bdi->ra_pages)

There are devices which require a custom readahead limit.  For
instance, for RAIDs it's calculated as the number of devices multiplied
by chunk size times 2.

Readahead size can never be larger than the bdi->ra_pages * 2 value
(POSIX_FADV_SEQUENTIAL doubles the readahead size).

If so, why do we need two limits?  I suggest completely removing this
max_sane_readahead() stuff and using the per-device readahead limit
everywhere (see the sketch after the benchmark below).

Also, using the right readahead size for RAID disks can significantly
increase i/o performance:

before:
dd if=/dev/md2 of=/dev/null bs=100M count=100
100+0 records in
100+0 records out
10485760000 bytes (10 GB) copied, 12.9741 s, 808 MB/s

after:
$ dd if=/dev/md2 of=/dev/null bs=100M count=100
100+0 records in
100+0 records out
10485760000 bytes (10 GB) copied, 8.91317 s, 1.2 GB/s(It's an 8-disks RAID5 storage).
This patch doesn't change sys_readahead and madvise(MADV_WILLNEED)
behavior introduced by 6d2be915e589b58 ("mm/readahead.c: fix readahead
failure for memoryless NUMA nodes and limit readahead pages").

Signed-off-by: Roman Gushchin
Cc: Raghavendra K T
Cc: Jan Kara
Cc: Wu Fengguang
Cc: David Rientjes
Cc: Konstantin Khlebnikov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
17 Oct, 2015
1 commit
-
Commit 6afdb859b710 ("mm: do not ignore mapping_gfp_mask in page cache
allocation paths") has caught some users of hardcoded GFP_KERNEL used in
the page cache allocation paths. This, however, wasn't complete and
there were others which went unnoticed.

Dave Chinner has reported the following deadlock for xfs on loop device:
: With the recent merge of the loop device changes, I'm now seeing
: XFS deadlock on my single CPU, 1GB RAM VM running xfs/073.
:
: The deadlock is as follows:
:
: kloopd1: loop_queue_read_work
: xfs_file_iter_read
: lock XFS inode XFS_IOLOCK_SHARED (on image file)
: page cache read (GFP_KERNEL)
: radix tree alloc
: memory reclaim
: reclaim XFS inodes
: log force to unpin inodes
:
:
: xfs-cil/loop1:
: xlog_cil_push
: xlog_write
:
: xlog_state_get_iclog_space()
:
:
:
: kloopd1: loop_queue_write_work
: xfs_file_write_iter
: lock XFS inode XFS_IOLOCK_EXCL (on image file)
:
:
: i.e. the kloopd, with its split read and write work queues, has
: introduced a dependency through memory reclaim.  i.e. that writes
: need to be able to progress for reads to make progress.
:
: The problem, fundamentally, is that mpage_readpages() does a
: GFP_KERNEL allocation, rather than paying attention to the inode's
: mapping gfp mask, which is set to GFP_NOFS.
:
: This didn't use to happen, because the loop device used to issue
: reads through the splice path and that does:
:
: error = add_to_page_cache_lru(page, mapping, index,
: GFP_KERNEL & mapping_gfp_mask(mapping));

This has changed with commit aa4d86163e4 ("block: loop: switch to VFS
ITER_BVEC").

This patch changes mpage_readpage{s} to follow the gfp mask set for the
mapping.  There are, however, other places which are doing basically
the same.

lustre:ll_dir_filler is doing GFP_KERNEL from a function which
apparently uses GFP_NOFS for other allocations so let's make this
consistent.

cifs:readpages_get_pages is called from cifs_readpages and
__cifs_readpages_from_fscache called from the same path obeys mapping
gfp.

ramfs_nommu_expand_for_mapping is hardcoding GFP_KERNEL as well, even
though it uses mapping_gfp_mask for the page allocation.

ext4_mpage_readpages is called from the page cache allocation path,
the same as read_pages and read_cache_pages.

As I've noted in my previous post, I cannot say I would be happy about
sprinkling mapping_gfp_mask all over the place.  It sounds like we
should drop the gfp_mask argument altogether and use it internally in
__add_to_page_cache_locked; that would require all the filesystems to
use the mapping gfp consistently, which I am not sure is the case here.
From a quick glance it seems that some filesystems use it all the time
while others are selective.
Signed-off-by: Michal Hocko
Reported-by: Dave Chinner
Cc: "Theodore Ts'o"
Cc: Ming Lei
Cc: Andreas Dilger
Cc: Oleg Drokin
Cc: Al Viro
Cc: Christoph Hellwig
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
02 Jun, 2015
1 commit
-
In several places, bdi_congested() and its wrappers are used to
determine whether more IOs should be issued. With cgroup writeback
support, this question can't be answered solely based on the bdi
(backing_dev_info). It's dependent on whether the filesystem and bdi
support cgroup writeback and the blkcg the inode is associated with.

This patch implements inode_congested() and its wrappers, which take
@inode and determine the congestion state considering cgroup
writeback.  The new functions replace bdi_*congested() calls in places
where the query is about a specific inode and task.
There are several filesystem users which also fit this criteria, but
they should be updated when each filesystem implements cgroup
writeback support.

v2: Now that a given inode is associated with only one wb, congestion
state can be determined independent from the asking task. Drop
@task. Spotted by Vivek. Also, converted to take @inode instead
of @mapping and renamed to inode_congested().

Signed-off-by: Tejun Heo
Cc: Jens Axboe
Cc: Jan Kara
Cc: Vivek Goyal
Signed-off-by: Jens Axboe
21 Jan, 2015
1 commit
-
Now that we got rid of the bdi abuse on character devices, we can
always use sb->s_bdi to get at the backing_dev_info for a file, except
for the block device special case.  Export inode_to_bdi and replace
uses of mapping->backing_dev_info with it to prepare for the removal of
mapping->backing_dev_info.
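A sketch of the exported helper, with the block device special case
noted above (details approximate):

    struct backing_dev_info *inode_to_bdi(struct inode *inode)
    {
        struct super_block *sb = inode->i_sb;

        /* block devices carry their own bdi, not the superblock's */
        if (sb_is_blkdev_sb(sb))
            return blk_get_backing_dev_info(I_BDEV(inode));

        return sb->s_bdi;
    }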
Signed-off-by: Christoph Hellwig
Reviewed-by: Tejun Heo
Reviewed-by: Jan Kara
Signed-off-by: Jens Axboe
07 Aug, 2014
1 commit
-
count_history_pages only calls page_cache_prev_hole in rcu_lock
context using the address_space mapping.  There's no need to have
file_ra_state here.

Signed-off-by: Fabian Frederick
Acked-by: Fengguang Wu
Acked-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
08 Apr, 2014
1 commit
-
Commit f9acc8c7b35a ("readahead: sanify file_ra_state names") left
ra_submit with a single function call.

Move ra_submit to internal.h and inline it to save some stack.  Thanks
to Andrew Morton for commenting on the different versions.
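The resulting inline in mm/internal.h, sketched:

    static inline unsigned long ra_submit(struct file_ra_state *ra,
                                          struct address_space *mapping,
                                          struct file *filp)
    {
        return __do_page_cache_readahead(mapping, filp,
                                         ra->start, ra->size, ra->async_size);
    }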
Signed-off-by: Fabian Frederick
Suggested-by: Andrew Morton
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
04 Apr, 2014
3 commits
-
Currently max_sane_readahead() returns zero on a cpu whose NUMA node
has no local memory, which leads to readahead failure.  Fix this
readahead failure by returning the minimum of (requested pages, 512).
Users running applications on a memory-less cpu which need readahead,
such as streaming applications, see a considerable boost in
performance.

Result:
fadvise experiment with FADV_WILLNEED on a PPC machine having a
memoryless CPU, with a 1GB testfile (12 iterations), yielded around a
46.66% improvement.

fadvise experiment with FADV_WILLNEED on an x240 machine (32GB* 4G RAM
NUMA machine), with a 1GB testfile (12 iterations), showed no impact on
the normal NUMA cases with the patch.

  Kernel     Avg      Stddev
  base       7.4975   3.92%
  patched    7.4174   3.26%
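The new clamp, sketched (MAX_READAHEAD expressed so the value is
PAGE_SIZE independent; treat the exact definition as an assumption):

    #define MAX_READAHEAD   ((512*4096)/PAGE_CACHE_SIZE)

    unsigned long max_sane_readahead(unsigned long nr)
    {
        return min(nr, MAX_READAHEAD);
    }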
[Andrew: making return value PAGE_SIZE independent]

Suggested-by: Linus Torvalds
Signed-off-by: Raghavendra K T
Acked-by: Jan Kara
Cc: Wu Fengguang
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
shmem mappings already contain exceptional entries where swap slot
information is remembered.

To be able to store eviction information for regular page cache, prepare
every site dealing with the radix trees directly to handle entries other
than pages.

The common lookup functions will filter out non-page entries and return
NULL for page cache holes, just as before. But provide a raw version of
the API which returns non-page entries as well, and switch shmem over to
use it.

Signed-off-by: Johannes Weiner
Reviewed-by: Rik van Riel
Reviewed-by: Minchan Kim
Cc: Andrea Arcangeli
Cc: Bob Liu
Cc: Christoph Hellwig
Cc: Dave Chinner
Cc: Greg Thelen
Cc: Hugh Dickins
Cc: Jan Kara
Cc: KOSAKI Motohiro
Cc: Luigi Semenzato
Cc: Mel Gorman
Cc: Metin Doslu
Cc: Michel Lespinasse
Cc: Ozgun Erdogan
Cc: Peter Zijlstra
Cc: Roman Gushchin
Cc: Ryan Mallon
Cc: Tejun Heo
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The radix tree hole searching code is only used for page cache, for
example the readahead code trying to get a picture of the area
surrounding a fault.

It sufficed to rely on the radix tree definition of holes, which is
"empty tree slot". But this is about to change, though, as shadow page
descriptors will be stored in the page cache after the actual pages get
evicted from memory.

Move the functions over to mm/filemap.c and make them native page cache
operations, where they can later be adapted to handle the new definition
of "page cache hole".Signed-off-by: Johannes Weiner
Signed-off-by: Johannes Weiner
Reviewed-by: Rik van Riel
Reviewed-by: Minchan Kim
Acked-by: Mel Gorman
Cc: Andrea Arcangeli
Cc: Bob Liu
Cc: Christoph Hellwig
Cc: Dave Chinner
Cc: Greg Thelen
Cc: Hugh Dickins
Cc: Jan Kara
Cc: KOSAKI Motohiro
Cc: Luigi Semenzato
Cc: Metin Doslu
Cc: Michel Lespinasse
Cc: Ozgun Erdogan
Cc: Peter Zijlstra
Cc: Roman Gushchin
Cc: Ryan Mallon
Cc: Tejun Heo
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
30 Jan, 2014
1 commit
-
Commit 63d0f0a3c7e1 ("mm/readahead.c:do_readhead(): don't check for
->readpage") unintentionally made do_readahead return 0 for all valid
files regardless of whether readahead was supported, rather than the
expected -EINVAL. This gets forwarded on to userspace, and results in
sys_readahead appearing to succeed in cases that don't make sense (e.g.
when called on pipes or sockets). This issue is detected by the LTP
readahead01 testcase.

As the exact return value of force_page_cache_readahead is currently
never used, we can simplify it to return only 0 or -EINVAL (when
readpage or readpages is missing). With that in place we can simply
forward on the return value of force_page_cache_readahead in
do_readahead.

This patch performs said change, restoring the expected semantics.
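The resulting shape of do_readahead(), sketched:

    static ssize_t do_readahead(struct address_space *mapping, struct file *filp,
                                pgoff_t index, unsigned long nr)
    {
        if (!mapping || !mapping->a_ops)
            return -EINVAL;

        /* 0 on success, -EINVAL when ->readpage and ->readpages are missing */
        return force_page_cache_readahead(mapping, filp, index, nr);
    }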
Signed-off-by: Mark Rutland
Acked-by: Kirill A. Shutemov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
13 Nov, 2013
2 commits
-
The kernel's readahead algorithm sometimes interprets random read
accesses as sequential and triggers unnecessary data prefetching from
the storage device (impacting random read average latency).

In order to identify sequential cache read misses, the readahead
algorithm intends to check whether offset - previous offset == 1
(trivial sequential reads) or offset - previous offset == 0 (sequential
reads not aligned on page boundary):

  if (offset - (ra->prev_pos >> PAGE_CACHE_SHIFT) <= 1UL)

Because ra->prev_pos is a 64-bit loff_t, the subtraction is carried out
in a signed 64-bit type; when the previous offset is greater than the
current offset (which happens on random patterns), the result is
negative, the if condition is true, and the access is wrongly
interpreted as sequential.  An unnecessary data prefetching is
triggered, impacting the average random read latency.

Storing the previous offset value in a "pgoff_t" variable (unsigned
long) fixes the sequential read detection logic.
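A sketch of the fix (not the literal diff):

    pgoff_t prev_offset = ra->prev_pos >> PAGE_CACHE_SHIFT;

    /* unsigned subtraction wraps to a huge value instead of going negative */
    if (offset - prev_offset <= 1UL)
        /* treat as sequential */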
Signed-off-by: Damien Ramonda
Reviewed-by: Fengguang Wu
Acked-by: Pierre Tardy
Acked-by: David Cohen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The callee force_page_cache_readahead() already does this and unlike
do_readahead(), force_page_cache_readahead() remembers to check for
->readpages() as well.

Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
12 Sep, 2013
1 commit
-
This helps performance on moderately dense random reads on SSD.
Transaction-Per-Second numbers provided by Taobao:
              QPS   case
              -------------------------------------------------------
              7536  disable context readahead totally
  w/ patch:   7129  slower size rampup and start RA on the 3rd read
              6717  slower size rampup
  w/o patch:  5581  unmodified context readahead

Before, readahead would be started whenever reading page N+1 if page N
was read recently.  After the patch, we'll only start readahead when *three*
random reads happen to access pages N, N+1, N+2. The probability of this
happening is extremely low for pure random reads, unless they are very
dense, which actually deserves some readahead.

Also start with a smaller readahead window.  The impact on interleaved
sequential reads should be small, because for a long-running stream the
small readahead window ramp-up phase is negligible.

The context readahead actually benefits clustered random reads on HDDs,
whose seek cost is pretty high. However as SSD is increasingly used for
random read workloads it's better for the context readahead to concentrate
on interleaved sequential reads.

Another SSD rand read test from Miao
# file size: 2GB
# read IO amount: 625MB
sysbench --test=fileio \
--max-requests=10000 \
--num-threads=1 \
--file-num=1 \
--file-block-size=64K \
--file-test-mode=rndrd \
--file-fsync-freq=0 \
--file-fsync-end=off run

shows the performance of btrfs growing from 69MB/s to 121MB/s, and
ext4 from 104MB/s to 121MB/s.

Signed-off-by: Wu Fengguang
Tested-by: Tao Ma
Tested-by: Miao Xie
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
22 May, 2013
1 commit
-
Currently there is no way to truncate a partial page where the end
truncate point is not at the end of the page.  This is because it was
not needed and the existing functionality was enough for filesystem
truncate operations to work properly.  However, more filesystems now
support the punch hole feature and they can benefit from mm supporting
truncation of a page just up to a certain point.

Specifically, with this functionality truncate_inode_pages_range() can
be changed so it supports truncating partial page at the end of the
range (currently it will BUG_ON() if 'end' is not at the end of the
page).

This commit changes the invalidatepage() address space operation
prototype to accept range to be invalidated and update all the instances
for it.

We also change block_invalidatepage() in the same way and actually make
use of the new length argument, implementing range invalidation.

Actual file system implementations will follow, except for the file
systems where the changes are really simple and should not change the
behaviour in any way.  An implementation of truncate_page_range() which
will be able to accept page-unaligned ranges will follow as well.
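The changed operation's prototype, for reference (the new length
argument bounds the range to invalidate):

    void (*invalidatepage)(struct page *page, unsigned int offset,
                           unsigned int length);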
Signed-off-by: Lukas Czerner
Cc: Andrew Morton
Cc: Hugh Dickins
04 Mar, 2013
1 commit
-
... and convert a bunch of SYSCALL_DEFINE ones to SYSCALL_DEFINE<n>,
killing the boilerplate crap around them.

Signed-off-by: Al Viro
27 Sep, 2012
2 commits
-
Signed-off-by: Al Viro
-
Signed-off-by: Al Viro
30 May, 2012
1 commit
-
It is better to define readahead(2) in mm/readahead.c than in
mm/filemap.c.

Signed-off-by: Cong Wang
Cc: Fengguang Wu
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
31 Oct, 2011
1 commit
-
The files changed within are only using the EXPORT_SYMBOL
macro variants. They are not using core modular infrastructure
and hence don't need module.h but only the export.h header.

Signed-off-by: Paul Gortmaker
25 May, 2011
1 commit
-
Pass __GFP_NORETRY|__GFP_NOWARN for readahead page allocations.

Readahead page allocations are completely optional.  They are OK to
fail and in particular shall not trigger OOM on themselves.
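For reference, a sketch of the readahead allocation helper with these
flags (treat the exact form as an assumption):

    #define page_cache_alloc_readahead(x) \
        __page_cache_alloc(mapping_gfp_mask(x) | \
                           __GFP_COLD | __GFP_NORETRY | __GFP_NOWARN)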
Reported-by: Dave Young
Reviewed-by: KOSAKI Motohiro
Signed-off-by: Wu Fengguang
Reviewed-by: Minchan Kim
Reviewed-by: Pekka Enberg
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
10 Mar, 2011
2 commits
-
Signed-off-by: Jens Axboe
-
Code has been converted over to the new explicit on-stack plugging,
and delay users have been converted to use the new API for that.
So let's kill off the old plugging along with aops->sync_page().

Signed-off-by: Jens Axboe
25 May, 2010
1 commit
-
Fix a wrong comment over page_cache_async_readahead().
Signed-off-by: Huang Shijie
Acked-by: Wu Fengguang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
07 Apr, 2010
1 commit
-
btrfs relocate_file_extent_cluster() calls us with NULL filp:
[ 4005.426805] BUG: unable to handle kernel NULL pointer dereference at 00000021
[ 4005.426818] IP: [] page_cache_sync_readahead+0x18/0x3e

Signed-off-by: Wu Fengguang
Cc: Yan Zheng
Reported-by: Kirill A. Shutemov
Tested-by: Kirill A. Shutemov
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
30 Mar, 2010
1 commit
-
…it slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

The percpu.h -> slab.h dependency is about to be removed.  Prepare for
this change by updating users of gfp and slab facilities to include
those
headers directly instead of assuming availability. As this conversion
needs to touch a large number of source files, the following script is
used as the basis of the conversion.

  http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the following:
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.

The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.

2. Each error was manually checked.  Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.

6. percpu.h was updated not to include slab.h.

7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).

* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as a bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers, which should be easily discoverable on most builds of the
specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
07 Mar, 2010
1 commit
-
This fixes inefficient page-by-page reads on POSIX_FADV_RANDOM.
POSIX_FADV_RANDOM used to set ra_pages=0, which leads to poor performance:
a 16K read will be carried out in 4 _sync_ 1-page reads.

In other places, ra_pages==0 means
- it's ramfs/tmpfs/hugetlbfs/sysfs/configfs
- some IO error happened
where multi-page read IO won't help or should be avoided.

POSIX_FADV_RANDOM actually wants different semantics: to disable the
*heuristic* readahead algorithm and to use a dumb one which faithfully
submits read IO for whatever the application requests.

So introduce a flag FMODE_RANDOM for POSIX_FADV_RANDOM.
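A sketch of the fadvise side (the flag is set under f_lock; treat
details as approximate):

    case POSIX_FADV_RANDOM:
        spin_lock(&file->f_lock);
        file->f_mode |= FMODE_RANDOM;
        spin_unlock(&file->f_lock);
        break;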
Note that the random hint is not likely to help random reads performance
noticeably. And it may be too permissive on huge request size (its IO
size is not limited by read_ahead_kb).

In Quentin's report (http://lkml.org/lkml/2009/12/24/145), the overall
(NFS read) performance of the application increased by 313%!

Tested-by: Quentin Barnes
Signed-off-by: Wu Fengguang
Cc: Nick Piggin
Cc: Andi Kleen
Cc: Steven Whitehouse
Cc: David Howells
Cc: Jonathan Corbet
Cc: Al Viro
Cc: Christoph Hellwig
Cc: Trond Myklebust
Cc: Chuck Lever
Cc: [2.6.33.x]
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
18 Dec, 2009
1 commit
-
I added blk_run_backing_dev on page_cache_async_readahead so readahead
I/O is unplugged to improve throughput, especially on RAID
environments.

The normal case is, if page N becomes uptodate at time T(N), then
T(N) <= T(N+1) holds.
Acked-by: Wu Fengguang
Cc: Jens Axboe
Cc: KOSAKI Motohiro
Tested-by: Ronald
Cc: Bart Van Assche
Cc: Vladislav Bolkhovitin
Cc: Randy Dunlap
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
17 Jun, 2009
7 commits
-
Introduce page cache context based readahead algorithm.
This is to better support concurrent read streams in general.

RATIONALE
---------
The current readahead algorithm detects interleaved reads in a _passive_ way.
Given a sequence of interleaved streams 1,1001,2,1002,3,4,1003,5,1004,1005,6,...
By checking for (offset == prev_offset + 1), it will discover the sequentialness
between 3,4 and between 1004,1005, and start doing sequential readahead for the
individual streams since page 4 and page 1005.

The context readahead algorithm guarantees to discover the sequentialness no
matter how the streams are interleaved. For the above example, it will start
sequential readahead since page 2 and 1002.

The trick is to poke for page @offset-1 in the page cache when it has no other
clues on the sequentialness of request @offset: if the current request belongs
to a sequential stream, that stream must have accessed page @offset-1 recently,
and the page will still be cached now. So if page @offset-1 is there, we can
take request @offset as a sequential access.
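A sketch of this poke (page_is_cached is a hypothetical helper name;
the real code is a radix tree lookup under RCU):

    static int page_is_cached(struct address_space *mapping, pgoff_t offset)
    {
        int cached;

        rcu_read_lock();
        cached = radix_tree_lookup(&mapping->page_tree, offset) != NULL;
        rcu_read_unlock();

        return cached;
    }

    /* no other clue about @offset: treat it as sequential if @offset-1 is cached */
    if (page_is_cached(mapping, offset - 1))
        /* ... start context readahead ... */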
BENEFICIARIES
-------------
- strictly interleaved reads i.e. 1,1001,2,1002,3,1003,...
the current readahead will take them as silly random reads;
the context readahead will take them as two sequential streams.

- cooperative IO processes i.e. NFS and SCST
They create a thread pool, farming off (sequential) IO requests to different
threads which will be performing interleaved IO.

It was not easy (or possible) to reliably tell from file->f_ra all those
cooperative processes working on the same sequential stream, since they will
have different file->f_ra instances. And NFSD's file->f_ra is particularly
unusable, since their file objects are dynamically created for each request.
The nfsd does have code trying to restore the f_ra bits, but it is not
satisfactory.

The new scheme is to detect the sequential pattern via looking up the page
cache, which provides one single and consistent view of the pages recently
accessed.  That makes sequential detection for cooperative processes
possible.

USER REPORT
-----------
Vladislav recommends the addition of context readahead as a result of his SCST
benchmarks. It leads to 6%~40% performance gains in various cases and achieves
equal performance in others.  http://lkml.org/lkml/2009/3/19/239

OVERHEADS
---------
In theory, it introduces one extra page cache lookup per random read. However
the below benchmark shows context readahead to be slightly faster, wondering..

Randomly reading 200MB amount of data on a sparse file, repeat 20 times for
each block size.  The average throughputs are:

                      original ra   context ra   gain
   4K random reads:    65.561MB/s    65.648MB/s  +0.1%
  16K random reads:   124.767MB/s   124.951MB/s  +0.1%
  64K random reads:   162.123MB/s   162.278MB/s  +0.1%

Cc: Jens Axboe
Cc: Jeff Moyer
Tested-by: Vladislav Bolkhovitin
Signed-off-by: Wu Fengguang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Split all readahead cases, and move the random one to bottom.
No behavior changes.
This is to prepare for the introduction of context readahead, and make it
easy for inserting accounting/tracing points for each case.

Signed-off-by: Wu Fengguang
Cc: Vladislav Bolkhovitin
Cc: Jens Axboe
Cc: Jeff Moyer
Cc: Nick Piggin
Cc: Ying Han
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Mmap read-around now shares the same code style and data structure with
readahead code.

This also removes do_page_cache_readahead().  Its last user, mmap
read-around, has been changed to call ra_submit().

The no-readahead-if-congested logic is dropped along the way.  Users
will be pretty sensitive about the slow loading of executables, so it's
unfavorable to disable mmap read-around on a congested queue.

[akpm@linux-foundation.org: coding-style fixes]
Cc: Nick Piggin
Signed-off-by: Fengguang Wu
Cc: Ying Han
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The readahead call scheme is error-prone in that it expects the call sites
to check for async readahead after doing a sync one.  I.e.

    if (!page)
        page_cache_sync_readahead();
    page = find_get_page();
    if (page && PageReadahead(page))
        page_cache_async_readahead();

This is because PG_readahead could be set by a sync readahead for the
_current_ newly faulted in page, and the readahead code simply expects one
more callback on the same page to start the async readahead. If the
caller fails to do so, it will miss the PG_readahead bits and will
never be able to start an async readahead.

Eliminate this insane constraint by piggy-backing the async part into
the current readahead window.

Now if an async readahead should be started immediately after a sync
one,
the readahead logic itself will do it. So the following code becomes
valid (the 'else' in particular):

    if (!page)
        page_cache_sync_readahead();
    else if (PageReadahead(page))
        page_cache_async_readahead();

Cc: Nick Piggin
Signed-off-by: Wu Fengguang
Cc: Ying Han
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Make sure interleaved readahead size is larger than request size. This
also makes the readahead window grow more quickly.

Reported-by: Xu Chenfeng
Signed-off-by: Wu Fengguang
Cc: Ying Han
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
(hit_readahead_marker != 0) means the page at @offset is present, so we
can search for a non-present page starting from @offset+1.

Reported-by: Xu Chenfeng
Signed-off-by: Wu Fengguang
Cc: Ying Han
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Just in case someone aggressively sets a huge readahead size.
Cc: Nick Piggin
Signed-off-by: Wu Fengguang
Cc: Ying Han
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds