Commit d618a27c7808608e376de803a4fd3940f33776c2

Authored by Mel Gorman
Committed by Jiri Slaby
1 parent 967e64285a

mm: non-atomically mark page accessed during page cache allocation where possible

commit 2457aec63745e235bcafb7ef312b182d8682f0fc upstream.

aops->write_begin may allocate a new page and make it visible only to have
mark_page_accessed called almost immediately after.  Once the page is
visible the atomic operations are necessary, which is noticeable overhead
when writing to an in-memory filesystem like tmpfs but should also be
noticeable with fast storage.  The objective of the patch is to initialise
the accessed information with non-atomic operations before the page is
visible.

The bulk of filesystems directly or indirectly use
grab_cache_page_write_begin or find_or_create_page for the initial
allocation of a page cache page.  This patch adds an init_page_accessed()
helper which behaves like the first call to mark_page_accessed() but may
be called before the page is visible and can be done non-atomically.
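
As a rough sketch of the intended ordering (the real code is in the
pagecache_get_page() implementation in the mm/filemap.c hunk below), the
allocation path becomes:

	page = __page_cache_alloc(cache_gfp_mask);
	if (!page)
		return NULL;

	/* Not visible to other CPUs yet, so the plain __SetPageReferenced()
	   done by init_page_accessed() is sufficient; no atomics needed. */
	if (fgp_flags & FGP_ACCESSED)
		init_page_accessed(page);

	/* After this the page is visible and any later reference must use
	   the atomic mark_page_accessed(). */
	err = add_to_page_cache_lru(page, mapping, offset, radix_gfp_mask);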

The primary APIs of concern in this case are the following and are used
by most filesystems.

	find_get_page
	find_lock_page
	find_or_create_page
	grab_cache_page_nowait
	grab_cache_page_write_begin

All of them are very similar in detail, so the patch creates a core helper
pagecache_get_page() which takes a flags parameter that affects its
behaviour, such as whether the page should be marked accessed or not.  The
old API is preserved but is basically a thin wrapper around this core
function.
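
For illustration, each of the old entry points reduces to a thin wrapper
like the one below; the full set of wrappers is in the
include/linux/pagemap.h hunk further down.

	static inline struct page *find_or_create_page(struct address_space *mapping,
			pgoff_t offset, gfp_t gfp_mask)
	{
		return pagecache_get_page(mapping, offset,
				FGP_LOCK|FGP_ACCESSED|FGP_CREAT,
				gfp_mask, gfp_mask & GFP_RECLAIM_MASK);
	}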

Each of the filesystems is then updated to avoid calling
mark_page_accessed when it is known that the VM interfaces have already
done the job.  There is a slight snag in that the timing of the
mark_page_accessed() call has now changed, so in rare cases it's possible a
page gets to the end of the LRU as PageReferenced whereas previously it
might have been repromoted.  This is expected to be rare but it's worth the
filesystem people thinking about it in case they see a problem with the
timing change.  It is also the case that some filesystems may now be
marking pages accessed that previously were not, but it makes sense that
filesystems have consistent behaviour in this regard.
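
Schematically, the caller-side change is just dropping the explicit call
after a helper that already passed FGP_ACCESSED, e.g. in
generic_perform_write() (see the mm/filemap.c hunk below):

	status = a_ops->write_begin(file, mapping, pos, bytes, flags,
				    &page, &fsdata);
	...
	copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
	flush_dcache_page(page);
	/* mark_page_accessed(page) is no longer needed here; the page was
	   initialised as accessed when write_begin allocated it. */
	status = a_ops->write_end(file, mapping, pos, bytes, copied,
				  page, fsdata);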

The test case used to evaluate this is a simple dd of a large file done
multiple times with the file deleted on each iteration.  The size of the
file is 1/10th of physical memory to avoid dirty page balancing.  In the
async case it is possible that the workload completes without even
hitting the disk and will have variable results, but it highlights the
impact of mark_page_accessed for async IO.  The sync results are expected
to be more stable.  The exception is tmpfs, where the normal case is for
the "IO" to not hit the disk.

The test machine was single socket and UMA to avoid any scheduling or NUMA
artifacts.  Throughput and wall times are presented for sync IO; only wall
times are shown for async, as the granularity reported by dd and the
variability are unsuitable for comparison.  As async results were variable
due to writeback timings, I'm only reporting the maximum figures.  The sync
results were stable enough to make the mean and stddev uninteresting.

The performance results are reported based on a run with no profiling.
Profile data is based on a separate run with oprofile running.

async dd
                                    3.15.0-rc3            3.15.0-rc3
                                       vanilla           accessed-v2
ext3    Max      elapsed     13.9900 (  0.00%)     11.5900 ( 17.16%)
tmpfs   Max      elapsed      0.5100 (  0.00%)      0.4900 (  3.92%)
btrfs   Max      elapsed     12.8100 (  0.00%)     12.7800 (  0.23%)
ext4    Max      elapsed     18.6000 (  0.00%)     13.3400 ( 28.28%)
xfs     Max      elapsed     12.5600 (  0.00%)      2.0900 ( 83.36%)

The XFS figure is a bit strange as it managed to avoid a worst case by
sheer luck but the average figures looked reasonable.

fs        samples  percentage  kernel image                     symbol
ext3       86107    0.9783  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
ext3       23833    0.2710  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
ext3        5036    0.0573  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
ext4       64566    0.8961  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
ext4        5322    0.0713  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
ext4        2869    0.0384  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
xfs        62126    1.7675  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
xfs         1904    0.0554  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
xfs          103    0.0030  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
btrfs      10655    0.1338  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
btrfs       2020    0.0273  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
btrfs        587    0.0079  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
tmpfs      59562    3.2628  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
tmpfs       1210    0.0696  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
tmpfs         94    0.0054  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed

[akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Tested-by: Prabhakar Lad <prabhakar.csengg@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>

Showing 17 changed files with 219 additions and 162 deletions

fs/btrfs/extent_io.c
... ... @@ -4446,7 +4446,8 @@
4446 4446 spin_unlock(&eb->refs_lock);
4447 4447 }
4448 4448  
4449   -static void mark_extent_buffer_accessed(struct extent_buffer *eb)
  4449 +static void mark_extent_buffer_accessed(struct extent_buffer *eb,
  4450 + struct page *accessed)
4450 4451 {
4451 4452 unsigned long num_pages, i;
4452 4453  
... ... @@ -4455,7 +4456,8 @@
4455 4456 num_pages = num_extent_pages(eb->start, eb->len);
4456 4457 for (i = 0; i < num_pages; i++) {
4457 4458 struct page *p = extent_buffer_page(eb, i);
4458   - mark_page_accessed(p);
  4459 + if (p != accessed)
  4460 + mark_page_accessed(p);
4459 4461 }
4460 4462 }
4461 4463  
... ... @@ -4476,7 +4478,7 @@
4476 4478 eb = radix_tree_lookup(&tree->buffer, start >> PAGE_CACHE_SHIFT);
4477 4479 if (eb && atomic_inc_not_zero(&eb->refs)) {
4478 4480 rcu_read_unlock();
4479   - mark_extent_buffer_accessed(eb);
  4481 + mark_extent_buffer_accessed(eb, NULL);
4480 4482 return eb;
4481 4483 }
4482 4484 rcu_read_unlock();
... ... @@ -4504,7 +4506,7 @@
4504 4506 spin_unlock(&mapping->private_lock);
4505 4507 unlock_page(p);
4506 4508 page_cache_release(p);
4507   - mark_extent_buffer_accessed(exists);
  4509 + mark_extent_buffer_accessed(exists, p);
4508 4510 goto free_eb;
4509 4511 }
4510 4512  
... ... @@ -4519,7 +4521,6 @@
4519 4521 attach_extent_buffer_page(eb, p);
4520 4522 spin_unlock(&mapping->private_lock);
4521 4523 WARN_ON(PageDirty(p));
4522   - mark_page_accessed(p);
4523 4524 eb->pages[i] = p;
4524 4525 if (!PageUptodate(p))
4525 4526 uptodate = 0;
... ... @@ -4549,7 +4550,7 @@
4549 4550 }
4550 4551 spin_unlock(&tree->buffer_lock);
4551 4552 radix_tree_preload_end();
4552   - mark_extent_buffer_accessed(exists);
  4553 + mark_extent_buffer_accessed(exists, NULL);
4553 4554 goto free_eb;
4554 4555 }
4555 4556 /* add one reference for the tree */
... ... @@ -4595,7 +4596,7 @@
4595 4596 eb = radix_tree_lookup(&tree->buffer, start >> PAGE_CACHE_SHIFT);
4596 4597 if (eb && atomic_inc_not_zero(&eb->refs)) {
4597 4598 rcu_read_unlock();
4598   - mark_extent_buffer_accessed(eb);
  4599 + mark_extent_buffer_accessed(eb, NULL);
4599 4600 return eb;
4600 4601 }
4601 4602 rcu_read_unlock();
fs/btrfs/file.c
... ... @@ -471,11 +471,12 @@
471 471 for (i = 0; i < num_pages; i++) {
472 472 /* page checked is some magic around finding pages that
473 473 * have been modified without going through btrfs_set_page_dirty
474   - * clear it here
  474 + * clear it here. There should be no need to mark the pages
  475 + * accessed as prepare_pages should have marked them accessed
  476 + * in prepare_pages via find_or_create_page()
475 477 */
476 478 ClearPageChecked(pages[i]);
477 479 unlock_page(pages[i]);
478   - mark_page_accessed(pages[i]);
479 480 page_cache_release(pages[i]);
480 481 }
481 482 }
fs/buffer.c
... ... @@ -227,7 +227,7 @@
227 227 int all_mapped = 1;
228 228  
229 229 index = block >> (PAGE_CACHE_SHIFT - bd_inode->i_blkbits);
230   - page = find_get_page(bd_mapping, index);
  230 + page = find_get_page_flags(bd_mapping, index, FGP_ACCESSED);
231 231 if (!page)
232 232 goto out;
233 233  
... ... @@ -1366,12 +1366,13 @@
1366 1366 struct buffer_head *bh = lookup_bh_lru(bdev, block, size);
1367 1367  
1368 1368 if (bh == NULL) {
  1369 + /* __find_get_block_slow will mark the page accessed */
1369 1370 bh = __find_get_block_slow(bdev, block);
1370 1371 if (bh)
1371 1372 bh_lru_install(bh);
1372   - }
1373   - if (bh)
  1373 + } else
1374 1374 touch_buffer(bh);
  1375 +
1375 1376 return bh;
1376 1377 }
1377 1378 EXPORT_SYMBOL(__find_get_block);
fs/ext4/mballoc.c
... ... @@ -1044,6 +1044,8 @@
1044 1044 * allocating. If we are looking at the buddy cache we would
1045 1045 * have taken a reference using ext4_mb_load_buddy and that
1046 1046 * would have pinned buddy page to page cache.
  1047 + * The call to ext4_mb_get_buddy_page_lock will mark the
  1048 + * page accessed.
1047 1049 */
1048 1050 ret = ext4_mb_get_buddy_page_lock(sb, group, &e4b);
1049 1051 if (ret || !EXT4_MB_GRP_NEED_INIT(this_grp)) {
... ... @@ -1062,7 +1064,6 @@
1062 1064 ret = -EIO;
1063 1065 goto err;
1064 1066 }
1065   - mark_page_accessed(page);
1066 1067  
1067 1068 if (e4b.bd_buddy_page == NULL) {
1068 1069 /*
... ... @@ -1082,7 +1083,6 @@
1082 1083 ret = -EIO;
1083 1084 goto err;
1084 1085 }
1085   - mark_page_accessed(page);
1086 1086 err:
1087 1087 ext4_mb_put_buddy_page_lock(&e4b);
1088 1088 return ret;
... ... @@ -1141,7 +1141,7 @@
1141 1141  
1142 1142 /* we could use find_or_create_page(), but it locks page
1143 1143 * what we'd like to avoid in fast path ... */
1144   - page = find_get_page(inode->i_mapping, pnum);
  1144 + page = find_get_page_flags(inode->i_mapping, pnum, FGP_ACCESSED);
1145 1145 if (page == NULL || !PageUptodate(page)) {
1146 1146 if (page)
1147 1147 /*
... ... @@ -1172,15 +1172,16 @@
1172 1172 ret = -EIO;
1173 1173 goto err;
1174 1174 }
  1175 +
  1176 + /* Pages marked accessed already */
1175 1177 e4b->bd_bitmap_page = page;
1176 1178 e4b->bd_bitmap = page_address(page) + (poff * sb->s_blocksize);
1177   - mark_page_accessed(page);
1178 1179  
1179 1180 block++;
1180 1181 pnum = block / blocks_per_page;
1181 1182 poff = block % blocks_per_page;
1182 1183  
1183   - page = find_get_page(inode->i_mapping, pnum);
  1184 + page = find_get_page_flags(inode->i_mapping, pnum, FGP_ACCESSED);
1184 1185 if (page == NULL || !PageUptodate(page)) {
1185 1186 if (page)
1186 1187 page_cache_release(page);
... ... @@ -1201,9 +1202,10 @@
1201 1202 ret = -EIO;
1202 1203 goto err;
1203 1204 }
  1205 +
  1206 + /* Pages marked accessed already */
1204 1207 e4b->bd_buddy_page = page;
1205 1208 e4b->bd_buddy = page_address(page) + (poff * sb->s_blocksize);
1206   - mark_page_accessed(page);
1207 1209  
1208 1210 BUG_ON(e4b->bd_bitmap_page == NULL);
1209 1211 BUG_ON(e4b->bd_buddy_page == NULL);
fs/f2fs/checkpoint.c
... ... @@ -70,7 +70,6 @@
70 70 goto repeat;
71 71 }
72 72 out:
73   - mark_page_accessed(page);
74 73 return page;
75 74 }
76 75  
fs/f2fs/node.c
... ... @@ -970,7 +970,6 @@
970 970 }
971 971 got_it:
972 972 BUG_ON(nid != nid_of_node(page));
973   - mark_page_accessed(page);
974 973 return page;
975 974 }
976 975  
... ... @@ -1026,7 +1025,6 @@
1026 1025 f2fs_put_page(page, 1);
1027 1026 return ERR_PTR(-EIO);
1028 1027 }
1029   - mark_page_accessed(page);
1030 1028 return page;
1031 1029 }
1032 1030  
fs/fuse/file.c
... ... @@ -988,8 +988,6 @@
988 988 tmp = iov_iter_copy_from_user_atomic(page, ii, offset, bytes);
989 989 flush_dcache_page(page);
990 990  
991   - mark_page_accessed(page);
992   -
993 991 if (!tmp) {
994 992 unlock_page(page);
995 993 page_cache_release(page);
fs/gfs2/aops.c
... ... @@ -517,7 +517,6 @@
517 517 p = kmap_atomic(page);
518 518 memcpy(buf + copied, p + offset, amt);
519 519 kunmap_atomic(p);
520   - mark_page_accessed(page);
521 520 page_cache_release(page);
522 521 copied += amt;
523 522 index++;
fs/gfs2/meta_io.c
... ... @@ -128,7 +128,8 @@
128 128 yield();
129 129 }
130 130 } else {
131   - page = find_lock_page(mapping, index);
  131 + page = find_get_page_flags(mapping, index,
  132 + FGP_LOCK|FGP_ACCESSED);
132 133 if (!page)
133 134 return NULL;
134 135 }
... ... @@ -145,7 +146,6 @@
145 146 map_bh(bh, sdp->sd_vfs, blkno);
146 147  
147 148 unlock_page(page);
148   - mark_page_accessed(page);
149 149 page_cache_release(page);
150 150  
151 151 return bh;
fs/ntfs/attrib.c
... ... @@ -1748,7 +1748,6 @@
1748 1748 if (page) {
1749 1749 set_page_dirty(page);
1750 1750 unlock_page(page);
1751   - mark_page_accessed(page);
1752 1751 page_cache_release(page);
1753 1752 }
1754 1753 ntfs_debug("Done.");
fs/ntfs/file.c
... ... @@ -2060,7 +2060,6 @@
2060 2060 }
2061 2061 do {
2062 2062 unlock_page(pages[--do_pages]);
2063   - mark_page_accessed(pages[do_pages]);
2064 2063 page_cache_release(pages[do_pages]);
2065 2064 } while (do_pages);
2066 2065 if (unlikely(status))
include/linux/page-flags.h
... ... @@ -198,6 +198,7 @@
198 198 TESTPAGEFLAG(Locked, locked)
199 199 PAGEFLAG(Error, error) TESTCLEARFLAG(Error, error)
200 200 PAGEFLAG(Referenced, referenced) TESTCLEARFLAG(Referenced, referenced)
  201 + __SETPAGEFLAG(Referenced, referenced)
201 202 PAGEFLAG(Dirty, dirty) TESTSCFLAG(Dirty, dirty) __CLEARPAGEFLAG(Dirty, dirty)
202 203 PAGEFLAG(LRU, lru) __CLEARPAGEFLAG(LRU, lru)
203 204 PAGEFLAG(Active, active) __CLEARPAGEFLAG(Active, active)
include/linux/pagemap.h
... ... @@ -248,12 +248,109 @@
248 248 pgoff_t page_cache_prev_hole(struct address_space *mapping,
249 249 pgoff_t index, unsigned long max_scan);
250 250  
  251 +#define FGP_ACCESSED 0x00000001
  252 +#define FGP_LOCK 0x00000002
  253 +#define FGP_CREAT 0x00000004
  254 +#define FGP_WRITE 0x00000008
  255 +#define FGP_NOFS 0x00000010
  256 +#define FGP_NOWAIT 0x00000020
  257 +
  258 +struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,
  259 + int fgp_flags, gfp_t cache_gfp_mask, gfp_t radix_gfp_mask);
  260 +
  261 +/**
  262 + * find_get_page - find and get a page reference
  263 + * @mapping: the address_space to search
  264 + * @offset: the page index
  265 + *
  266 + * Looks up the page cache slot at @mapping & @offset. If there is a
  267 + * page cache page, it is returned with an increased refcount.
  268 + *
  269 + * Otherwise, %NULL is returned.
  270 + */
  271 +static inline struct page *find_get_page(struct address_space *mapping,
  272 + pgoff_t offset)
  273 +{
  274 + return pagecache_get_page(mapping, offset, 0, 0, 0);
  275 +}
  276 +
  277 +static inline struct page *find_get_page_flags(struct address_space *mapping,
  278 + pgoff_t offset, int fgp_flags)
  279 +{
  280 + return pagecache_get_page(mapping, offset, fgp_flags, 0, 0);
  281 +}
  282 +
  283 +/**
  284 + * find_lock_page - locate, pin and lock a pagecache page
  285 + * pagecache_get_page - find and get a page reference
  286 + * @mapping: the address_space to search
  287 + * @offset: the page index
  288 + *
  289 + * Looks up the page cache slot at @mapping & @offset. If there is a
  290 + * page cache page, it is returned locked and with an increased
  291 + * refcount.
  292 + *
  293 + * Otherwise, %NULL is returned.
  294 + *
  295 + * find_lock_page() may sleep.
  296 + */
  297 +static inline struct page *find_lock_page(struct address_space *mapping,
  298 + pgoff_t offset)
  299 +{
  300 + return pagecache_get_page(mapping, offset, FGP_LOCK, 0, 0);
  301 +}
  302 +
  303 +/**
  304 + * find_or_create_page - locate or add a pagecache page
  305 + * @mapping: the page's address_space
  306 + * @index: the page's index into the mapping
  307 + * @gfp_mask: page allocation mode
  308 + *
  309 + * Looks up the page cache slot at @mapping & @offset. If there is a
  310 + * page cache page, it is returned locked and with an increased
  311 + * refcount.
  312 + *
  313 + * If the page is not present, a new page is allocated using @gfp_mask
  314 + * and added to the page cache and the VM's LRU list. The page is
  315 + * returned locked and with an increased refcount.
  316 + *
  317 + * On memory exhaustion, %NULL is returned.
  318 + *
  319 + * find_or_create_page() may sleep, even if @gfp_flags specifies an
  320 + * atomic allocation!
  321 + */
  322 +static inline struct page *find_or_create_page(struct address_space *mapping,
  323 + pgoff_t offset, gfp_t gfp_mask)
  324 +{
  325 + return pagecache_get_page(mapping, offset,
  326 + FGP_LOCK|FGP_ACCESSED|FGP_CREAT,
  327 + gfp_mask, gfp_mask & GFP_RECLAIM_MASK);
  328 +}
  329 +
  330 +/**
  331 + * grab_cache_page_nowait - returns locked page at given index in given cache
  332 + * @mapping: target address_space
  333 + * @index: the page index
  334 + *
  335 + * Same as grab_cache_page(), but do not wait if the page is unavailable.
  336 + * This is intended for speculative data generators, where the data can
  337 + * be regenerated if the page couldn't be grabbed. This routine should
  338 + * be safe to call while holding the lock for another page.
  339 + *
  340 + * Clear __GFP_FS when allocating the page to avoid recursion into the fs
  341 + * and deadlock against the caller's locked page.
  342 + */
  343 +static inline struct page *grab_cache_page_nowait(struct address_space *mapping,
  344 + pgoff_t index)
  345 +{
  346 + return pagecache_get_page(mapping, index,
  347 + FGP_LOCK|FGP_CREAT|FGP_NOFS|FGP_NOWAIT,
  348 + mapping_gfp_mask(mapping),
  349 + GFP_NOFS);
  350 +}
  351 +
251 352 struct page *find_get_entry(struct address_space *mapping, pgoff_t offset);
252   -struct page *find_get_page(struct address_space *mapping, pgoff_t offset);
253 353 struct page *find_lock_entry(struct address_space *mapping, pgoff_t offset);
254   -struct page *find_lock_page(struct address_space *mapping, pgoff_t offset);
255   -struct page *find_or_create_page(struct address_space *mapping, pgoff_t index,
256   - gfp_t gfp_mask);
257 354 unsigned find_get_entries(struct address_space *mapping, pgoff_t start,
258 355 unsigned int nr_entries, struct page **entries,
259 356 pgoff_t *indices);
... ... @@ -276,8 +373,6 @@
276 373 return find_or_create_page(mapping, index, mapping_gfp_mask(mapping));
277 374 }
278 375  
279   -extern struct page * grab_cache_page_nowait(struct address_space *mapping,
280   - pgoff_t index);
281 376 extern struct page * read_cache_page(struct address_space *mapping,
282 377 pgoff_t index, filler_t *filler, void *data);
283 378 extern struct page * read_cache_page_gfp(struct address_space *mapping,
include/linux/swap.h
... ... @@ -275,6 +275,7 @@
275 275 struct lruvec *lruvec, struct list_head *head);
276 276 extern void activate_page(struct page *);
277 277 extern void mark_page_accessed(struct page *);
  278 +extern void init_page_accessed(struct page *page);
278 279 extern void lru_add_drain(void);
279 280 extern void lru_add_drain_cpu(int cpu);
280 281 extern void lru_add_drain_all(void);
mm/filemap.c
... ... @@ -848,26 +848,6 @@
848 848 EXPORT_SYMBOL(find_get_entry);
849 849  
850 850 /**
851   - * find_get_page - find and get a page reference
852   - * @mapping: the address_space to search
853   - * @offset: the page index
854   - *
855   - * Looks up the page cache slot at @mapping & @offset. If there is a
856   - * page cache page, it is returned with an increased refcount.
857   - *
858   - * Otherwise, %NULL is returned.
859   - */
860   -struct page *find_get_page(struct address_space *mapping, pgoff_t offset)
861   -{
862   - struct page *page = find_get_entry(mapping, offset);
863   -
864   - if (radix_tree_exceptional_entry(page))
865   - page = NULL;
866   - return page;
867   -}
868   -EXPORT_SYMBOL(find_get_page);
869   -
870   -/**
871 851 * find_lock_entry - locate, pin and lock a page cache entry
872 852 * @mapping: the address_space to search
873 853 * @offset: the page cache index
... ... @@ -904,66 +884,84 @@
904 884 EXPORT_SYMBOL(find_lock_entry);
905 885  
906 886 /**
907   - * find_lock_page - locate, pin and lock a pagecache page
  887 + * pagecache_get_page - find and get a page reference
908 888 * @mapping: the address_space to search
909 889 * @offset: the page index
  890 + * @fgp_flags: PCG flags
  891 + * @gfp_mask: gfp mask to use if a page is to be allocated
910 892 *
911   - * Looks up the page cache slot at @mapping & @offset. If there is a
912   - * page cache page, it is returned locked and with an increased
913   - * refcount.
  893 + * Looks up the page cache slot at @mapping & @offset.
914 894 *
915   - * Otherwise, %NULL is returned.
  895 + * PCG flags modify how the page is returned
916 896 *
917   - * find_lock_page() may sleep.
918   - */
919   -struct page *find_lock_page(struct address_space *mapping, pgoff_t offset)
920   -{
921   - struct page *page = find_lock_entry(mapping, offset);
922   -
923   - if (radix_tree_exceptional_entry(page))
924   - page = NULL;
925   - return page;
926   -}
927   -EXPORT_SYMBOL(find_lock_page);
928   -
929   -/**
930   - * find_or_create_page - locate or add a pagecache page
931   - * @mapping: the page's address_space
932   - * @index: the page's index into the mapping
933   - * @gfp_mask: page allocation mode
  897 + * FGP_ACCESSED: the page will be marked accessed
  898 + * FGP_LOCK: Page is returned locked
  899 + * FGP_CREAT: If page is not present then a new page is allocated using
  900 + * @gfp_mask and added to the page cache and the VM's LRU
  901 + * list. The page is returned locked and with an increased
  902 + * refcount. Otherwise, %NULL is returned.
934 903 *
935   - * Looks up the page cache slot at @mapping & @offset. If there is a
936   - * page cache page, it is returned locked and with an increased
937   - * refcount.
  904 + * If FGP_LOCK or FGP_CREAT are specified then the function may sleep even
  905 + * if the GFP flags specified for FGP_CREAT are atomic.
938 906 *
939   - * If the page is not present, a new page is allocated using @gfp_mask
940   - * and added to the page cache and the VM's LRU list. The page is
941   - * returned locked and with an increased refcount.
942   - *
943   - * On memory exhaustion, %NULL is returned.
944   - *
945   - * find_or_create_page() may sleep, even if @gfp_flags specifies an
946   - * atomic allocation!
  907 + * If there is a page cache page, it is returned with an increased refcount.
947 908 */
948   -struct page *find_or_create_page(struct address_space *mapping,
949   - pgoff_t index, gfp_t gfp_mask)
  909 +struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,
  910 + int fgp_flags, gfp_t cache_gfp_mask, gfp_t radix_gfp_mask)
950 911 {
951 912 struct page *page;
952   - int err;
  913 +
953 914 repeat:
954   - page = find_lock_page(mapping, index);
955   - if (!page) {
956   - page = __page_cache_alloc(gfp_mask);
  915 + page = find_get_entry(mapping, offset);
  916 + if (radix_tree_exceptional_entry(page))
  917 + page = NULL;
  918 + if (!page)
  919 + goto no_page;
  920 +
  921 + if (fgp_flags & FGP_LOCK) {
  922 + if (fgp_flags & FGP_NOWAIT) {
  923 + if (!trylock_page(page)) {
  924 + page_cache_release(page);
  925 + return NULL;
  926 + }
  927 + } else {
  928 + lock_page(page);
  929 + }
  930 +
  931 + /* Has the page been truncated? */
  932 + if (unlikely(page->mapping != mapping)) {
  933 + unlock_page(page);
  934 + page_cache_release(page);
  935 + goto repeat;
  936 + }
  937 + VM_BUG_ON(page->index != offset);
  938 + }
  939 +
  940 + if (page && (fgp_flags & FGP_ACCESSED))
  941 + mark_page_accessed(page);
  942 +
  943 +no_page:
  944 + if (!page && (fgp_flags & FGP_CREAT)) {
  945 + int err;
  946 + if ((fgp_flags & FGP_WRITE) && mapping_cap_account_dirty(mapping))
  947 + cache_gfp_mask |= __GFP_WRITE;
  948 + if (fgp_flags & FGP_NOFS) {
  949 + cache_gfp_mask &= ~__GFP_FS;
  950 + radix_gfp_mask &= ~__GFP_FS;
  951 + }
  952 +
  953 + page = __page_cache_alloc(cache_gfp_mask);
957 954 if (!page)
958 955 return NULL;
959   - /*
960   - * We want a regular kernel memory (not highmem or DMA etc)
961   - * allocation for the radix tree nodes, but we need to honour
962   - * the context-specific requirements the caller has asked for.
963   - * GFP_RECLAIM_MASK collects those requirements.
964   - */
965   - err = add_to_page_cache_lru(page, mapping, index,
966   - (gfp_mask & GFP_RECLAIM_MASK));
  956 +
  957 + if (WARN_ON_ONCE(!(fgp_flags & FGP_LOCK)))
  958 + fgp_flags |= FGP_LOCK;
  959 +
  960 + /* Init accessed so avoid atomic mark_page_accessed later */
  961 + if (fgp_flags & FGP_ACCESSED)
  962 + init_page_accessed(page);
  963 +
  964 + err = add_to_page_cache_lru(page, mapping, offset, radix_gfp_mask);
967 965 if (unlikely(err)) {
968 966 page_cache_release(page);
969 967 page = NULL;
970 968  
... ... @@ -971,9 +969,10 @@
971 969 goto repeat;
972 970 }
973 971 }
  972 +
974 973 return page;
975 974 }
976   -EXPORT_SYMBOL(find_or_create_page);
  975 +EXPORT_SYMBOL(pagecache_get_page);
977 976  
978 977 /**
979 978 * find_get_entries - gang pagecache lookup
... ... @@ -1263,39 +1262,6 @@
1263 1262 }
1264 1263 EXPORT_SYMBOL(find_get_pages_tag);
1265 1264  
1266   -/**
1267   - * grab_cache_page_nowait - returns locked page at given index in given cache
1268   - * @mapping: target address_space
1269   - * @index: the page index
1270   - *
1271   - * Same as grab_cache_page(), but do not wait if the page is unavailable.
1272   - * This is intended for speculative data generators, where the data can
1273   - * be regenerated if the page couldn't be grabbed. This routine should
1274   - * be safe to call while holding the lock for another page.
1275   - *
1276   - * Clear __GFP_FS when allocating the page to avoid recursion into the fs
1277   - * and deadlock against the caller's locked page.
1278   - */
1279   -struct page *
1280   -grab_cache_page_nowait(struct address_space *mapping, pgoff_t index)
1281   -{
1282   - struct page *page = find_get_page(mapping, index);
1283   -
1284   - if (page) {
1285   - if (trylock_page(page))
1286   - return page;
1287   - page_cache_release(page);
1288   - return NULL;
1289   - }
1290   - page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~__GFP_FS);
1291   - if (page && add_to_page_cache_lru(page, mapping, index, GFP_NOFS)) {
1292   - page_cache_release(page);
1293   - page = NULL;
1294   - }
1295   - return page;
1296   -}
1297   -EXPORT_SYMBOL(grab_cache_page_nowait);
1298   -
1299 1265 /*
1300 1266 * CD/DVDs are error prone. When a medium error occurs, the driver may fail
1301 1267 * a _large_ part of the i/o request. Imagine the worst scenario:
... ... @@ -2399,7 +2365,6 @@
2399 2365 {
2400 2366 const struct address_space_operations *aops = mapping->a_ops;
2401 2367  
2402   - mark_page_accessed(page);
2403 2368 return aops->write_end(file, mapping, pos, len, copied, page, fsdata);
2404 2369 }
2405 2370 EXPORT_SYMBOL(pagecache_write_end);
... ... @@ -2481,34 +2446,18 @@
2481 2446 struct page *grab_cache_page_write_begin(struct address_space *mapping,
2482 2447 pgoff_t index, unsigned flags)
2483 2448 {
2484   - int status;
2485   - gfp_t gfp_mask;
2486 2449 struct page *page;
2487   - gfp_t gfp_notmask = 0;
  2450 + int fgp_flags = FGP_LOCK|FGP_ACCESSED|FGP_WRITE|FGP_CREAT;
2488 2451  
2489   - gfp_mask = mapping_gfp_mask(mapping);
2490   - if (mapping_cap_account_dirty(mapping))
2491   - gfp_mask |= __GFP_WRITE;
2492 2452 if (flags & AOP_FLAG_NOFS)
2493   - gfp_notmask = __GFP_FS;
2494   -repeat:
2495   - page = find_lock_page(mapping, index);
  2453 + fgp_flags |= FGP_NOFS;
  2454 +
  2455 + page = pagecache_get_page(mapping, index, fgp_flags,
  2456 + mapping_gfp_mask(mapping),
  2457 + GFP_KERNEL);
2496 2458 if (page)
2497   - goto found;
  2459 + wait_for_stable_page(page);
2498 2460  
2499   - page = __page_cache_alloc(gfp_mask & ~gfp_notmask);
2500   - if (!page)
2501   - return NULL;
2502   - status = add_to_page_cache_lru(page, mapping, index,
2503   - GFP_KERNEL & ~gfp_notmask);
2504   - if (unlikely(status)) {
2505   - page_cache_release(page);
2506   - if (status == -EEXIST)
2507   - goto repeat;
2508   - return NULL;
2509   - }
2510   -found:
2511   - wait_for_stable_page(page);
2512 2461 return page;
2513 2462 }
2514 2463 EXPORT_SYMBOL(grab_cache_page_write_begin);
... ... @@ -2557,7 +2506,7 @@
2557 2506  
2558 2507 status = a_ops->write_begin(file, mapping, pos, bytes, flags,
2559 2508 &page, &fsdata);
2560   - if (unlikely(status))
  2509 + if (unlikely(status < 0))
2561 2510 break;
2562 2511  
2563 2512 if (mapping_writably_mapped(mapping))
... ... @@ -2566,7 +2515,6 @@
2566 2515 copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
2567 2516 flush_dcache_page(page);
2568 2517  
2569   - mark_page_accessed(page);
2570 2518 status = a_ops->write_end(file, mapping, pos, bytes, copied,
2571 2519 page, fsdata);
2572 2520 if (unlikely(status < 0))
mm/shmem.c
... ... @@ -1440,9 +1440,13 @@
1440 1440 loff_t pos, unsigned len, unsigned flags,
1441 1441 struct page **pagep, void **fsdata)
1442 1442 {
  1443 + int ret;
1443 1444 struct inode *inode = mapping->host;
1444 1445 pgoff_t index = pos >> PAGE_CACHE_SHIFT;
1445   - return shmem_getpage(inode, index, pagep, SGP_WRITE, NULL);
  1446 + ret = shmem_getpage(inode, index, pagep, SGP_WRITE, NULL);
  1447 + if (ret == 0 && *pagep)
  1448 + init_page_accessed(*pagep);
  1449 + return ret;
1446 1450 }
1447 1451  
1448 1452 static int
mm/swap.c
... ... @@ -548,6 +548,17 @@
548 548 }
549 549 EXPORT_SYMBOL(mark_page_accessed);
550 550  
  551 +/*
  552 + * Used to mark_page_accessed(page) that is not visible yet and when it is
  553 + * still safe to use non-atomic ops
  554 + */
  555 +void init_page_accessed(struct page *page)
  556 +{
  557 + if (!PageReferenced(page))
  558 + __SetPageReferenced(page);
  559 +}
  560 +EXPORT_SYMBOL(init_page_accessed);
  561 +
551 562 static void __lru_cache_add(struct page *page)
552 563 {
553 564 struct pagevec *pvec = &get_cpu_var(lru_add_pvec);