Commit 579f82901f6f41256642936d7e632f3979ad76d4

Authored by Shaohua Li
Committed by Linus Torvalds
1 parent fb951eb5e1

swap: add a simple detector for inappropriate swapin readahead

This is a patch to improve the swap readahead algorithm.  It's from Hugh,
and I changed it slightly.

Hugh's original changelog:

swapin readahead does a blind readahead, whether or not the swapin is
sequential.  This may be ok on a harddisk, because large reads have
relatively small costs, and if the readahead pages are unneeded they can
be reclaimed easily - though, what if their allocation forced reclaim of
useful pages?  But on SSD devices large reads are more expensive than
small ones: if the readahead pages are unneeded, reading them in causes
significant overhead.

This patch adds very simplistic random read detection.  Stealing the
PageReadahead technique from Konstantin Khlebnikov's patch, avoiding the
vma/anon_vma sophistications of Shaohua Li's patch, swapin_nr_pages()
simply looks at readahead's current success rate, and narrows or widens
its readahead window accordingly.  There is little science to its
heuristic: it's about as stupid as can be whilst remaining effective.
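
As a rough illustration of the idea (a simplified sketch with made-up
names, not the swapin_nr_pages() this patch actually adds): count how
many recently readahead pages were later faulted in, widen the window
while that count stays positive, and fall back to single-page reads
when it drops to zero.

/*
 * Simplified sketch of a success-rate driven readahead window.
 * Names and constants are illustrative, not the patch's real code.
 */
static unsigned int readahead_hits;	/* readahead pages later faulted in */
static unsigned int last_window = 1;	/* pages read around the last fault */

static unsigned int swapin_window(unsigned int max_pages /* 1 << page_cluster */)
{
	unsigned int hits = readahead_hits;
	unsigned int window;

	readahead_hits = 0;
	if (hits == 0)
		window = 1;			/* looks random: stop reading ahead */
	else if (last_window < max_pages)
		window = last_window * 2;	/* readahead is paying off: widen */
	else
		window = max_pages;		/* already at the configured cap */

	last_window = window;
	return window;
}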

The table below shows elapsed times (in centiseconds) when running a
single repetitive swapping load across a 1000MB mapping in 900MB RAM
with 1GB swap (the harddisk tests had taken painfully long when I used
mem=500M, but SSD shows similar results for that).

Vanilla is the 3.6-rc7 kernel on which I started; Shaohua denotes his
Sep 3 patch in mmotm and linux-next; HughOld denotes my Oct 1 patch
which Shaohua showed to be defective; HughNew this Nov 14 patch, with
page_cluster as usual at default of 3 (8-page reads); HughPC4 this same
patch with page_cluster 4 (16-page reads); HughPC0 with page_cluster 0
(1-page reads: no readahead).

HDD for swapping to harddisk, SSD for swapping to VertexII SSD.  Seq for
sequential access to the mapping, cycling five times around; Rand for
the same number of random touches.  Anon for a MAP_PRIVATE anon mapping;
Shmem for a MAP_SHARED anon mapping, equivalent to tmpfs.
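
For concreteness, the kind of load described above can be approximated
by something like the following sketch (sizes, the 4K page size and the
randomizer are assumptions; this is not Hugh's actual test program):

#define _DEFAULT_SOURCE
#include <stdlib.h>
#include <sys/mman.h>

#define MAP_SIZE  (1000UL << 20)	/* 1000MB mapping, run in ~900MB of RAM */
#define PAGE_SZ   4096UL

int main(int argc, char **argv)
{
	unsigned long pages = MAP_SIZE / PAGE_SZ, i, pass;
	int randomize = argc > 1;	/* any argument selects the Rand variant */
	/* MAP_PRIVATE here is the "Anon" case; MAP_SHARED would be "Shmem" */
	char *map = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (map == MAP_FAILED)
		return 1;
	for (pass = 0; pass < 5; pass++)	/* cycle five times around */
		for (i = 0; i < pages; i++) {
			unsigned long pg = randomize ?
				(unsigned long)rand() % pages : i;
			map[pg * PAGE_SZ] ^= 1;	/* touch forces swapin/swapout */
		}
	return 0;
}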

One weakness of Shaohua's vma/anon_vma approach was that it did not
optimize Shmem, as seen below.  Konstantin's approach was perhaps
mistuned, 50% slower on Seq: it did not compete and is not shown below.

HDD        Vanilla Shaohua HughOld HughNew HughPC4 HughPC0
Seq Anon     73921   76210   75611   76904   78191  121542
Seq Shmem    73601   73176   73855   72947   74543  118322
Rand Anon   895392  831243  871569  845197  846496  841680
Rand Shmem 1058375 1053486  827935  764955  764376  756489

SSD        Vanilla Shaohua HughOld HughNew HughPC4 HughPC0
Seq Anon     24634   24198   24673   25107   21614   70018
Seq Shmem    24959   24932   25052   25703   22030   69678
Rand Anon    43014   26146   28075   25989   26935   25901
Rand Shmem   45349   45215   28249   24268   24138   24332

These tests are, of course, two extremes of a very simple case: under
heavier mixed loads I've not yet observed any consistent improvement or
degradation, and wider testing would be welcome.

Shaohua Li:

Testing shows Vanilla is slightly better than Hugh's patch in the
sequential workload.  I observed that with Hugh's patch the readahead
size sometimes shrinks too fast (from 8 to 1 immediately) in the
sequential workload if there is no hit, and in that case continuing to
do readahead is actually beneficial.

I didn't prepare a sophisticated algorithm for the sequential workload
because, so far, we can't guarantee that sequentially accessed pages are
swapped out sequentially.  So I slightly changed Hugh's heuristic: don't
shrink the readahead size too fast.
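
A rough illustration of that tweak, continuing the illustrative sketch
above (again, not the exact code this patch adds): rather than
collapsing straight back to a single page on a miss, the window can be
decayed more gently, for example halved:

	if (hits == 0)
		/* decay gradually instead of dropping from 8 to 1 at once */
		window = last_window > 1 ? last_window / 2 : 1;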

Here is my test result (unit: seconds, average of 3 runs):
	Vanilla		Hugh		New
Seq	356		370		360
Random	4525		2447		2444

The attached graph shows the swapin/swapout throughput I collected with
'vmstat 2'.  The first part is a random workload (until around 1200 on
the x-axis) and the second part is a sequential workload.  Swapin and
swapout throughput are almost identical in steady state in both
workloads, which is the expected behavior; in Vanilla, by contrast,
swapin is much higher than swapout, especially in the random workload
(because of wrong readahead).

Original patches by: Shaohua Li and Konstantin Khlebnikov.

[fengguang.wu@intel.com: swapin_nr_pages() can be static]
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Showing 2 changed files with 62 additions and 5 deletions
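
One note on the page-flags change below: given the TESTPAGEFLAG,
SETPAGEFLAG, CLEARPAGEFLAG and TESTCLEARFLAG macros defined in this
header, the new line "PAGEFLAG(Readahead, reclaim)
TESTCLEARFLAG(Readahead, reclaim)" expands to roughly the following
helpers, all operating on the PG_reclaim bit (whitespace adjusted):

static inline int PageReadahead(const struct page *page)
	{ return test_bit(PG_reclaim, &page->flags); }
static inline void SetPageReadahead(struct page *page)
	{ set_bit(PG_reclaim, &page->flags); }
static inline void ClearPageReadahead(struct page *page)
	{ clear_bit(PG_reclaim, &page->flags); }
static inline int TestClearPageReadahead(struct page *page)
	{ return test_and_clear_bit(PG_reclaim, &page->flags); }

TestClearPageReadahead() is what lookup_swap_cache() below uses to
credit a readahead hit.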

include/linux/page-flags.h
1 /* 1 /*
2 * Macros for manipulating and testing page->flags 2 * Macros for manipulating and testing page->flags
3 */ 3 */
4 4
5 #ifndef PAGE_FLAGS_H 5 #ifndef PAGE_FLAGS_H
6 #define PAGE_FLAGS_H 6 #define PAGE_FLAGS_H
7 7
8 #include <linux/types.h> 8 #include <linux/types.h>
9 #include <linux/bug.h> 9 #include <linux/bug.h>
10 #include <linux/mmdebug.h> 10 #include <linux/mmdebug.h>
11 #ifndef __GENERATING_BOUNDS_H 11 #ifndef __GENERATING_BOUNDS_H
12 #include <linux/mm_types.h> 12 #include <linux/mm_types.h>
13 #include <generated/bounds.h> 13 #include <generated/bounds.h>
14 #endif /* !__GENERATING_BOUNDS_H */ 14 #endif /* !__GENERATING_BOUNDS_H */
15 15
16 /* 16 /*
17 * Various page->flags bits: 17 * Various page->flags bits:
18 * 18 *
19 * PG_reserved is set for special pages, which can never be swapped out. Some 19 * PG_reserved is set for special pages, which can never be swapped out. Some
20 * of them might not even exist (eg empty_bad_page)... 20 * of them might not even exist (eg empty_bad_page)...
21 * 21 *
22 * The PG_private bitflag is set on pagecache pages if they contain filesystem 22 * The PG_private bitflag is set on pagecache pages if they contain filesystem
23 * specific data (which is normally at page->private). It can be used by 23 * specific data (which is normally at page->private). It can be used by
24 * private allocations for its own usage. 24 * private allocations for its own usage.
25 * 25 *
26 * During initiation of disk I/O, PG_locked is set. This bit is set before I/O 26 * During initiation of disk I/O, PG_locked is set. This bit is set before I/O
27 * and cleared when writeback _starts_ or when read _completes_. PG_writeback 27 * and cleared when writeback _starts_ or when read _completes_. PG_writeback
28 * is set before writeback starts and cleared when it finishes. 28 * is set before writeback starts and cleared when it finishes.
29 * 29 *
30 * PG_locked also pins a page in pagecache, and blocks truncation of the file 30 * PG_locked also pins a page in pagecache, and blocks truncation of the file
31 * while it is held. 31 * while it is held.
32 * 32 *
33 * page_waitqueue(page) is a wait queue of all tasks waiting for the page 33 * page_waitqueue(page) is a wait queue of all tasks waiting for the page
34 * to become unlocked. 34 * to become unlocked.
35 * 35 *
36 * PG_uptodate tells whether the page's contents is valid. When a read 36 * PG_uptodate tells whether the page's contents is valid. When a read
37 * completes, the page becomes uptodate, unless a disk I/O error happened. 37 * completes, the page becomes uptodate, unless a disk I/O error happened.
38 * 38 *
39 * PG_referenced, PG_reclaim are used for page reclaim for anonymous and 39 * PG_referenced, PG_reclaim are used for page reclaim for anonymous and
40 * file-backed pagecache (see mm/vmscan.c). 40 * file-backed pagecache (see mm/vmscan.c).
41 * 41 *
42 * PG_error is set to indicate that an I/O error occurred on this page. 42 * PG_error is set to indicate that an I/O error occurred on this page.
43 * 43 *
44 * PG_arch_1 is an architecture specific page state bit. The generic code 44 * PG_arch_1 is an architecture specific page state bit. The generic code
45 * guarantees that this bit is cleared for a page when it first is entered into 45 * guarantees that this bit is cleared for a page when it first is entered into
46 * the page cache. 46 * the page cache.
47 * 47 *
48 * PG_highmem pages are not permanently mapped into the kernel virtual address 48 * PG_highmem pages are not permanently mapped into the kernel virtual address
49 * space, they need to be kmapped separately for doing IO on the pages. The 49 * space, they need to be kmapped separately for doing IO on the pages. The
50 * struct page (these bits with information) are always mapped into kernel 50 * struct page (these bits with information) are always mapped into kernel
51 * address space... 51 * address space...
52 * 52 *
53 * PG_hwpoison indicates that a page got corrupted in hardware and contains 53 * PG_hwpoison indicates that a page got corrupted in hardware and contains
54 * data with incorrect ECC bits that triggered a machine check. Accessing is 54 * data with incorrect ECC bits that triggered a machine check. Accessing is
55 * not safe since it may cause another machine check. Don't touch! 55 * not safe since it may cause another machine check. Don't touch!
56 */ 56 */
57 57
58 /* 58 /*
59 * Don't use the *_dontuse flags. Use the macros. Otherwise you'll break 59 * Don't use the *_dontuse flags. Use the macros. Otherwise you'll break
60 * locked- and dirty-page accounting. 60 * locked- and dirty-page accounting.
61 * 61 *
62 * The page flags field is split into two parts, the main flags area 62 * The page flags field is split into two parts, the main flags area
63 * which extends from the low bits upwards, and the fields area which 63 * which extends from the low bits upwards, and the fields area which
64 * extends from the high bits downwards. 64 * extends from the high bits downwards.
65 * 65 *
66 * | FIELD | ... | FLAGS | 66 * | FIELD | ... | FLAGS |
67 * N-1 ^ 0 67 * N-1 ^ 0
68 * (NR_PAGEFLAGS) 68 * (NR_PAGEFLAGS)
69 * 69 *
70 * The fields area is reserved for fields mapping zone, node (for NUMA) and 70 * The fields area is reserved for fields mapping zone, node (for NUMA) and
71 * SPARSEMEM section (for variants of SPARSEMEM that require section ids like 71 * SPARSEMEM section (for variants of SPARSEMEM that require section ids like
72 * SPARSEMEM_EXTREME with !SPARSEMEM_VMEMMAP). 72 * SPARSEMEM_EXTREME with !SPARSEMEM_VMEMMAP).
73 */ 73 */
74 enum pageflags { 74 enum pageflags {
75 PG_locked, /* Page is locked. Don't touch. */ 75 PG_locked, /* Page is locked. Don't touch. */
76 PG_error, 76 PG_error,
77 PG_referenced, 77 PG_referenced,
78 PG_uptodate, 78 PG_uptodate,
79 PG_dirty, 79 PG_dirty,
80 PG_lru, 80 PG_lru,
81 PG_active, 81 PG_active,
82 PG_slab, 82 PG_slab,
83 PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/ 83 PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/
84 PG_arch_1, 84 PG_arch_1,
85 PG_reserved, 85 PG_reserved,
86 PG_private, /* If pagecache, has fs-private data */ 86 PG_private, /* If pagecache, has fs-private data */
87 PG_private_2, /* If pagecache, has fs aux data */ 87 PG_private_2, /* If pagecache, has fs aux data */
88 PG_writeback, /* Page is under writeback */ 88 PG_writeback, /* Page is under writeback */
89 #ifdef CONFIG_PAGEFLAGS_EXTENDED 89 #ifdef CONFIG_PAGEFLAGS_EXTENDED
90 PG_head, /* A head page */ 90 PG_head, /* A head page */
91 PG_tail, /* A tail page */ 91 PG_tail, /* A tail page */
92 #else 92 #else
93 PG_compound, /* A compound page */ 93 PG_compound, /* A compound page */
94 #endif 94 #endif
95 PG_swapcache, /* Swap page: swp_entry_t in private */ 95 PG_swapcache, /* Swap page: swp_entry_t in private */
96 PG_mappedtodisk, /* Has blocks allocated on-disk */ 96 PG_mappedtodisk, /* Has blocks allocated on-disk */
97 PG_reclaim, /* To be reclaimed asap */ 97 PG_reclaim, /* To be reclaimed asap */
98 PG_swapbacked, /* Page is backed by RAM/swap */ 98 PG_swapbacked, /* Page is backed by RAM/swap */
99 PG_unevictable, /* Page is "unevictable" */ 99 PG_unevictable, /* Page is "unevictable" */
100 #ifdef CONFIG_MMU 100 #ifdef CONFIG_MMU
101 PG_mlocked, /* Page is vma mlocked */ 101 PG_mlocked, /* Page is vma mlocked */
102 #endif 102 #endif
103 #ifdef CONFIG_ARCH_USES_PG_UNCACHED 103 #ifdef CONFIG_ARCH_USES_PG_UNCACHED
104 PG_uncached, /* Page has been mapped as uncached */ 104 PG_uncached, /* Page has been mapped as uncached */
105 #endif 105 #endif
106 #ifdef CONFIG_MEMORY_FAILURE 106 #ifdef CONFIG_MEMORY_FAILURE
107 PG_hwpoison, /* hardware poisoned page. Don't touch */ 107 PG_hwpoison, /* hardware poisoned page. Don't touch */
108 #endif 108 #endif
109 #ifdef CONFIG_TRANSPARENT_HUGEPAGE 109 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
110 PG_compound_lock, 110 PG_compound_lock,
111 #endif 111 #endif
112 __NR_PAGEFLAGS, 112 __NR_PAGEFLAGS,
113 113
114 /* Filesystems */ 114 /* Filesystems */
115 PG_checked = PG_owner_priv_1, 115 PG_checked = PG_owner_priv_1,
116 116
117 /* Two page bits are conscripted by FS-Cache to maintain local caching 117 /* Two page bits are conscripted by FS-Cache to maintain local caching
118 * state. These bits are set on pages belonging to the netfs's inodes 118 * state. These bits are set on pages belonging to the netfs's inodes
119 * when those inodes are being locally cached. 119 * when those inodes are being locally cached.
120 */ 120 */
121 PG_fscache = PG_private_2, /* page backed by cache */ 121 PG_fscache = PG_private_2, /* page backed by cache */
122 122
123 /* XEN */ 123 /* XEN */
124 PG_pinned = PG_owner_priv_1, 124 PG_pinned = PG_owner_priv_1,
125 PG_savepinned = PG_dirty, 125 PG_savepinned = PG_dirty,
126 126
127 /* SLOB */ 127 /* SLOB */
128 PG_slob_free = PG_private, 128 PG_slob_free = PG_private,
129 }; 129 };
130 130
131 #ifndef __GENERATING_BOUNDS_H 131 #ifndef __GENERATING_BOUNDS_H
132 132
133 /* 133 /*
134 * Macros to create function definitions for page flags 134 * Macros to create function definitions for page flags
135 */ 135 */
136 #define TESTPAGEFLAG(uname, lname) \ 136 #define TESTPAGEFLAG(uname, lname) \
137 static inline int Page##uname(const struct page *page) \ 137 static inline int Page##uname(const struct page *page) \
138 { return test_bit(PG_##lname, &page->flags); } 138 { return test_bit(PG_##lname, &page->flags); }
139 139
140 #define SETPAGEFLAG(uname, lname) \ 140 #define SETPAGEFLAG(uname, lname) \
141 static inline void SetPage##uname(struct page *page) \ 141 static inline void SetPage##uname(struct page *page) \
142 { set_bit(PG_##lname, &page->flags); } 142 { set_bit(PG_##lname, &page->flags); }
143 143
144 #define CLEARPAGEFLAG(uname, lname) \ 144 #define CLEARPAGEFLAG(uname, lname) \
145 static inline void ClearPage##uname(struct page *page) \ 145 static inline void ClearPage##uname(struct page *page) \
146 { clear_bit(PG_##lname, &page->flags); } 146 { clear_bit(PG_##lname, &page->flags); }
147 147
148 #define __SETPAGEFLAG(uname, lname) \ 148 #define __SETPAGEFLAG(uname, lname) \
149 static inline void __SetPage##uname(struct page *page) \ 149 static inline void __SetPage##uname(struct page *page) \
150 { __set_bit(PG_##lname, &page->flags); } 150 { __set_bit(PG_##lname, &page->flags); }
151 151
152 #define __CLEARPAGEFLAG(uname, lname) \ 152 #define __CLEARPAGEFLAG(uname, lname) \
153 static inline void __ClearPage##uname(struct page *page) \ 153 static inline void __ClearPage##uname(struct page *page) \
154 { __clear_bit(PG_##lname, &page->flags); } 154 { __clear_bit(PG_##lname, &page->flags); }
155 155
156 #define TESTSETFLAG(uname, lname) \ 156 #define TESTSETFLAG(uname, lname) \
157 static inline int TestSetPage##uname(struct page *page) \ 157 static inline int TestSetPage##uname(struct page *page) \
158 { return test_and_set_bit(PG_##lname, &page->flags); } 158 { return test_and_set_bit(PG_##lname, &page->flags); }
159 159
160 #define TESTCLEARFLAG(uname, lname) \ 160 #define TESTCLEARFLAG(uname, lname) \
161 static inline int TestClearPage##uname(struct page *page) \ 161 static inline int TestClearPage##uname(struct page *page) \
162 { return test_and_clear_bit(PG_##lname, &page->flags); } 162 { return test_and_clear_bit(PG_##lname, &page->flags); }
163 163
164 #define __TESTCLEARFLAG(uname, lname) \ 164 #define __TESTCLEARFLAG(uname, lname) \
165 static inline int __TestClearPage##uname(struct page *page) \ 165 static inline int __TestClearPage##uname(struct page *page) \
166 { return __test_and_clear_bit(PG_##lname, &page->flags); } 166 { return __test_and_clear_bit(PG_##lname, &page->flags); }
167 167
168 #define PAGEFLAG(uname, lname) TESTPAGEFLAG(uname, lname) \ 168 #define PAGEFLAG(uname, lname) TESTPAGEFLAG(uname, lname) \
169 SETPAGEFLAG(uname, lname) CLEARPAGEFLAG(uname, lname) 169 SETPAGEFLAG(uname, lname) CLEARPAGEFLAG(uname, lname)
170 170
171 #define __PAGEFLAG(uname, lname) TESTPAGEFLAG(uname, lname) \ 171 #define __PAGEFLAG(uname, lname) TESTPAGEFLAG(uname, lname) \
172 __SETPAGEFLAG(uname, lname) __CLEARPAGEFLAG(uname, lname) 172 __SETPAGEFLAG(uname, lname) __CLEARPAGEFLAG(uname, lname)
173 173
174 #define PAGEFLAG_FALSE(uname) \ 174 #define PAGEFLAG_FALSE(uname) \
175 static inline int Page##uname(const struct page *page) \ 175 static inline int Page##uname(const struct page *page) \
176 { return 0; } 176 { return 0; }
177 177
178 #define TESTSCFLAG(uname, lname) \ 178 #define TESTSCFLAG(uname, lname) \
179 TESTSETFLAG(uname, lname) TESTCLEARFLAG(uname, lname) 179 TESTSETFLAG(uname, lname) TESTCLEARFLAG(uname, lname)
180 180
181 #define SETPAGEFLAG_NOOP(uname) \ 181 #define SETPAGEFLAG_NOOP(uname) \
182 static inline void SetPage##uname(struct page *page) { } 182 static inline void SetPage##uname(struct page *page) { }
183 183
184 #define CLEARPAGEFLAG_NOOP(uname) \ 184 #define CLEARPAGEFLAG_NOOP(uname) \
185 static inline void ClearPage##uname(struct page *page) { } 185 static inline void ClearPage##uname(struct page *page) { }
186 186
187 #define __CLEARPAGEFLAG_NOOP(uname) \ 187 #define __CLEARPAGEFLAG_NOOP(uname) \
188 static inline void __ClearPage##uname(struct page *page) { } 188 static inline void __ClearPage##uname(struct page *page) { }
189 189
190 #define TESTCLEARFLAG_FALSE(uname) \ 190 #define TESTCLEARFLAG_FALSE(uname) \
191 static inline int TestClearPage##uname(struct page *page) { return 0; } 191 static inline int TestClearPage##uname(struct page *page) { return 0; }
192 192
193 #define __TESTCLEARFLAG_FALSE(uname) \ 193 #define __TESTCLEARFLAG_FALSE(uname) \
194 static inline int __TestClearPage##uname(struct page *page) { return 0; } 194 static inline int __TestClearPage##uname(struct page *page) { return 0; }
195 195
196 struct page; /* forward declaration */ 196 struct page; /* forward declaration */
197 197
198 TESTPAGEFLAG(Locked, locked) 198 TESTPAGEFLAG(Locked, locked)
199 PAGEFLAG(Error, error) TESTCLEARFLAG(Error, error) 199 PAGEFLAG(Error, error) TESTCLEARFLAG(Error, error)
200 PAGEFLAG(Referenced, referenced) TESTCLEARFLAG(Referenced, referenced) 200 PAGEFLAG(Referenced, referenced) TESTCLEARFLAG(Referenced, referenced)
201 PAGEFLAG(Dirty, dirty) TESTSCFLAG(Dirty, dirty) __CLEARPAGEFLAG(Dirty, dirty) 201 PAGEFLAG(Dirty, dirty) TESTSCFLAG(Dirty, dirty) __CLEARPAGEFLAG(Dirty, dirty)
202 PAGEFLAG(LRU, lru) __CLEARPAGEFLAG(LRU, lru) 202 PAGEFLAG(LRU, lru) __CLEARPAGEFLAG(LRU, lru)
203 PAGEFLAG(Active, active) __CLEARPAGEFLAG(Active, active) 203 PAGEFLAG(Active, active) __CLEARPAGEFLAG(Active, active)
204 TESTCLEARFLAG(Active, active) 204 TESTCLEARFLAG(Active, active)
205 __PAGEFLAG(Slab, slab) 205 __PAGEFLAG(Slab, slab)
206 PAGEFLAG(Checked, checked) /* Used by some filesystems */ 206 PAGEFLAG(Checked, checked) /* Used by some filesystems */
207 PAGEFLAG(Pinned, pinned) TESTSCFLAG(Pinned, pinned) /* Xen */ 207 PAGEFLAG(Pinned, pinned) TESTSCFLAG(Pinned, pinned) /* Xen */
208 PAGEFLAG(SavePinned, savepinned); /* Xen */ 208 PAGEFLAG(SavePinned, savepinned); /* Xen */
209 PAGEFLAG(Reserved, reserved) __CLEARPAGEFLAG(Reserved, reserved) 209 PAGEFLAG(Reserved, reserved) __CLEARPAGEFLAG(Reserved, reserved)
210 PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked) 210 PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked)
211 211
212 __PAGEFLAG(SlobFree, slob_free) 212 __PAGEFLAG(SlobFree, slob_free)
213 213
214 /* 214 /*
215 * Private page markings that may be used by the filesystem that owns the page 215 * Private page markings that may be used by the filesystem that owns the page
216 * for its own purposes. 216 * for its own purposes.
217 * - PG_private and PG_private_2 cause releasepage() and co to be invoked 217 * - PG_private and PG_private_2 cause releasepage() and co to be invoked
218 */ 218 */
219 PAGEFLAG(Private, private) __SETPAGEFLAG(Private, private) 219 PAGEFLAG(Private, private) __SETPAGEFLAG(Private, private)
220 __CLEARPAGEFLAG(Private, private) 220 __CLEARPAGEFLAG(Private, private)
221 PAGEFLAG(Private2, private_2) TESTSCFLAG(Private2, private_2) 221 PAGEFLAG(Private2, private_2) TESTSCFLAG(Private2, private_2)
222 PAGEFLAG(OwnerPriv1, owner_priv_1) TESTCLEARFLAG(OwnerPriv1, owner_priv_1) 222 PAGEFLAG(OwnerPriv1, owner_priv_1) TESTCLEARFLAG(OwnerPriv1, owner_priv_1)
223 223
224 /* 224 /*
225 * Only test-and-set exist for PG_writeback. The unconditional operators are 225 * Only test-and-set exist for PG_writeback. The unconditional operators are
226 * risky: they bypass page accounting. 226 * risky: they bypass page accounting.
227 */ 227 */
228 TESTPAGEFLAG(Writeback, writeback) TESTSCFLAG(Writeback, writeback) 228 TESTPAGEFLAG(Writeback, writeback) TESTSCFLAG(Writeback, writeback)
229 PAGEFLAG(MappedToDisk, mappedtodisk) 229 PAGEFLAG(MappedToDisk, mappedtodisk)
230 230
-231 /* PG_readahead is only used for file reads; PG_reclaim is only for writes */
+231 /* PG_readahead is only used for reads; PG_reclaim is only for writes */
232 PAGEFLAG(Reclaim, reclaim) TESTCLEARFLAG(Reclaim, reclaim) 232 PAGEFLAG(Reclaim, reclaim) TESTCLEARFLAG(Reclaim, reclaim)
-233 PAGEFLAG(Readahead, reclaim) /* Reminder to do async read-ahead */
+233 PAGEFLAG(Readahead, reclaim) TESTCLEARFLAG(Readahead, reclaim)
234 234
235 #ifdef CONFIG_HIGHMEM 235 #ifdef CONFIG_HIGHMEM
236 /* 236 /*
237 * Must use a macro here due to header dependency issues. page_zone() is not 237 * Must use a macro here due to header dependency issues. page_zone() is not
238 * available at this point. 238 * available at this point.
239 */ 239 */
240 #define PageHighMem(__p) is_highmem(page_zone(__p)) 240 #define PageHighMem(__p) is_highmem(page_zone(__p))
241 #else 241 #else
242 PAGEFLAG_FALSE(HighMem) 242 PAGEFLAG_FALSE(HighMem)
243 #endif 243 #endif
244 244
245 #ifdef CONFIG_SWAP 245 #ifdef CONFIG_SWAP
246 PAGEFLAG(SwapCache, swapcache) 246 PAGEFLAG(SwapCache, swapcache)
247 #else 247 #else
248 PAGEFLAG_FALSE(SwapCache) 248 PAGEFLAG_FALSE(SwapCache)
249 SETPAGEFLAG_NOOP(SwapCache) CLEARPAGEFLAG_NOOP(SwapCache) 249 SETPAGEFLAG_NOOP(SwapCache) CLEARPAGEFLAG_NOOP(SwapCache)
250 #endif 250 #endif
251 251
252 PAGEFLAG(Unevictable, unevictable) __CLEARPAGEFLAG(Unevictable, unevictable) 252 PAGEFLAG(Unevictable, unevictable) __CLEARPAGEFLAG(Unevictable, unevictable)
253 TESTCLEARFLAG(Unevictable, unevictable) 253 TESTCLEARFLAG(Unevictable, unevictable)
254 254
255 #ifdef CONFIG_MMU 255 #ifdef CONFIG_MMU
256 PAGEFLAG(Mlocked, mlocked) __CLEARPAGEFLAG(Mlocked, mlocked) 256 PAGEFLAG(Mlocked, mlocked) __CLEARPAGEFLAG(Mlocked, mlocked)
257 TESTSCFLAG(Mlocked, mlocked) __TESTCLEARFLAG(Mlocked, mlocked) 257 TESTSCFLAG(Mlocked, mlocked) __TESTCLEARFLAG(Mlocked, mlocked)
258 #else 258 #else
259 PAGEFLAG_FALSE(Mlocked) SETPAGEFLAG_NOOP(Mlocked) 259 PAGEFLAG_FALSE(Mlocked) SETPAGEFLAG_NOOP(Mlocked)
260 TESTCLEARFLAG_FALSE(Mlocked) __TESTCLEARFLAG_FALSE(Mlocked) 260 TESTCLEARFLAG_FALSE(Mlocked) __TESTCLEARFLAG_FALSE(Mlocked)
261 #endif 261 #endif
262 262
263 #ifdef CONFIG_ARCH_USES_PG_UNCACHED 263 #ifdef CONFIG_ARCH_USES_PG_UNCACHED
264 PAGEFLAG(Uncached, uncached) 264 PAGEFLAG(Uncached, uncached)
265 #else 265 #else
266 PAGEFLAG_FALSE(Uncached) 266 PAGEFLAG_FALSE(Uncached)
267 #endif 267 #endif
268 268
269 #ifdef CONFIG_MEMORY_FAILURE 269 #ifdef CONFIG_MEMORY_FAILURE
270 PAGEFLAG(HWPoison, hwpoison) 270 PAGEFLAG(HWPoison, hwpoison)
271 TESTSCFLAG(HWPoison, hwpoison) 271 TESTSCFLAG(HWPoison, hwpoison)
272 #define __PG_HWPOISON (1UL << PG_hwpoison) 272 #define __PG_HWPOISON (1UL << PG_hwpoison)
273 #else 273 #else
274 PAGEFLAG_FALSE(HWPoison) 274 PAGEFLAG_FALSE(HWPoison)
275 #define __PG_HWPOISON 0 275 #define __PG_HWPOISON 0
276 #endif 276 #endif
277 277
278 u64 stable_page_flags(struct page *page); 278 u64 stable_page_flags(struct page *page);
279 279
280 static inline int PageUptodate(struct page *page) 280 static inline int PageUptodate(struct page *page)
281 { 281 {
282 int ret = test_bit(PG_uptodate, &(page)->flags); 282 int ret = test_bit(PG_uptodate, &(page)->flags);
283 283
284 /* 284 /*
285 * Must ensure that the data we read out of the page is loaded 285 * Must ensure that the data we read out of the page is loaded
286 * _after_ we've loaded page->flags to check for PageUptodate. 286 * _after_ we've loaded page->flags to check for PageUptodate.
287 * We can skip the barrier if the page is not uptodate, because 287 * We can skip the barrier if the page is not uptodate, because
288 * we wouldn't be reading anything from it. 288 * we wouldn't be reading anything from it.
289 * 289 *
290 * See SetPageUptodate() for the other side of the story. 290 * See SetPageUptodate() for the other side of the story.
291 */ 291 */
292 if (ret) 292 if (ret)
293 smp_rmb(); 293 smp_rmb();
294 294
295 return ret; 295 return ret;
296 } 296 }
297 297
298 static inline void __SetPageUptodate(struct page *page) 298 static inline void __SetPageUptodate(struct page *page)
299 { 299 {
300 smp_wmb(); 300 smp_wmb();
301 __set_bit(PG_uptodate, &(page)->flags); 301 __set_bit(PG_uptodate, &(page)->flags);
302 } 302 }
303 303
304 static inline void SetPageUptodate(struct page *page) 304 static inline void SetPageUptodate(struct page *page)
305 { 305 {
306 /* 306 /*
307 * Memory barrier must be issued before setting the PG_uptodate bit, 307 * Memory barrier must be issued before setting the PG_uptodate bit,
308 * so that all previous stores issued in order to bring the page 308 * so that all previous stores issued in order to bring the page
309 * uptodate are actually visible before PageUptodate becomes true. 309 * uptodate are actually visible before PageUptodate becomes true.
310 */ 310 */
311 smp_wmb(); 311 smp_wmb();
312 set_bit(PG_uptodate, &(page)->flags); 312 set_bit(PG_uptodate, &(page)->flags);
313 } 313 }
314 314
315 CLEARPAGEFLAG(Uptodate, uptodate) 315 CLEARPAGEFLAG(Uptodate, uptodate)
316 316
317 extern void cancel_dirty_page(struct page *page, unsigned int account_size); 317 extern void cancel_dirty_page(struct page *page, unsigned int account_size);
318 318
319 int test_clear_page_writeback(struct page *page); 319 int test_clear_page_writeback(struct page *page);
320 int test_set_page_writeback(struct page *page); 320 int test_set_page_writeback(struct page *page);
321 321
322 static inline void set_page_writeback(struct page *page) 322 static inline void set_page_writeback(struct page *page)
323 { 323 {
324 test_set_page_writeback(page); 324 test_set_page_writeback(page);
325 } 325 }
326 326
327 #ifdef CONFIG_PAGEFLAGS_EXTENDED 327 #ifdef CONFIG_PAGEFLAGS_EXTENDED
328 /* 328 /*
329 * System with lots of page flags available. This allows separate 329 * System with lots of page flags available. This allows separate
330 * flags for PageHead() and PageTail() checks of compound pages so that bit 330 * flags for PageHead() and PageTail() checks of compound pages so that bit
331 * tests can be used in performance sensitive paths. PageCompound is 331 * tests can be used in performance sensitive paths. PageCompound is
332 * generally not used in hot code paths except arch/powerpc/mm/init_64.c 332 * generally not used in hot code paths except arch/powerpc/mm/init_64.c
333 * and arch/powerpc/kvm/book3s_64_vio_hv.c which use it to detect huge pages 333 * and arch/powerpc/kvm/book3s_64_vio_hv.c which use it to detect huge pages
334 * and avoid handling those in real mode. 334 * and avoid handling those in real mode.
335 */ 335 */
336 __PAGEFLAG(Head, head) CLEARPAGEFLAG(Head, head) 336 __PAGEFLAG(Head, head) CLEARPAGEFLAG(Head, head)
337 __PAGEFLAG(Tail, tail) 337 __PAGEFLAG(Tail, tail)
338 338
339 static inline int PageCompound(struct page *page) 339 static inline int PageCompound(struct page *page)
340 { 340 {
341 return page->flags & ((1L << PG_head) | (1L << PG_tail)); 341 return page->flags & ((1L << PG_head) | (1L << PG_tail));
342 342
343 } 343 }
344 #ifdef CONFIG_TRANSPARENT_HUGEPAGE 344 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
345 static inline void ClearPageCompound(struct page *page) 345 static inline void ClearPageCompound(struct page *page)
346 { 346 {
347 BUG_ON(!PageHead(page)); 347 BUG_ON(!PageHead(page));
348 ClearPageHead(page); 348 ClearPageHead(page);
349 } 349 }
350 #endif 350 #endif
351 #else 351 #else
352 /* 352 /*
353 * Reduce page flag use as much as possible by overlapping 353 * Reduce page flag use as much as possible by overlapping
354 * compound page flags with the flags used for page cache pages. Possible 354 * compound page flags with the flags used for page cache pages. Possible
355 * because PageCompound is always set for compound pages and not for 355 * because PageCompound is always set for compound pages and not for
356 * pages on the LRU and/or pagecache. 356 * pages on the LRU and/or pagecache.
357 */ 357 */
358 TESTPAGEFLAG(Compound, compound) 358 TESTPAGEFLAG(Compound, compound)
359 __SETPAGEFLAG(Head, compound) __CLEARPAGEFLAG(Head, compound) 359 __SETPAGEFLAG(Head, compound) __CLEARPAGEFLAG(Head, compound)
360 360
361 /* 361 /*
362 * PG_reclaim is used in combination with PG_compound to mark the 362 * PG_reclaim is used in combination with PG_compound to mark the
363 * head and tail of a compound page. This saves one page flag 363 * head and tail of a compound page. This saves one page flag
364 * but makes it impossible to use compound pages for the page cache. 364 * but makes it impossible to use compound pages for the page cache.
365 * The PG_reclaim bit would have to be used for reclaim or readahead 365 * The PG_reclaim bit would have to be used for reclaim or readahead
366 * if compound pages enter the page cache. 366 * if compound pages enter the page cache.
367 * 367 *
368 * PG_compound & PG_reclaim => Tail page 368 * PG_compound & PG_reclaim => Tail page
369 * PG_compound & ~PG_reclaim => Head page 369 * PG_compound & ~PG_reclaim => Head page
370 */ 370 */
371 #define PG_head_mask ((1L << PG_compound)) 371 #define PG_head_mask ((1L << PG_compound))
372 #define PG_head_tail_mask ((1L << PG_compound) | (1L << PG_reclaim)) 372 #define PG_head_tail_mask ((1L << PG_compound) | (1L << PG_reclaim))
373 373
374 static inline int PageHead(struct page *page) 374 static inline int PageHead(struct page *page)
375 { 375 {
376 return ((page->flags & PG_head_tail_mask) == PG_head_mask); 376 return ((page->flags & PG_head_tail_mask) == PG_head_mask);
377 } 377 }
378 378
379 static inline int PageTail(struct page *page) 379 static inline int PageTail(struct page *page)
380 { 380 {
381 return ((page->flags & PG_head_tail_mask) == PG_head_tail_mask); 381 return ((page->flags & PG_head_tail_mask) == PG_head_tail_mask);
382 } 382 }
383 383
384 static inline void __SetPageTail(struct page *page) 384 static inline void __SetPageTail(struct page *page)
385 { 385 {
386 page->flags |= PG_head_tail_mask; 386 page->flags |= PG_head_tail_mask;
387 } 387 }
388 388
389 static inline void __ClearPageTail(struct page *page) 389 static inline void __ClearPageTail(struct page *page)
390 { 390 {
391 page->flags &= ~PG_head_tail_mask; 391 page->flags &= ~PG_head_tail_mask;
392 } 392 }
393 393
394 #ifdef CONFIG_TRANSPARENT_HUGEPAGE 394 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
395 static inline void ClearPageCompound(struct page *page) 395 static inline void ClearPageCompound(struct page *page)
396 { 396 {
397 BUG_ON((page->flags & PG_head_tail_mask) != (1 << PG_compound)); 397 BUG_ON((page->flags & PG_head_tail_mask) != (1 << PG_compound));
398 clear_bit(PG_compound, &page->flags); 398 clear_bit(PG_compound, &page->flags);
399 } 399 }
400 #endif 400 #endif
401 401
402 #endif /* !PAGEFLAGS_EXTENDED */ 402 #endif /* !PAGEFLAGS_EXTENDED */
403 403
404 #ifdef CONFIG_TRANSPARENT_HUGEPAGE 404 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
405 /* 405 /*
406 * PageHuge() only returns true for hugetlbfs pages, but not for 406 * PageHuge() only returns true for hugetlbfs pages, but not for
407 * normal or transparent huge pages. 407 * normal or transparent huge pages.
408 * 408 *
409 * PageTransHuge() returns true for both transparent huge and 409 * PageTransHuge() returns true for both transparent huge and
410 * hugetlbfs pages, but not normal pages. PageTransHuge() can only be 410 * hugetlbfs pages, but not normal pages. PageTransHuge() can only be
411 * called only in the core VM paths where hugetlbfs pages can't exist. 411 * called only in the core VM paths where hugetlbfs pages can't exist.
412 */ 412 */
413 static inline int PageTransHuge(struct page *page) 413 static inline int PageTransHuge(struct page *page)
414 { 414 {
415 VM_BUG_ON_PAGE(PageTail(page), page); 415 VM_BUG_ON_PAGE(PageTail(page), page);
416 return PageHead(page); 416 return PageHead(page);
417 } 417 }
418 418
419 /* 419 /*
420 * PageTransCompound returns true for both transparent huge pages 420 * PageTransCompound returns true for both transparent huge pages
421 * and hugetlbfs pages, so it should only be called when it's known 421 * and hugetlbfs pages, so it should only be called when it's known
422 * that hugetlbfs pages aren't involved. 422 * that hugetlbfs pages aren't involved.
423 */ 423 */
424 static inline int PageTransCompound(struct page *page) 424 static inline int PageTransCompound(struct page *page)
425 { 425 {
426 return PageCompound(page); 426 return PageCompound(page);
427 } 427 }
428 428
429 /* 429 /*
430 * PageTransTail returns true for both transparent huge pages 430 * PageTransTail returns true for both transparent huge pages
431 * and hugetlbfs pages, so it should only be called when it's known 431 * and hugetlbfs pages, so it should only be called when it's known
432 * that hugetlbfs pages aren't involved. 432 * that hugetlbfs pages aren't involved.
433 */ 433 */
434 static inline int PageTransTail(struct page *page) 434 static inline int PageTransTail(struct page *page)
435 { 435 {
436 return PageTail(page); 436 return PageTail(page);
437 } 437 }
438 438
439 #else 439 #else
440 440
441 static inline int PageTransHuge(struct page *page) 441 static inline int PageTransHuge(struct page *page)
442 { 442 {
443 return 0; 443 return 0;
444 } 444 }
445 445
446 static inline int PageTransCompound(struct page *page) 446 static inline int PageTransCompound(struct page *page)
447 { 447 {
448 return 0; 448 return 0;
449 } 449 }
450 450
451 static inline int PageTransTail(struct page *page) 451 static inline int PageTransTail(struct page *page)
452 { 452 {
453 return 0; 453 return 0;
454 } 454 }
455 #endif 455 #endif
456 456
457 /* 457 /*
458 * If network-based swap is enabled, sl*b must keep track of whether pages 458 * If network-based swap is enabled, sl*b must keep track of whether pages
459 * were allocated from pfmemalloc reserves. 459 * were allocated from pfmemalloc reserves.
460 */ 460 */
461 static inline int PageSlabPfmemalloc(struct page *page) 461 static inline int PageSlabPfmemalloc(struct page *page)
462 { 462 {
463 VM_BUG_ON_PAGE(!PageSlab(page), page); 463 VM_BUG_ON_PAGE(!PageSlab(page), page);
464 return PageActive(page); 464 return PageActive(page);
465 } 465 }
466 466
467 static inline void SetPageSlabPfmemalloc(struct page *page) 467 static inline void SetPageSlabPfmemalloc(struct page *page)
468 { 468 {
469 VM_BUG_ON_PAGE(!PageSlab(page), page); 469 VM_BUG_ON_PAGE(!PageSlab(page), page);
470 SetPageActive(page); 470 SetPageActive(page);
471 } 471 }
472 472
473 static inline void __ClearPageSlabPfmemalloc(struct page *page) 473 static inline void __ClearPageSlabPfmemalloc(struct page *page)
474 { 474 {
475 VM_BUG_ON_PAGE(!PageSlab(page), page); 475 VM_BUG_ON_PAGE(!PageSlab(page), page);
476 __ClearPageActive(page); 476 __ClearPageActive(page);
477 } 477 }
478 478
479 static inline void ClearPageSlabPfmemalloc(struct page *page) 479 static inline void ClearPageSlabPfmemalloc(struct page *page)
480 { 480 {
481 VM_BUG_ON_PAGE(!PageSlab(page), page); 481 VM_BUG_ON_PAGE(!PageSlab(page), page);
482 ClearPageActive(page); 482 ClearPageActive(page);
483 } 483 }
484 484
485 #ifdef CONFIG_MMU 485 #ifdef CONFIG_MMU
486 #define __PG_MLOCKED (1 << PG_mlocked) 486 #define __PG_MLOCKED (1 << PG_mlocked)
487 #else 487 #else
488 #define __PG_MLOCKED 0 488 #define __PG_MLOCKED 0
489 #endif 489 #endif
490 490
491 #ifdef CONFIG_TRANSPARENT_HUGEPAGE 491 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
492 #define __PG_COMPOUND_LOCK (1 << PG_compound_lock) 492 #define __PG_COMPOUND_LOCK (1 << PG_compound_lock)
493 #else 493 #else
494 #define __PG_COMPOUND_LOCK 0 494 #define __PG_COMPOUND_LOCK 0
495 #endif 495 #endif
496 496
497 /* 497 /*
498 * Flags checked when a page is freed. Pages being freed should not have 498 * Flags checked when a page is freed. Pages being freed should not have
499 * these flags set. It they are, there is a problem. 499 * these flags set. It they are, there is a problem.
500 */ 500 */
501 #define PAGE_FLAGS_CHECK_AT_FREE \ 501 #define PAGE_FLAGS_CHECK_AT_FREE \
502 (1 << PG_lru | 1 << PG_locked | \ 502 (1 << PG_lru | 1 << PG_locked | \
503 1 << PG_private | 1 << PG_private_2 | \ 503 1 << PG_private | 1 << PG_private_2 | \
504 1 << PG_writeback | 1 << PG_reserved | \ 504 1 << PG_writeback | 1 << PG_reserved | \
505 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \ 505 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \
506 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \ 506 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \
507 __PG_COMPOUND_LOCK) 507 __PG_COMPOUND_LOCK)
508 508
509 /* 509 /*
510 * Flags checked when a page is prepped for return by the page allocator. 510 * Flags checked when a page is prepped for return by the page allocator.
511 * Pages being prepped should not have any flags set. It they are set, 511 * Pages being prepped should not have any flags set. It they are set,
512 * there has been a kernel bug or struct page corruption. 512 * there has been a kernel bug or struct page corruption.
513 */ 513 */
514 #define PAGE_FLAGS_CHECK_AT_PREP ((1 << NR_PAGEFLAGS) - 1) 514 #define PAGE_FLAGS_CHECK_AT_PREP ((1 << NR_PAGEFLAGS) - 1)
515 515
516 #define PAGE_FLAGS_PRIVATE \ 516 #define PAGE_FLAGS_PRIVATE \
517 (1 << PG_private | 1 << PG_private_2) 517 (1 << PG_private | 1 << PG_private_2)
518 /** 518 /**
519 * page_has_private - Determine if page has private stuff 519 * page_has_private - Determine if page has private stuff
520 * @page: The page to be checked 520 * @page: The page to be checked
521 * 521 *
522 * Determine if a page has private stuff, indicating that release routines 522 * Determine if a page has private stuff, indicating that release routines
523 * should be invoked upon it. 523 * should be invoked upon it.
524 */ 524 */
525 static inline int page_has_private(struct page *page) 525 static inline int page_has_private(struct page *page)
526 { 526 {
527 return !!(page->flags & PAGE_FLAGS_PRIVATE); 527 return !!(page->flags & PAGE_FLAGS_PRIVATE);
528 } 528 }
529 529
530 #endif /* !__GENERATING_BOUNDS_H */ 530 #endif /* !__GENERATING_BOUNDS_H */
531 531
532 #endif /* PAGE_FLAGS_H */ 532 #endif /* PAGE_FLAGS_H */
533 533

mm/swap_state.c
1 /* 1 /*
2 * linux/mm/swap_state.c 2 * linux/mm/swap_state.c
3 * 3 *
4 * Copyright (C) 1991, 1992, 1993, 1994 Linus Torvalds 4 * Copyright (C) 1991, 1992, 1993, 1994 Linus Torvalds
5 * Swap reorganised 29.12.95, Stephen Tweedie 5 * Swap reorganised 29.12.95, Stephen Tweedie
6 * 6 *
7 * Rewritten to use page cache, (C) 1998 Stephen Tweedie 7 * Rewritten to use page cache, (C) 1998 Stephen Tweedie
8 */ 8 */
9 #include <linux/mm.h> 9 #include <linux/mm.h>
10 #include <linux/gfp.h> 10 #include <linux/gfp.h>
11 #include <linux/kernel_stat.h> 11 #include <linux/kernel_stat.h>
12 #include <linux/swap.h> 12 #include <linux/swap.h>
13 #include <linux/swapops.h> 13 #include <linux/swapops.h>
14 #include <linux/init.h> 14 #include <linux/init.h>
15 #include <linux/pagemap.h> 15 #include <linux/pagemap.h>
16 #include <linux/backing-dev.h> 16 #include <linux/backing-dev.h>
17 #include <linux/blkdev.h> 17 #include <linux/blkdev.h>
18 #include <linux/pagevec.h> 18 #include <linux/pagevec.h>
19 #include <linux/migrate.h> 19 #include <linux/migrate.h>
20 #include <linux/page_cgroup.h> 20 #include <linux/page_cgroup.h>
21 21
22 #include <asm/pgtable.h> 22 #include <asm/pgtable.h>
23 23
24 /* 24 /*
25 * swapper_space is a fiction, retained to simplify the path through 25 * swapper_space is a fiction, retained to simplify the path through
26 * vmscan's shrink_page_list. 26 * vmscan's shrink_page_list.
27 */ 27 */
28 static const struct address_space_operations swap_aops = { 28 static const struct address_space_operations swap_aops = {
29 .writepage = swap_writepage, 29 .writepage = swap_writepage,
30 .set_page_dirty = swap_set_page_dirty, 30 .set_page_dirty = swap_set_page_dirty,
31 .migratepage = migrate_page, 31 .migratepage = migrate_page,
32 }; 32 };
33 33
34 static struct backing_dev_info swap_backing_dev_info = { 34 static struct backing_dev_info swap_backing_dev_info = {
35 .name = "swap", 35 .name = "swap",
36 .capabilities = BDI_CAP_NO_ACCT_AND_WRITEBACK | BDI_CAP_SWAP_BACKED, 36 .capabilities = BDI_CAP_NO_ACCT_AND_WRITEBACK | BDI_CAP_SWAP_BACKED,
37 }; 37 };
38 38
39 struct address_space swapper_spaces[MAX_SWAPFILES] = { 39 struct address_space swapper_spaces[MAX_SWAPFILES] = {
40 [0 ... MAX_SWAPFILES - 1] = { 40 [0 ... MAX_SWAPFILES - 1] = {
41 .page_tree = RADIX_TREE_INIT(GFP_ATOMIC|__GFP_NOWARN), 41 .page_tree = RADIX_TREE_INIT(GFP_ATOMIC|__GFP_NOWARN),
42 .a_ops = &swap_aops, 42 .a_ops = &swap_aops,
43 .backing_dev_info = &swap_backing_dev_info, 43 .backing_dev_info = &swap_backing_dev_info,
44 } 44 }
45 }; 45 };
46 46
47 #define INC_CACHE_INFO(x) do { swap_cache_info.x++; } while (0) 47 #define INC_CACHE_INFO(x) do { swap_cache_info.x++; } while (0)
48 48
49 static struct { 49 static struct {
50 unsigned long add_total; 50 unsigned long add_total;
51 unsigned long del_total; 51 unsigned long del_total;
52 unsigned long find_success; 52 unsigned long find_success;
53 unsigned long find_total; 53 unsigned long find_total;
54 } swap_cache_info; 54 } swap_cache_info;
55 55
56 unsigned long total_swapcache_pages(void) 56 unsigned long total_swapcache_pages(void)
57 { 57 {
58 int i; 58 int i;
59 unsigned long ret = 0; 59 unsigned long ret = 0;
60 60
61 for (i = 0; i < MAX_SWAPFILES; i++) 61 for (i = 0; i < MAX_SWAPFILES; i++)
62 ret += swapper_spaces[i].nrpages; 62 ret += swapper_spaces[i].nrpages;
63 return ret; 63 return ret;
64 } 64 }
65 65
+66 static atomic_t swapin_readahead_hits = ATOMIC_INIT(4);
+67
66 void show_swap_cache_info(void) 68 void show_swap_cache_info(void)
67 { 69 {
68 printk("%lu pages in swap cache\n", total_swapcache_pages()); 70 printk("%lu pages in swap cache\n", total_swapcache_pages());
69 printk("Swap cache stats: add %lu, delete %lu, find %lu/%lu\n", 71 printk("Swap cache stats: add %lu, delete %lu, find %lu/%lu\n",
70 swap_cache_info.add_total, swap_cache_info.del_total, 72 swap_cache_info.add_total, swap_cache_info.del_total,
71 swap_cache_info.find_success, swap_cache_info.find_total); 73 swap_cache_info.find_success, swap_cache_info.find_total);
72 printk("Free swap = %ldkB\n", 74 printk("Free swap = %ldkB\n",
73 get_nr_swap_pages() << (PAGE_SHIFT - 10)); 75 get_nr_swap_pages() << (PAGE_SHIFT - 10));
74 printk("Total swap = %lukB\n", total_swap_pages << (PAGE_SHIFT - 10)); 76 printk("Total swap = %lukB\n", total_swap_pages << (PAGE_SHIFT - 10));
75 } 77 }
76 78
77 /* 79 /*
78 * __add_to_swap_cache resembles add_to_page_cache_locked on swapper_space, 80 * __add_to_swap_cache resembles add_to_page_cache_locked on swapper_space,
79 * but sets SwapCache flag and private instead of mapping and index. 81 * but sets SwapCache flag and private instead of mapping and index.
80 */ 82 */
81 int __add_to_swap_cache(struct page *page, swp_entry_t entry) 83 int __add_to_swap_cache(struct page *page, swp_entry_t entry)
82 { 84 {
83 int error; 85 int error;
84 struct address_space *address_space; 86 struct address_space *address_space;
85 87
86 VM_BUG_ON_PAGE(!PageLocked(page), page); 88 VM_BUG_ON_PAGE(!PageLocked(page), page);
87 VM_BUG_ON_PAGE(PageSwapCache(page), page); 89 VM_BUG_ON_PAGE(PageSwapCache(page), page);
88 VM_BUG_ON_PAGE(!PageSwapBacked(page), page); 90 VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
89 91
90 page_cache_get(page); 92 page_cache_get(page);
91 SetPageSwapCache(page); 93 SetPageSwapCache(page);
92 set_page_private(page, entry.val); 94 set_page_private(page, entry.val);
93 95
94 address_space = swap_address_space(entry); 96 address_space = swap_address_space(entry);
95 spin_lock_irq(&address_space->tree_lock); 97 spin_lock_irq(&address_space->tree_lock);
96 error = radix_tree_insert(&address_space->page_tree, 98 error = radix_tree_insert(&address_space->page_tree,
97 entry.val, page); 99 entry.val, page);
98 if (likely(!error)) { 100 if (likely(!error)) {
99 address_space->nrpages++; 101 address_space->nrpages++;
100 __inc_zone_page_state(page, NR_FILE_PAGES); 102 __inc_zone_page_state(page, NR_FILE_PAGES);
101 INC_CACHE_INFO(add_total); 103 INC_CACHE_INFO(add_total);
102 } 104 }
103 spin_unlock_irq(&address_space->tree_lock); 105 spin_unlock_irq(&address_space->tree_lock);
104 106
105 if (unlikely(error)) { 107 if (unlikely(error)) {
106 /* 108 /*
107 * Only the context which have set SWAP_HAS_CACHE flag 109 * Only the context which have set SWAP_HAS_CACHE flag
108 * would call add_to_swap_cache(). 110 * would call add_to_swap_cache().
109 * So add_to_swap_cache() doesn't returns -EEXIST. 111 * So add_to_swap_cache() doesn't returns -EEXIST.
110 */ 112 */
111 VM_BUG_ON(error == -EEXIST); 113 VM_BUG_ON(error == -EEXIST);
112 set_page_private(page, 0UL); 114 set_page_private(page, 0UL);
113 ClearPageSwapCache(page); 115 ClearPageSwapCache(page);
114 page_cache_release(page); 116 page_cache_release(page);
115 } 117 }
116 118
117 return error; 119 return error;
118 } 120 }
119 121
120 122
121 int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp_mask) 123 int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp_mask)
122 { 124 {
123 int error; 125 int error;
124 126
125 error = radix_tree_maybe_preload(gfp_mask); 127 error = radix_tree_maybe_preload(gfp_mask);
126 if (!error) { 128 if (!error) {
127 error = __add_to_swap_cache(page, entry); 129 error = __add_to_swap_cache(page, entry);
128 radix_tree_preload_end(); 130 radix_tree_preload_end();
129 } 131 }
130 return error; 132 return error;
131 } 133 }
132 134
133 /* 135 /*
134 * This must be called only on pages that have 136 * This must be called only on pages that have
135 * been verified to be in the swap cache. 137 * been verified to be in the swap cache.
136 */ 138 */
137 void __delete_from_swap_cache(struct page *page) 139 void __delete_from_swap_cache(struct page *page)
138 { 140 {
139 swp_entry_t entry; 141 swp_entry_t entry;
140 struct address_space *address_space; 142 struct address_space *address_space;
141 143
142 VM_BUG_ON_PAGE(!PageLocked(page), page); 144 VM_BUG_ON_PAGE(!PageLocked(page), page);
143 VM_BUG_ON_PAGE(!PageSwapCache(page), page); 145 VM_BUG_ON_PAGE(!PageSwapCache(page), page);
144 VM_BUG_ON_PAGE(PageWriteback(page), page); 146 VM_BUG_ON_PAGE(PageWriteback(page), page);
145 147
146 entry.val = page_private(page); 148 entry.val = page_private(page);
147 address_space = swap_address_space(entry); 149 address_space = swap_address_space(entry);
148 radix_tree_delete(&address_space->page_tree, page_private(page)); 150 radix_tree_delete(&address_space->page_tree, page_private(page));
149 set_page_private(page, 0); 151 set_page_private(page, 0);
150 ClearPageSwapCache(page); 152 ClearPageSwapCache(page);
151 address_space->nrpages--; 153 address_space->nrpages--;
152 __dec_zone_page_state(page, NR_FILE_PAGES); 154 __dec_zone_page_state(page, NR_FILE_PAGES);
153 INC_CACHE_INFO(del_total); 155 INC_CACHE_INFO(del_total);
154 } 156 }
155 157
156 /** 158 /**
157 * add_to_swap - allocate swap space for a page 159 * add_to_swap - allocate swap space for a page
158 * @page: page we want to move to swap 160 * @page: page we want to move to swap
159 * 161 *
160 * Allocate swap space for the page and add the page to the 162 * Allocate swap space for the page and add the page to the
161 * swap cache. Caller needs to hold the page lock. 163 * swap cache. Caller needs to hold the page lock.
162 */ 164 */
163 int add_to_swap(struct page *page, struct list_head *list) 165 int add_to_swap(struct page *page, struct list_head *list)
164 { 166 {
165 swp_entry_t entry; 167 swp_entry_t entry;
166 int err; 168 int err;
167 169
168 VM_BUG_ON_PAGE(!PageLocked(page), page); 170 VM_BUG_ON_PAGE(!PageLocked(page), page);
169 VM_BUG_ON_PAGE(!PageUptodate(page), page); 171 VM_BUG_ON_PAGE(!PageUptodate(page), page);
170 172
171 entry = get_swap_page(); 173 entry = get_swap_page();
172 if (!entry.val) 174 if (!entry.val)
173 return 0; 175 return 0;
174 176
175 if (unlikely(PageTransHuge(page))) 177 if (unlikely(PageTransHuge(page)))
176 if (unlikely(split_huge_page_to_list(page, list))) { 178 if (unlikely(split_huge_page_to_list(page, list))) {
177 swapcache_free(entry, NULL); 179 swapcache_free(entry, NULL);
178 return 0; 180 return 0;
179 } 181 }
180 182
181 /* 183 /*
182 * Radix-tree node allocations from PF_MEMALLOC contexts could 184 * Radix-tree node allocations from PF_MEMALLOC contexts could
183 * completely exhaust the page allocator. __GFP_NOMEMALLOC 185 * completely exhaust the page allocator. __GFP_NOMEMALLOC
184 * stops emergency reserves from being allocated. 186 * stops emergency reserves from being allocated.
185 * 187 *
186 * TODO: this could cause a theoretical memory reclaim 188 * TODO: this could cause a theoretical memory reclaim
187 * deadlock in the swap out path. 189 * deadlock in the swap out path.
188 */ 190 */
189 /* 191 /*
190 * Add it to the swap cache and mark it dirty 192 * Add it to the swap cache and mark it dirty
191 */ 193 */
192 err = add_to_swap_cache(page, entry, 194 err = add_to_swap_cache(page, entry,
193 __GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN); 195 __GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN);
194 196
195 if (!err) { /* Success */ 197 if (!err) { /* Success */
196 SetPageDirty(page); 198 SetPageDirty(page);
197 return 1; 199 return 1;
198 } else { /* -ENOMEM radix-tree allocation failure */ 200 } else { /* -ENOMEM radix-tree allocation failure */
199 /* 201 /*
200 * add_to_swap_cache() doesn't return -EEXIST, so we can safely 202 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
201 * clear SWAP_HAS_CACHE flag. 203 * clear SWAP_HAS_CACHE flag.
202 */ 204 */
203 swapcache_free(entry, NULL); 205 swapcache_free(entry, NULL);
204 return 0; 206 return 0;
205 } 207 }
206 } 208 }
207 209
208 /* 210 /*
209 * This must be called only on pages that have 211 * This must be called only on pages that have
210 * been verified to be in the swap cache and locked. 212 * been verified to be in the swap cache and locked.
211 * It will never put the page into the free list, 213 * It will never put the page into the free list,
212 * the caller has a reference on the page. 214 * the caller has a reference on the page.
213 */ 215 */
214 void delete_from_swap_cache(struct page *page) 216 void delete_from_swap_cache(struct page *page)
215 { 217 {
216 swp_entry_t entry; 218 swp_entry_t entry;
217 struct address_space *address_space; 219 struct address_space *address_space;
218 220
219 entry.val = page_private(page); 221 entry.val = page_private(page);
220 222
221 address_space = swap_address_space(entry); 223 address_space = swap_address_space(entry);
222 spin_lock_irq(&address_space->tree_lock); 224 spin_lock_irq(&address_space->tree_lock);
223 __delete_from_swap_cache(page); 225 __delete_from_swap_cache(page);
224 spin_unlock_irq(&address_space->tree_lock); 226 spin_unlock_irq(&address_space->tree_lock);
225 227
226 swapcache_free(entry, page); 228 swapcache_free(entry, page);
227 page_cache_release(page); 229 page_cache_release(page);
228 } 230 }
229 231
230 /* 232 /*
231 * If we are the only user, then try to free up the swap cache. 233 * If we are the only user, then try to free up the swap cache.
232 * 234 *
233 * Its ok to check for PageSwapCache without the page lock 235 * Its ok to check for PageSwapCache without the page lock
234 * here because we are going to recheck again inside 236 * here because we are going to recheck again inside
235 * try_to_free_swap() _with_ the lock. 237 * try_to_free_swap() _with_ the lock.
236 * - Marcelo 238 * - Marcelo
237 */ 239 */
238 static inline void free_swap_cache(struct page *page) 240 static inline void free_swap_cache(struct page *page)
239 { 241 {
240 if (PageSwapCache(page) && !page_mapped(page) && trylock_page(page)) { 242 if (PageSwapCache(page) && !page_mapped(page) && trylock_page(page)) {
241 try_to_free_swap(page); 243 try_to_free_swap(page);
242 unlock_page(page); 244 unlock_page(page);
243 } 245 }
244 } 246 }
245 247
246 /* 248 /*
247 * Perform a free_page(), also freeing any swap cache associated with 249 * Perform a free_page(), also freeing any swap cache associated with
248 * this page if it is the last user of the page. 250 * this page if it is the last user of the page.
249 */ 251 */
250 void free_page_and_swap_cache(struct page *page) 252 void free_page_and_swap_cache(struct page *page)
251 { 253 {
252 free_swap_cache(page); 254 free_swap_cache(page);
253 page_cache_release(page); 255 page_cache_release(page);
254 } 256 }
255 257
256 /* 258 /*
257 * Passed an array of pages, drop them all from swapcache and then release 259 * Passed an array of pages, drop them all from swapcache and then release
258 * them. They are removed from the LRU and freed if this is their last use. 260 * them. They are removed from the LRU and freed if this is their last use.
259 */ 261 */
260 void free_pages_and_swap_cache(struct page **pages, int nr) 262 void free_pages_and_swap_cache(struct page **pages, int nr)
261 { 263 {
262 struct page **pagep = pages; 264 struct page **pagep = pages;
263 265
264 lru_add_drain(); 266 lru_add_drain();
265 while (nr) { 267 while (nr) {
266 int todo = min(nr, PAGEVEC_SIZE); 268 int todo = min(nr, PAGEVEC_SIZE);
267 int i; 269 int i;
268 270
269 for (i = 0; i < todo; i++) 271 for (i = 0; i < todo; i++)
270 free_swap_cache(pagep[i]); 272 free_swap_cache(pagep[i]);
271 release_pages(pagep, todo, 0); 273 release_pages(pagep, todo, 0);
272 pagep += todo; 274 pagep += todo;
273 nr -= todo; 275 nr -= todo;
274 } 276 }
275 } 277 }
276 278
277 /* 279 /*
278 * Lookup a swap entry in the swap cache. A found page will be returned 280 * Lookup a swap entry in the swap cache. A found page will be returned
279 * unlocked and with its refcount incremented - we rely on the kernel 281 * unlocked and with its refcount incremented - we rely on the kernel
280 * lock getting page table operations atomic even if we drop the page 282 * lock getting page table operations atomic even if we drop the page
281 * lock before returning. 283 * lock before returning.
282 */ 284 */
283 struct page * lookup_swap_cache(swp_entry_t entry) 285 struct page * lookup_swap_cache(swp_entry_t entry)
284 { 286 {
285 struct page *page; 287 struct page *page;
286 288
287 page = find_get_page(swap_address_space(entry), entry.val); 289 page = find_get_page(swap_address_space(entry), entry.val);
288 290
-289 	if (page)
-290 		INC_CACHE_INFO(find_success);
+291 	if (page) {
+292 		INC_CACHE_INFO(find_success);
+293 		if (TestClearPageReadahead(page))
+294 			atomic_inc(&swapin_readahead_hits);
+295 	}
291 296
292 INC_CACHE_INFO(find_total); 297 INC_CACHE_INFO(find_total);
293 return page; 298 return page;
294 } 299 }
295 300
 /*
  * Locate a page of swap in physical memory, reserving swap cache space
  * and reading the disk if it is not already cached.
  * A failure return means that either the page allocation failed or that
  * the swap entry is no longer in use.
  */
 struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 			struct vm_area_struct *vma, unsigned long addr)
 {
 	struct page *found_page, *new_page = NULL;
 	int err;
 
 	do {
 		/*
 		 * First check the swap cache. Since this is normally
 		 * called after lookup_swap_cache() failed, re-calling
 		 * that would confuse statistics.
 		 */
 		found_page = find_get_page(swap_address_space(entry),
 					entry.val);
 		if (found_page)
 			break;
 
 		/*
 		 * Get a new page to read into from swap.
 		 */
 		if (!new_page) {
 			new_page = alloc_page_vma(gfp_mask, vma, addr);
 			if (!new_page)
 				break;		/* Out of memory */
 		}
 
 		/*
 		 * call radix_tree_preload() while we can wait.
 		 */
 		err = radix_tree_maybe_preload(gfp_mask & GFP_KERNEL);
 		if (err)
 			break;
 
 		/*
 		 * Swap entry may have been freed since our caller observed it.
 		 */
 		err = swapcache_prepare(entry);
 		if (err == -EEXIST) {
 			radix_tree_preload_end();
 			/*
 			 * We might race against get_swap_page() and stumble
 			 * across a SWAP_HAS_CACHE swap_map entry whose page
 			 * has not been brought into the swapcache yet, while
 			 * the other end is scheduled away waiting on discard
 			 * I/O completion at scan_swap_map().
 			 *
 			 * In order to avoid turning this transitory state
 			 * into a permanent loop around this -EEXIST case
 			 * if !CONFIG_PREEMPT and the I/O completion happens
 			 * to be waiting on the CPU waitqueue where we are now
 			 * busy looping, we just conditionally invoke the
 			 * scheduler here, if there are some more important
 			 * tasks to run.
 			 */
 			cond_resched();
 			continue;
 		}
 		if (err) {		/* swp entry is obsolete ? */
 			radix_tree_preload_end();
 			break;
 		}
 
 		/* May fail (-ENOMEM) if radix-tree node allocation failed. */
 		__set_page_locked(new_page);
 		SetPageSwapBacked(new_page);
 		err = __add_to_swap_cache(new_page, entry);
 		if (likely(!err)) {
 			radix_tree_preload_end();
 			/*
 			 * Initiate read into locked page and return.
 			 */
 			lru_cache_add_anon(new_page);
 			swap_readpage(new_page);
 			return new_page;
 		}
 		radix_tree_preload_end();
 		ClearPageSwapBacked(new_page);
 		__clear_page_locked(new_page);
 		/*
 		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
 		 * clear SWAP_HAS_CACHE flag.
 		 */
 		swapcache_free(entry, NULL);
 	} while (err != -ENOMEM);
 
 	if (new_page)
 		page_cache_release(new_page);
 	return found_page;
 }
 
+static unsigned long swapin_nr_pages(unsigned long offset)
+{
+	static unsigned long prev_offset;
+	unsigned int pages, max_pages, last_ra;
+	static atomic_t last_readahead_pages;
+
+	max_pages = 1 << ACCESS_ONCE(page_cluster);
+	if (max_pages <= 1)
+		return 1;
+
+	/*
+	 * This heuristic has been found to work well on both sequential and
+	 * random loads, swapping to hard disk or to SSD: please don't ask
+	 * what the "+ 2" means, it just happens to work well, that's all.
+	 */
+	pages = atomic_xchg(&swapin_readahead_hits, 0) + 2;
+	if (pages == 2) {
+		/*
+		 * We can have no readahead hits to judge by: but must not get
+		 * stuck here forever, so check for an adjacent offset instead
+		 * (and don't even bother to check whether swap type is same).
+		 */
+		if (offset != prev_offset + 1 && offset != prev_offset - 1)
+			pages = 1;
+		prev_offset = offset;
+	} else {
+		unsigned int roundup = 4;
+		while (roundup < pages)
+			roundup <<= 1;
+		pages = roundup;
+	}
+
+	if (pages > max_pages)
+		pages = max_pages;
+
+	/* Don't shrink readahead too fast */
+	last_ra = atomic_read(&last_readahead_pages) / 2;
+	if (pages < last_ra)
+		pages = last_ra;
+	atomic_set(&last_readahead_pages, pages);
+
+	return pages;
+}
+
 /**
  * swapin_readahead - swap in pages in hope we need them soon
  * @entry: swap entry of this memory
  * @gfp_mask: memory allocation flags
  * @vma: user vma this address belongs to
  * @addr: target address for mempolicy
  *
  * Returns the struct page for entry and addr, after queueing swapin.
  *
  * Primitive swap readahead code. We simply read an aligned block of
  * (1 << page_cluster) entries in the swap area. This method is chosen
  * because it doesn't cost us any seek time. We also make sure to queue
  * the 'original' request together with the readahead ones...
  *
  * This has been extended to use the NUMA policies from the mm triggering
  * the readahead.
  *
  * Caller must hold down_read on the vma->vm_mm if vma is not NULL.
  */
 struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 			struct vm_area_struct *vma, unsigned long addr)
 {
 	struct page *page;
-	unsigned long offset = swp_offset(entry);
+	unsigned long entry_offset = swp_offset(entry);
+	unsigned long offset = entry_offset;
 	unsigned long start_offset, end_offset;
-	unsigned long mask = (1UL << page_cluster) - 1;
+	unsigned long mask;
 	struct blk_plug plug;
 
+	mask = swapin_nr_pages(offset) - 1;
+	if (!mask)
+		goto skip;
+
 	/* Read a page_cluster sized and aligned cluster around offset. */
 	start_offset = offset & ~mask;
 	end_offset = offset | mask;
 	if (!start_offset)	/* First page is swap header. */
 		start_offset++;
 
 	blk_start_plug(&plug);
 	for (offset = start_offset; offset <= end_offset ; offset++) {
 		/* Ok, do the async read-ahead now */
 		page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
 						gfp_mask, vma, addr);
 		if (!page)
 			continue;
+		if (offset != entry_offset)
+			SetPageReadahead(page);
 		page_cache_release(page);
 	}
 	blk_finish_plug(&plug);
 
 	lru_add_drain();	/* Push any new pages onto the LRU now */
+skip:
 	return read_swap_cache_async(entry, gfp_mask, vma, addr);
 }
 
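For readers who want to trace how the window chosen by swapin_nr_pages() evolves, the standalone userspace sketch below re-implements the same arithmetic outside the kernel. It is an illustration only, not part of the patch: the atomics are dropped because it is single-threaded, page_cluster is a plain variable, and the per-fault hit counts fed to it are made up. The hit count stands in for the number of PageReadahead pages that lookup_swap_cache() found since the previous fault.

/*
 * Userspace sketch of the swapin_nr_pages() window heuristic.
 * Names mirror the kernel function; "hits" is a made-up input.
 */
#include <stdio.h>

static unsigned int page_cluster = 3;	/* default: up to 8-page window */
static unsigned int last_readahead_pages;
static unsigned long prev_offset;

static unsigned long swapin_nr_pages_sim(unsigned long offset,
					 unsigned int hits)
{
	unsigned int pages, max_pages, last_ra;

	max_pages = 1 << page_cluster;
	if (max_pages <= 1)
		return 1;

	/* hits seen since the last fault, plus the magic "+ 2" */
	pages = hits + 2;
	if (pages == 2) {
		/* no hits: only read ahead if the access looks sequential */
		if (offset != prev_offset + 1 && offset != prev_offset - 1)
			pages = 1;
		prev_offset = offset;
	} else {
		/* round up to a power of two, starting from 4 */
		unsigned int roundup = 4;
		while (roundup < pages)
			roundup <<= 1;
		pages = roundup;
	}

	if (pages > max_pages)
		pages = max_pages;

	/* don't shrink the window to less than half its previous size */
	last_ra = last_readahead_pages / 2;
	if (pages < last_ra)
		pages = last_ra;
	last_readahead_pages = pages;

	return pages;
}

int main(void)
{
	/* hypothetical per-fault hit counts: a good streak, then misses */
	unsigned int hits[] = { 0, 3, 7, 7, 2, 0, 0, 0 };
	unsigned long offset = 100;	/* deliberately non-sequential */
	unsigned int i;

	for (i = 0; i < sizeof(hits) / sizeof(hits[0]); i++, offset += 50)
		printf("fault %u: %u hits -> window %lu pages\n",
		       i, hits[i], swapin_nr_pages_sim(offset, hits[i]));
	return 0;
}

With page_cluster at its default of 3, the simulated window grows toward 8 pages while hits keep arriving, and on a run of misses it decays by at most half per fault (8, 4, 2, 1) instead of collapsing straight to a single page, which is exactly the "don't shrink readahead too fast" behaviour the new function implements.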