Commit 579f82901f6f41256642936d7e632f3979ad76d4
Committed by: Linus Torvalds
1 parent: fb951eb5e1
Exists in: master and 16 other branches
swap: add a simple detector for inappropriate swapin readahead
This is a patch to improve the swap readahead algorithm. It's from Hugh,
and I slightly changed it.

Hugh's original changelog:

swapin readahead does a blind readahead, whether or not the swapin is
sequential. This may be ok on harddisk, because large reads have
relatively small costs, and if the readahead pages are unneeded they can
be reclaimed easily - though, what if their allocation forced reclaim of
useful pages? But on SSD devices large reads are more expensive than
small ones: if the readahead pages are unneeded, reading them in causes
significant overhead.

This patch adds very simplistic random read detection. Stealing the
PageReadahead technique from Konstantin Khlebnikov's patch, avoiding the
vma/anon_vma sophistications of Shaohua Li's patch, swapin_nr_pages()
simply looks at readahead's current success rate, and narrows or widens
its readahead window accordingly. There is little science to its
heuristic: it's about as stupid as can be whilst remaining effective.

The table below shows elapsed times (in centiseconds) when running a
single repetitive swapping load across a 1000MB mapping in 900MB ram
with 1GB swap (the harddisk tests had taken painfully too long when I
used mem=500M, but SSD shows similar results for that).

Vanilla is the 3.6-rc7 kernel on which I started; Shaohua denotes his
Sep 3 patch in mmotm and linux-next; HughOld denotes my Oct 1 patch
which Shaohua showed to be defective; HughNew this Nov 14 patch, with
page_cluster as usual at default of 3 (8-page reads); HughPC4 this same
patch with page_cluster 4 (16-page reads); HughPC0 with page_cluster 0
(1-page reads: no readahead).

HDD for swapping to harddisk, SSD for swapping to VertexII SSD. Seq for
sequential access to the mapping, cycling five times around; Rand for
the same number of random touches. Anon for a MAP_PRIVATE anon mapping;
Shmem for a MAP_SHARED anon mapping, equivalent to tmpfs.

One weakness of Shaohua's vma/anon_vma approach was that it did not
optimize Shmem: seen below. Konstantin's approach was perhaps mistuned,
50% slower on Seq: did not compete and is not shown below.

HDD         Vanilla Shaohua HughOld HughNew HughPC4 HughPC0
Seq Anon      73921   76210   75611   76904   78191  121542
Seq Shmem     73601   73176   73855   72947   74543  118322
Rand Anon    895392  831243  871569  845197  846496  841680
Rand Shmem  1058375 1053486  827935  764955  764376  756489

SSD         Vanilla Shaohua HughOld HughNew HughPC4 HughPC0
Seq Anon      24634   24198   24673   25107   21614   70018
Seq Shmem     24959   24932   25052   25703   22030   69678
Rand Anon     43014   26146   28075   25989   26935   25901
Rand Shmem    45349   45215   28249   24268   24138   24332

These tests are, of course, two extremes of a very simple case: under
heavier mixed loads I've not yet observed any consistent improvement or
degradation, and wider testing would be welcome.

Shaohua Li:

Testing shows vanilla is slightly better than Hugh's patch in the
sequential workload. I observed that with Hugh's patch the readahead
size sometimes shrinks too fast (from 8 to 1 immediately) in the
sequential workload if there is no hit, and in that case continuing to
do readahead actually helps. I didn't prepare a sophisticated algorithm
for the sequential workload because so far we can't guarantee that
sequentially accessed pages are swapped out sequentially. So I slightly
changed Hugh's heuristic - don't shrink the readahead size too fast.

Here are my test results (in seconds, average of 3 runs):

            Vanilla    Hugh     New
Seq             356     370     360
Random         4525    2447    2444

The attached graph shows the swapin/swapout throughput I collected with
'vmstat 2'. The first part (up to around 1200 on the x-axis) runs a
random workload and the second part runs a sequential workload. With the
patch, swapin and swapout throughput are almost identical in steady
state in both workloads, which is the expected behavior; in vanilla,
swapin is much bigger than swapout, especially in the random workload
(because of wrong readahead).

Original patches by: Shaohua Li and Konstantin Khlebnikov.

[fengguang.wu@intel.com: swapin_nr_pages() can be static]
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
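The heuristic is small enough to sketch from the changelog alone. Below
is a minimal userspace C model of the window-sizing logic it describes:
hit-driven widening, an adjacency check when there are no hits, and
Shaohua's don't-shrink-too-fast clamp. It is not the kernel function
itself (that lands in mm/swap_state.c beyond the excerpt shown below);
the main() harness, the rounding details, and the non-atomic
last_readahead_pages are illustrative. The initial hit credit of 4
matches the ATOMIC_INIT(4) visible in the mm/swap_state.c hunk.

#include <stdatomic.h>
#include <stdio.h>

/* Mirrors the kernel's page_cluster tunable: 2^3 = 8-page window. */
static unsigned int page_cluster = 3;

/* Bumped by the fault path each time a readahead page is actually
 * used; seeded at 4, as in the hunk below. */
static atomic_uint swapin_readahead_hits = 4;

static unsigned int swapin_nr_pages(unsigned long offset)
{
        static unsigned long prev_offset;
        static unsigned int last_readahead_pages;
        unsigned int pages, max_pages;

        max_pages = 1u << page_cluster;
        if (max_pages <= 1)
                return 1;

        /* Consume the hits accumulated since the last fault. */
        pages = atomic_exchange(&swapin_readahead_hits, 0) + 2;
        if (pages == 2) {
                /* No hits to judge by: only keep reading ahead if this
                 * fault looks sequential (adjacent to the last one). */
                if (offset != prev_offset + 1 && offset != prev_offset - 1)
                        pages = 1;
                prev_offset = offset;
        } else {
                /* Round the window up to a power of two. */
                unsigned int roundup = 4;
                while (roundup < pages)
                        roundup <<= 1;
                pages = roundup;
        }

        if (pages > max_pages)
                pages = max_pages;

        /* Shaohua's tweak: never shrink by more than half per step. */
        if (pages < last_readahead_pages / 2)
                pages = last_readahead_pages / 2;
        last_readahead_pages = pages;

        return pages;
}

int main(void)
{
        printf("%u\n", swapin_nr_pages(1000)); /* initial credit -> 8 */
        printf("%u\n", swapin_nr_pages(5000)); /* miss, non-adjacent:
                                                  clamped to 4, not 1 */
        atomic_store(&swapin_readahead_hits, 6);
        printf("%u\n", swapin_nr_pages(5001)); /* hits again -> 8 */
        return 0;
}

The second call demonstrates the clamp: a pure miss would collapse the
window straight to 1 page, but the previous window of 8 limits the step
to 4, which is exactly the shrink-too-fast behavior Shaohua's tweak
addresses.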
Showing 2 changed files with 62 additions and 5 deletions
include/linux/page-flags.h
1 | /* | 1 | /* |
2 | * Macros for manipulating and testing page->flags | 2 | * Macros for manipulating and testing page->flags |
3 | */ | 3 | */ |
4 | 4 | ||
5 | #ifndef PAGE_FLAGS_H | 5 | #ifndef PAGE_FLAGS_H |
6 | #define PAGE_FLAGS_H | 6 | #define PAGE_FLAGS_H |
7 | 7 | ||
8 | #include <linux/types.h> | 8 | #include <linux/types.h> |
9 | #include <linux/bug.h> | 9 | #include <linux/bug.h> |
10 | #include <linux/mmdebug.h> | 10 | #include <linux/mmdebug.h> |
11 | #ifndef __GENERATING_BOUNDS_H | 11 | #ifndef __GENERATING_BOUNDS_H |
12 | #include <linux/mm_types.h> | 12 | #include <linux/mm_types.h> |
13 | #include <generated/bounds.h> | 13 | #include <generated/bounds.h> |
14 | #endif /* !__GENERATING_BOUNDS_H */ | 14 | #endif /* !__GENERATING_BOUNDS_H */ |
15 | 15 | ||
16 | /* | 16 | /* |
17 | * Various page->flags bits: | 17 | * Various page->flags bits: |
18 | * | 18 | * |
19 | * PG_reserved is set for special pages, which can never be swapped out. Some | 19 | * PG_reserved is set for special pages, which can never be swapped out. Some |
20 | * of them might not even exist (eg empty_bad_page)... | 20 | * of them might not even exist (eg empty_bad_page)... |
21 | * | 21 | * |
22 | * The PG_private bitflag is set on pagecache pages if they contain filesystem | 22 | * The PG_private bitflag is set on pagecache pages if they contain filesystem |
23 | * specific data (which is normally at page->private). It can be used by | 23 | * specific data (which is normally at page->private). It can be used by |
24 | * private allocations for its own usage. | 24 | * private allocations for its own usage. |
25 | * | 25 | * |
26 | * During initiation of disk I/O, PG_locked is set. This bit is set before I/O | 26 | * During initiation of disk I/O, PG_locked is set. This bit is set before I/O |
27 | * and cleared when writeback _starts_ or when read _completes_. PG_writeback | 27 | * and cleared when writeback _starts_ or when read _completes_. PG_writeback |
28 | * is set before writeback starts and cleared when it finishes. | 28 | * is set before writeback starts and cleared when it finishes. |
29 | * | 29 | * |
30 | * PG_locked also pins a page in pagecache, and blocks truncation of the file | 30 | * PG_locked also pins a page in pagecache, and blocks truncation of the file |
31 | * while it is held. | 31 | * while it is held. |
32 | * | 32 | * |
33 | * page_waitqueue(page) is a wait queue of all tasks waiting for the page | 33 | * page_waitqueue(page) is a wait queue of all tasks waiting for the page |
34 | * to become unlocked. | 34 | * to become unlocked. |
35 | * | 35 | * |
36 | * PG_uptodate tells whether the page's contents is valid. When a read | 36 | * PG_uptodate tells whether the page's contents is valid. When a read |
37 | * completes, the page becomes uptodate, unless a disk I/O error happened. | 37 | * completes, the page becomes uptodate, unless a disk I/O error happened. |
38 | * | 38 | * |
39 | * PG_referenced, PG_reclaim are used for page reclaim for anonymous and | 39 | * PG_referenced, PG_reclaim are used for page reclaim for anonymous and |
40 | * file-backed pagecache (see mm/vmscan.c). | 40 | * file-backed pagecache (see mm/vmscan.c). |
41 | * | 41 | * |
42 | * PG_error is set to indicate that an I/O error occurred on this page. | 42 | * PG_error is set to indicate that an I/O error occurred on this page. |
43 | * | 43 | * |
44 | * PG_arch_1 is an architecture specific page state bit. The generic code | 44 | * PG_arch_1 is an architecture specific page state bit. The generic code |
45 | * guarantees that this bit is cleared for a page when it first is entered into | 45 | * guarantees that this bit is cleared for a page when it first is entered into |
46 | * the page cache. | 46 | * the page cache. |
47 | * | 47 | * |
48 | * PG_highmem pages are not permanently mapped into the kernel virtual address | 48 | * PG_highmem pages are not permanently mapped into the kernel virtual address |
49 | * space, they need to be kmapped separately for doing IO on the pages. The | 49 | * space, they need to be kmapped separately for doing IO on the pages. The |
50 | * struct page (these bits with information) are always mapped into kernel | 50 | * struct page (these bits with information) are always mapped into kernel |
51 | * address space... | 51 | * address space... |
52 | * | 52 | * |
53 | * PG_hwpoison indicates that a page got corrupted in hardware and contains | 53 | * PG_hwpoison indicates that a page got corrupted in hardware and contains |
54 | * data with incorrect ECC bits that triggered a machine check. Accessing is | 54 | * data with incorrect ECC bits that triggered a machine check. Accessing is |
55 | * not safe since it may cause another machine check. Don't touch! | 55 | * not safe since it may cause another machine check. Don't touch! |
56 | */ | 56 | */ |
57 | 57 | ||
58 | /* | 58 | /* |
59 | * Don't use the *_dontuse flags. Use the macros. Otherwise you'll break | 59 | * Don't use the *_dontuse flags. Use the macros. Otherwise you'll break |
60 | * locked- and dirty-page accounting. | 60 | * locked- and dirty-page accounting. |
61 | * | 61 | * |
62 | * The page flags field is split into two parts, the main flags area | 62 | * The page flags field is split into two parts, the main flags area |
63 | * which extends from the low bits upwards, and the fields area which | 63 | * which extends from the low bits upwards, and the fields area which |
64 | * extends from the high bits downwards. | 64 | * extends from the high bits downwards. |
65 | * | 65 | * |
66 | * | FIELD | ... | FLAGS | | 66 | * | FIELD | ... | FLAGS | |
67 | * N-1 ^ 0 | 67 | * N-1 ^ 0 |
68 | * (NR_PAGEFLAGS) | 68 | * (NR_PAGEFLAGS) |
69 | * | 69 | * |
70 | * The fields area is reserved for fields mapping zone, node (for NUMA) and | 70 | * The fields area is reserved for fields mapping zone, node (for NUMA) and |
71 | * SPARSEMEM section (for variants of SPARSEMEM that require section ids like | 71 | * SPARSEMEM section (for variants of SPARSEMEM that require section ids like |
72 | * SPARSEMEM_EXTREME with !SPARSEMEM_VMEMMAP). | 72 | * SPARSEMEM_EXTREME with !SPARSEMEM_VMEMMAP). |
73 | */ | 73 | */ |
74 | enum pageflags { | 74 | enum pageflags { |
75 | PG_locked, /* Page is locked. Don't touch. */ | 75 | PG_locked, /* Page is locked. Don't touch. */ |
76 | PG_error, | 76 | PG_error, |
77 | PG_referenced, | 77 | PG_referenced, |
78 | PG_uptodate, | 78 | PG_uptodate, |
79 | PG_dirty, | 79 | PG_dirty, |
80 | PG_lru, | 80 | PG_lru, |
81 | PG_active, | 81 | PG_active, |
82 | PG_slab, | 82 | PG_slab, |
83 | PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/ | 83 | PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/ |
84 | PG_arch_1, | 84 | PG_arch_1, |
85 | PG_reserved, | 85 | PG_reserved, |
86 | PG_private, /* If pagecache, has fs-private data */ | 86 | PG_private, /* If pagecache, has fs-private data */ |
87 | PG_private_2, /* If pagecache, has fs aux data */ | 87 | PG_private_2, /* If pagecache, has fs aux data */ |
88 | PG_writeback, /* Page is under writeback */ | 88 | PG_writeback, /* Page is under writeback */ |
89 | #ifdef CONFIG_PAGEFLAGS_EXTENDED | 89 | #ifdef CONFIG_PAGEFLAGS_EXTENDED |
90 | PG_head, /* A head page */ | 90 | PG_head, /* A head page */ |
91 | PG_tail, /* A tail page */ | 91 | PG_tail, /* A tail page */ |
92 | #else | 92 | #else |
93 | PG_compound, /* A compound page */ | 93 | PG_compound, /* A compound page */ |
94 | #endif | 94 | #endif |
95 | PG_swapcache, /* Swap page: swp_entry_t in private */ | 95 | PG_swapcache, /* Swap page: swp_entry_t in private */ |
96 | PG_mappedtodisk, /* Has blocks allocated on-disk */ | 96 | PG_mappedtodisk, /* Has blocks allocated on-disk */ |
97 | PG_reclaim, /* To be reclaimed asap */ | 97 | PG_reclaim, /* To be reclaimed asap */ |
98 | PG_swapbacked, /* Page is backed by RAM/swap */ | 98 | PG_swapbacked, /* Page is backed by RAM/swap */ |
99 | PG_unevictable, /* Page is "unevictable" */ | 99 | PG_unevictable, /* Page is "unevictable" */ |
100 | #ifdef CONFIG_MMU | 100 | #ifdef CONFIG_MMU |
101 | PG_mlocked, /* Page is vma mlocked */ | 101 | PG_mlocked, /* Page is vma mlocked */ |
102 | #endif | 102 | #endif |
103 | #ifdef CONFIG_ARCH_USES_PG_UNCACHED | 103 | #ifdef CONFIG_ARCH_USES_PG_UNCACHED |
104 | PG_uncached, /* Page has been mapped as uncached */ | 104 | PG_uncached, /* Page has been mapped as uncached */ |
105 | #endif | 105 | #endif |
106 | #ifdef CONFIG_MEMORY_FAILURE | 106 | #ifdef CONFIG_MEMORY_FAILURE |
107 | PG_hwpoison, /* hardware poisoned page. Don't touch */ | 107 | PG_hwpoison, /* hardware poisoned page. Don't touch */ |
108 | #endif | 108 | #endif |
109 | #ifdef CONFIG_TRANSPARENT_HUGEPAGE | 109 | #ifdef CONFIG_TRANSPARENT_HUGEPAGE |
110 | PG_compound_lock, | 110 | PG_compound_lock, |
111 | #endif | 111 | #endif |
112 | __NR_PAGEFLAGS, | 112 | __NR_PAGEFLAGS, |
113 | 113 | ||
114 | /* Filesystems */ | 114 | /* Filesystems */ |
115 | PG_checked = PG_owner_priv_1, | 115 | PG_checked = PG_owner_priv_1, |
116 | 116 | ||
117 | /* Two page bits are conscripted by FS-Cache to maintain local caching | 117 | /* Two page bits are conscripted by FS-Cache to maintain local caching |
118 | * state. These bits are set on pages belonging to the netfs's inodes | 118 | * state. These bits are set on pages belonging to the netfs's inodes |
119 | * when those inodes are being locally cached. | 119 | * when those inodes are being locally cached. |
120 | */ | 120 | */ |
121 | PG_fscache = PG_private_2, /* page backed by cache */ | 121 | PG_fscache = PG_private_2, /* page backed by cache */ |
122 | 122 | ||
123 | /* XEN */ | 123 | /* XEN */ |
124 | PG_pinned = PG_owner_priv_1, | 124 | PG_pinned = PG_owner_priv_1, |
125 | PG_savepinned = PG_dirty, | 125 | PG_savepinned = PG_dirty, |
126 | 126 | ||
127 | /* SLOB */ | 127 | /* SLOB */ |
128 | PG_slob_free = PG_private, | 128 | PG_slob_free = PG_private, |
129 | }; | 129 | }; |
130 | 130 | ||
131 | #ifndef __GENERATING_BOUNDS_H | 131 | #ifndef __GENERATING_BOUNDS_H |
132 | 132 | ||
133 | /* | 133 | /* |
134 | * Macros to create function definitions for page flags | 134 | * Macros to create function definitions for page flags |
135 | */ | 135 | */ |
136 | #define TESTPAGEFLAG(uname, lname) \ | 136 | #define TESTPAGEFLAG(uname, lname) \ |
137 | static inline int Page##uname(const struct page *page) \ | 137 | static inline int Page##uname(const struct page *page) \ |
138 | { return test_bit(PG_##lname, &page->flags); } | 138 | { return test_bit(PG_##lname, &page->flags); } |
139 | 139 | ||
140 | #define SETPAGEFLAG(uname, lname) \ | 140 | #define SETPAGEFLAG(uname, lname) \ |
141 | static inline void SetPage##uname(struct page *page) \ | 141 | static inline void SetPage##uname(struct page *page) \ |
142 | { set_bit(PG_##lname, &page->flags); } | 142 | { set_bit(PG_##lname, &page->flags); } |
143 | 143 | ||
144 | #define CLEARPAGEFLAG(uname, lname) \ | 144 | #define CLEARPAGEFLAG(uname, lname) \ |
145 | static inline void ClearPage##uname(struct page *page) \ | 145 | static inline void ClearPage##uname(struct page *page) \ |
146 | { clear_bit(PG_##lname, &page->flags); } | 146 | { clear_bit(PG_##lname, &page->flags); } |
147 | 147 | ||
148 | #define __SETPAGEFLAG(uname, lname) \ | 148 | #define __SETPAGEFLAG(uname, lname) \ |
149 | static inline void __SetPage##uname(struct page *page) \ | 149 | static inline void __SetPage##uname(struct page *page) \ |
150 | { __set_bit(PG_##lname, &page->flags); } | 150 | { __set_bit(PG_##lname, &page->flags); } |
151 | 151 | ||
152 | #define __CLEARPAGEFLAG(uname, lname) \ | 152 | #define __CLEARPAGEFLAG(uname, lname) \ |
153 | static inline void __ClearPage##uname(struct page *page) \ | 153 | static inline void __ClearPage##uname(struct page *page) \ |
154 | { __clear_bit(PG_##lname, &page->flags); } | 154 | { __clear_bit(PG_##lname, &page->flags); } |
155 | 155 | ||
156 | #define TESTSETFLAG(uname, lname) \ | 156 | #define TESTSETFLAG(uname, lname) \ |
157 | static inline int TestSetPage##uname(struct page *page) \ | 157 | static inline int TestSetPage##uname(struct page *page) \ |
158 | { return test_and_set_bit(PG_##lname, &page->flags); } | 158 | { return test_and_set_bit(PG_##lname, &page->flags); } |
159 | 159 | ||
160 | #define TESTCLEARFLAG(uname, lname) \ | 160 | #define TESTCLEARFLAG(uname, lname) \ |
161 | static inline int TestClearPage##uname(struct page *page) \ | 161 | static inline int TestClearPage##uname(struct page *page) \ |
162 | { return test_and_clear_bit(PG_##lname, &page->flags); } | 162 | { return test_and_clear_bit(PG_##lname, &page->flags); } |
163 | 163 | ||
164 | #define __TESTCLEARFLAG(uname, lname) \ | 164 | #define __TESTCLEARFLAG(uname, lname) \ |
165 | static inline int __TestClearPage##uname(struct page *page) \ | 165 | static inline int __TestClearPage##uname(struct page *page) \ |
166 | { return __test_and_clear_bit(PG_##lname, &page->flags); } | 166 | { return __test_and_clear_bit(PG_##lname, &page->flags); } |
167 | 167 | ||
168 | #define PAGEFLAG(uname, lname) TESTPAGEFLAG(uname, lname) \ | 168 | #define PAGEFLAG(uname, lname) TESTPAGEFLAG(uname, lname) \ |
169 | SETPAGEFLAG(uname, lname) CLEARPAGEFLAG(uname, lname) | 169 | SETPAGEFLAG(uname, lname) CLEARPAGEFLAG(uname, lname) |
170 | 170 | ||
171 | #define __PAGEFLAG(uname, lname) TESTPAGEFLAG(uname, lname) \ | 171 | #define __PAGEFLAG(uname, lname) TESTPAGEFLAG(uname, lname) \ |
172 | __SETPAGEFLAG(uname, lname) __CLEARPAGEFLAG(uname, lname) | 172 | __SETPAGEFLAG(uname, lname) __CLEARPAGEFLAG(uname, lname) |
173 | 173 | ||
174 | #define PAGEFLAG_FALSE(uname) \ | 174 | #define PAGEFLAG_FALSE(uname) \ |
175 | static inline int Page##uname(const struct page *page) \ | 175 | static inline int Page##uname(const struct page *page) \ |
176 | { return 0; } | 176 | { return 0; } |
177 | 177 | ||
178 | #define TESTSCFLAG(uname, lname) \ | 178 | #define TESTSCFLAG(uname, lname) \ |
179 | TESTSETFLAG(uname, lname) TESTCLEARFLAG(uname, lname) | 179 | TESTSETFLAG(uname, lname) TESTCLEARFLAG(uname, lname) |
180 | 180 | ||
181 | #define SETPAGEFLAG_NOOP(uname) \ | 181 | #define SETPAGEFLAG_NOOP(uname) \ |
182 | static inline void SetPage##uname(struct page *page) { } | 182 | static inline void SetPage##uname(struct page *page) { } |
183 | 183 | ||
184 | #define CLEARPAGEFLAG_NOOP(uname) \ | 184 | #define CLEARPAGEFLAG_NOOP(uname) \ |
185 | static inline void ClearPage##uname(struct page *page) { } | 185 | static inline void ClearPage##uname(struct page *page) { } |
186 | 186 | ||
187 | #define __CLEARPAGEFLAG_NOOP(uname) \ | 187 | #define __CLEARPAGEFLAG_NOOP(uname) \ |
188 | static inline void __ClearPage##uname(struct page *page) { } | 188 | static inline void __ClearPage##uname(struct page *page) { } |
189 | 189 | ||
190 | #define TESTCLEARFLAG_FALSE(uname) \ | 190 | #define TESTCLEARFLAG_FALSE(uname) \ |
191 | static inline int TestClearPage##uname(struct page *page) { return 0; } | 191 | static inline int TestClearPage##uname(struct page *page) { return 0; } |
192 | 192 | ||
193 | #define __TESTCLEARFLAG_FALSE(uname) \ | 193 | #define __TESTCLEARFLAG_FALSE(uname) \ |
194 | static inline int __TestClearPage##uname(struct page *page) { return 0; } | 194 | static inline int __TestClearPage##uname(struct page *page) { return 0; } |
195 | 195 | ||
196 | struct page; /* forward declaration */ | 196 | struct page; /* forward declaration */ |
197 | 197 | ||
198 | TESTPAGEFLAG(Locked, locked) | 198 | TESTPAGEFLAG(Locked, locked) |
199 | PAGEFLAG(Error, error) TESTCLEARFLAG(Error, error) | 199 | PAGEFLAG(Error, error) TESTCLEARFLAG(Error, error) |
200 | PAGEFLAG(Referenced, referenced) TESTCLEARFLAG(Referenced, referenced) | 200 | PAGEFLAG(Referenced, referenced) TESTCLEARFLAG(Referenced, referenced) |
201 | PAGEFLAG(Dirty, dirty) TESTSCFLAG(Dirty, dirty) __CLEARPAGEFLAG(Dirty, dirty) | 201 | PAGEFLAG(Dirty, dirty) TESTSCFLAG(Dirty, dirty) __CLEARPAGEFLAG(Dirty, dirty) |
202 | PAGEFLAG(LRU, lru) __CLEARPAGEFLAG(LRU, lru) | 202 | PAGEFLAG(LRU, lru) __CLEARPAGEFLAG(LRU, lru) |
203 | PAGEFLAG(Active, active) __CLEARPAGEFLAG(Active, active) | 203 | PAGEFLAG(Active, active) __CLEARPAGEFLAG(Active, active) |
204 | TESTCLEARFLAG(Active, active) | 204 | TESTCLEARFLAG(Active, active) |
205 | __PAGEFLAG(Slab, slab) | 205 | __PAGEFLAG(Slab, slab) |
206 | PAGEFLAG(Checked, checked) /* Used by some filesystems */ | 206 | PAGEFLAG(Checked, checked) /* Used by some filesystems */ |
207 | PAGEFLAG(Pinned, pinned) TESTSCFLAG(Pinned, pinned) /* Xen */ | 207 | PAGEFLAG(Pinned, pinned) TESTSCFLAG(Pinned, pinned) /* Xen */ |
208 | PAGEFLAG(SavePinned, savepinned); /* Xen */ | 208 | PAGEFLAG(SavePinned, savepinned); /* Xen */ |
209 | PAGEFLAG(Reserved, reserved) __CLEARPAGEFLAG(Reserved, reserved) | 209 | PAGEFLAG(Reserved, reserved) __CLEARPAGEFLAG(Reserved, reserved) |
210 | PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked) | 210 | PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked) |
211 | 211 | ||
212 | __PAGEFLAG(SlobFree, slob_free) | 212 | __PAGEFLAG(SlobFree, slob_free) |
213 | 213 | ||
214 | /* | 214 | /* |
215 | * Private page markings that may be used by the filesystem that owns the page | 215 | * Private page markings that may be used by the filesystem that owns the page |
216 | * for its own purposes. | 216 | * for its own purposes. |
217 | * - PG_private and PG_private_2 cause releasepage() and co to be invoked | 217 | * - PG_private and PG_private_2 cause releasepage() and co to be invoked |
218 | */ | 218 | */ |
219 | PAGEFLAG(Private, private) __SETPAGEFLAG(Private, private) | 219 | PAGEFLAG(Private, private) __SETPAGEFLAG(Private, private) |
220 | __CLEARPAGEFLAG(Private, private) | 220 | __CLEARPAGEFLAG(Private, private) |
221 | PAGEFLAG(Private2, private_2) TESTSCFLAG(Private2, private_2) | 221 | PAGEFLAG(Private2, private_2) TESTSCFLAG(Private2, private_2) |
222 | PAGEFLAG(OwnerPriv1, owner_priv_1) TESTCLEARFLAG(OwnerPriv1, owner_priv_1) | 222 | PAGEFLAG(OwnerPriv1, owner_priv_1) TESTCLEARFLAG(OwnerPriv1, owner_priv_1) |
223 | 223 | ||
224 | /* | 224 | /* |
225 | * Only test-and-set exist for PG_writeback. The unconditional operators are | 225 | * Only test-and-set exist for PG_writeback. The unconditional operators are |
226 | * risky: they bypass page accounting. | 226 | * risky: they bypass page accounting. |
227 | */ | 227 | */ |
228 | TESTPAGEFLAG(Writeback, writeback) TESTSCFLAG(Writeback, writeback) | 228 | TESTPAGEFLAG(Writeback, writeback) TESTSCFLAG(Writeback, writeback) |
229 | PAGEFLAG(MappedToDisk, mappedtodisk) | 229 | PAGEFLAG(MappedToDisk, mappedtodisk) |
230 | 230 | ||
231 | /* PG_readahead is only used for file reads; PG_reclaim is only for writes */ | 231 | /* PG_readahead is only used for reads; PG_reclaim is only for writes */ |
232 | PAGEFLAG(Reclaim, reclaim) TESTCLEARFLAG(Reclaim, reclaim) | 232 | PAGEFLAG(Reclaim, reclaim) TESTCLEARFLAG(Reclaim, reclaim) |
233 | PAGEFLAG(Readahead, reclaim) /* Reminder to do async read-ahead */ | 233 | PAGEFLAG(Readahead, reclaim) TESTCLEARFLAG(Readahead, reclaim) |
234 | 234 | ||
235 | #ifdef CONFIG_HIGHMEM | 235 | #ifdef CONFIG_HIGHMEM |
236 | /* | 236 | /* |
237 | * Must use a macro here due to header dependency issues. page_zone() is not | 237 | * Must use a macro here due to header dependency issues. page_zone() is not |
238 | * available at this point. | 238 | * available at this point. |
239 | */ | 239 | */ |
240 | #define PageHighMem(__p) is_highmem(page_zone(__p)) | 240 | #define PageHighMem(__p) is_highmem(page_zone(__p)) |
241 | #else | 241 | #else |
242 | PAGEFLAG_FALSE(HighMem) | 242 | PAGEFLAG_FALSE(HighMem) |
243 | #endif | 243 | #endif |
244 | 244 | ||
245 | #ifdef CONFIG_SWAP | 245 | #ifdef CONFIG_SWAP |
246 | PAGEFLAG(SwapCache, swapcache) | 246 | PAGEFLAG(SwapCache, swapcache) |
247 | #else | 247 | #else |
248 | PAGEFLAG_FALSE(SwapCache) | 248 | PAGEFLAG_FALSE(SwapCache) |
249 | SETPAGEFLAG_NOOP(SwapCache) CLEARPAGEFLAG_NOOP(SwapCache) | 249 | SETPAGEFLAG_NOOP(SwapCache) CLEARPAGEFLAG_NOOP(SwapCache) |
250 | #endif | 250 | #endif |
251 | 251 | ||
252 | PAGEFLAG(Unevictable, unevictable) __CLEARPAGEFLAG(Unevictable, unevictable) | 252 | PAGEFLAG(Unevictable, unevictable) __CLEARPAGEFLAG(Unevictable, unevictable) |
253 | TESTCLEARFLAG(Unevictable, unevictable) | 253 | TESTCLEARFLAG(Unevictable, unevictable) |
254 | 254 | ||
255 | #ifdef CONFIG_MMU | 255 | #ifdef CONFIG_MMU |
256 | PAGEFLAG(Mlocked, mlocked) __CLEARPAGEFLAG(Mlocked, mlocked) | 256 | PAGEFLAG(Mlocked, mlocked) __CLEARPAGEFLAG(Mlocked, mlocked) |
257 | TESTSCFLAG(Mlocked, mlocked) __TESTCLEARFLAG(Mlocked, mlocked) | 257 | TESTSCFLAG(Mlocked, mlocked) __TESTCLEARFLAG(Mlocked, mlocked) |
258 | #else | 258 | #else |
259 | PAGEFLAG_FALSE(Mlocked) SETPAGEFLAG_NOOP(Mlocked) | 259 | PAGEFLAG_FALSE(Mlocked) SETPAGEFLAG_NOOP(Mlocked) |
260 | TESTCLEARFLAG_FALSE(Mlocked) __TESTCLEARFLAG_FALSE(Mlocked) | 260 | TESTCLEARFLAG_FALSE(Mlocked) __TESTCLEARFLAG_FALSE(Mlocked) |
261 | #endif | 261 | #endif |
262 | 262 | ||
263 | #ifdef CONFIG_ARCH_USES_PG_UNCACHED | 263 | #ifdef CONFIG_ARCH_USES_PG_UNCACHED |
264 | PAGEFLAG(Uncached, uncached) | 264 | PAGEFLAG(Uncached, uncached) |
265 | #else | 265 | #else |
266 | PAGEFLAG_FALSE(Uncached) | 266 | PAGEFLAG_FALSE(Uncached) |
267 | #endif | 267 | #endif |
268 | 268 | ||
269 | #ifdef CONFIG_MEMORY_FAILURE | 269 | #ifdef CONFIG_MEMORY_FAILURE |
270 | PAGEFLAG(HWPoison, hwpoison) | 270 | PAGEFLAG(HWPoison, hwpoison) |
271 | TESTSCFLAG(HWPoison, hwpoison) | 271 | TESTSCFLAG(HWPoison, hwpoison) |
272 | #define __PG_HWPOISON (1UL << PG_hwpoison) | 272 | #define __PG_HWPOISON (1UL << PG_hwpoison) |
273 | #else | 273 | #else |
274 | PAGEFLAG_FALSE(HWPoison) | 274 | PAGEFLAG_FALSE(HWPoison) |
275 | #define __PG_HWPOISON 0 | 275 | #define __PG_HWPOISON 0 |
276 | #endif | 276 | #endif |
277 | 277 | ||
278 | u64 stable_page_flags(struct page *page); | 278 | u64 stable_page_flags(struct page *page); |
279 | 279 | ||
280 | static inline int PageUptodate(struct page *page) | 280 | static inline int PageUptodate(struct page *page) |
281 | { | 281 | { |
282 | int ret = test_bit(PG_uptodate, &(page)->flags); | 282 | int ret = test_bit(PG_uptodate, &(page)->flags); |
283 | 283 | ||
284 | /* | 284 | /* |
285 | * Must ensure that the data we read out of the page is loaded | 285 | * Must ensure that the data we read out of the page is loaded |
286 | * _after_ we've loaded page->flags to check for PageUptodate. | 286 | * _after_ we've loaded page->flags to check for PageUptodate. |
287 | * We can skip the barrier if the page is not uptodate, because | 287 | * We can skip the barrier if the page is not uptodate, because |
288 | * we wouldn't be reading anything from it. | 288 | * we wouldn't be reading anything from it. |
289 | * | 289 | * |
290 | * See SetPageUptodate() for the other side of the story. | 290 | * See SetPageUptodate() for the other side of the story. |
291 | */ | 291 | */ |
292 | if (ret) | 292 | if (ret) |
293 | smp_rmb(); | 293 | smp_rmb(); |
294 | 294 | ||
295 | return ret; | 295 | return ret; |
296 | } | 296 | } |
297 | 297 | ||
298 | static inline void __SetPageUptodate(struct page *page) | 298 | static inline void __SetPageUptodate(struct page *page) |
299 | { | 299 | { |
300 | smp_wmb(); | 300 | smp_wmb(); |
301 | __set_bit(PG_uptodate, &(page)->flags); | 301 | __set_bit(PG_uptodate, &(page)->flags); |
302 | } | 302 | } |
303 | 303 | ||
304 | static inline void SetPageUptodate(struct page *page) | 304 | static inline void SetPageUptodate(struct page *page) |
305 | { | 305 | { |
306 | /* | 306 | /* |
307 | * Memory barrier must be issued before setting the PG_uptodate bit, | 307 | * Memory barrier must be issued before setting the PG_uptodate bit, |
308 | * so that all previous stores issued in order to bring the page | 308 | * so that all previous stores issued in order to bring the page |
309 | * uptodate are actually visible before PageUptodate becomes true. | 309 | * uptodate are actually visible before PageUptodate becomes true. |
310 | */ | 310 | */ |
311 | smp_wmb(); | 311 | smp_wmb(); |
312 | set_bit(PG_uptodate, &(page)->flags); | 312 | set_bit(PG_uptodate, &(page)->flags); |
313 | } | 313 | } |
314 | 314 | ||
315 | CLEARPAGEFLAG(Uptodate, uptodate) | 315 | CLEARPAGEFLAG(Uptodate, uptodate) |
316 | 316 | ||
317 | extern void cancel_dirty_page(struct page *page, unsigned int account_size); | 317 | extern void cancel_dirty_page(struct page *page, unsigned int account_size); |
318 | 318 | ||
319 | int test_clear_page_writeback(struct page *page); | 319 | int test_clear_page_writeback(struct page *page); |
320 | int test_set_page_writeback(struct page *page); | 320 | int test_set_page_writeback(struct page *page); |
321 | 321 | ||
322 | static inline void set_page_writeback(struct page *page) | 322 | static inline void set_page_writeback(struct page *page) |
323 | { | 323 | { |
324 | test_set_page_writeback(page); | 324 | test_set_page_writeback(page); |
325 | } | 325 | } |
326 | 326 | ||
327 | #ifdef CONFIG_PAGEFLAGS_EXTENDED | 327 | #ifdef CONFIG_PAGEFLAGS_EXTENDED |
328 | /* | 328 | /* |
329 | * System with lots of page flags available. This allows separate | 329 | * System with lots of page flags available. This allows separate |
330 | * flags for PageHead() and PageTail() checks of compound pages so that bit | 330 | * flags for PageHead() and PageTail() checks of compound pages so that bit |
331 | * tests can be used in performance sensitive paths. PageCompound is | 331 | * tests can be used in performance sensitive paths. PageCompound is |
332 | * generally not used in hot code paths except arch/powerpc/mm/init_64.c | 332 | * generally not used in hot code paths except arch/powerpc/mm/init_64.c |
333 | * and arch/powerpc/kvm/book3s_64_vio_hv.c which use it to detect huge pages | 333 | * and arch/powerpc/kvm/book3s_64_vio_hv.c which use it to detect huge pages |
334 | * and avoid handling those in real mode. | 334 | * and avoid handling those in real mode. |
335 | */ | 335 | */ |
336 | __PAGEFLAG(Head, head) CLEARPAGEFLAG(Head, head) | 336 | __PAGEFLAG(Head, head) CLEARPAGEFLAG(Head, head) |
337 | __PAGEFLAG(Tail, tail) | 337 | __PAGEFLAG(Tail, tail) |
338 | 338 | ||
339 | static inline int PageCompound(struct page *page) | 339 | static inline int PageCompound(struct page *page) |
340 | { | 340 | { |
341 | return page->flags & ((1L << PG_head) | (1L << PG_tail)); | 341 | return page->flags & ((1L << PG_head) | (1L << PG_tail)); |
342 | 342 | ||
343 | } | 343 | } |
344 | #ifdef CONFIG_TRANSPARENT_HUGEPAGE | 344 | #ifdef CONFIG_TRANSPARENT_HUGEPAGE |
345 | static inline void ClearPageCompound(struct page *page) | 345 | static inline void ClearPageCompound(struct page *page) |
346 | { | 346 | { |
347 | BUG_ON(!PageHead(page)); | 347 | BUG_ON(!PageHead(page)); |
348 | ClearPageHead(page); | 348 | ClearPageHead(page); |
349 | } | 349 | } |
350 | #endif | 350 | #endif |
351 | #else | 351 | #else |
352 | /* | 352 | /* |
353 | * Reduce page flag use as much as possible by overlapping | 353 | * Reduce page flag use as much as possible by overlapping |
354 | * compound page flags with the flags used for page cache pages. Possible | 354 | * compound page flags with the flags used for page cache pages. Possible |
355 | * because PageCompound is always set for compound pages and not for | 355 | * because PageCompound is always set for compound pages and not for |
356 | * pages on the LRU and/or pagecache. | 356 | * pages on the LRU and/or pagecache. |
357 | */ | 357 | */ |
358 | TESTPAGEFLAG(Compound, compound) | 358 | TESTPAGEFLAG(Compound, compound) |
359 | __SETPAGEFLAG(Head, compound) __CLEARPAGEFLAG(Head, compound) | 359 | __SETPAGEFLAG(Head, compound) __CLEARPAGEFLAG(Head, compound) |
360 | 360 | ||
361 | /* | 361 | /* |
362 | * PG_reclaim is used in combination with PG_compound to mark the | 362 | * PG_reclaim is used in combination with PG_compound to mark the |
363 | * head and tail of a compound page. This saves one page flag | 363 | * head and tail of a compound page. This saves one page flag |
364 | * but makes it impossible to use compound pages for the page cache. | 364 | * but makes it impossible to use compound pages for the page cache. |
365 | * The PG_reclaim bit would have to be used for reclaim or readahead | 365 | * The PG_reclaim bit would have to be used for reclaim or readahead |
366 | * if compound pages enter the page cache. | 366 | * if compound pages enter the page cache. |
367 | * | 367 | * |
368 | * PG_compound & PG_reclaim => Tail page | 368 | * PG_compound & PG_reclaim => Tail page |
369 | * PG_compound & ~PG_reclaim => Head page | 369 | * PG_compound & ~PG_reclaim => Head page |
370 | */ | 370 | */ |
371 | #define PG_head_mask ((1L << PG_compound)) | 371 | #define PG_head_mask ((1L << PG_compound)) |
372 | #define PG_head_tail_mask ((1L << PG_compound) | (1L << PG_reclaim)) | 372 | #define PG_head_tail_mask ((1L << PG_compound) | (1L << PG_reclaim)) |
373 | 373 | ||
374 | static inline int PageHead(struct page *page) | 374 | static inline int PageHead(struct page *page) |
375 | { | 375 | { |
376 | return ((page->flags & PG_head_tail_mask) == PG_head_mask); | 376 | return ((page->flags & PG_head_tail_mask) == PG_head_mask); |
377 | } | 377 | } |
378 | 378 | ||
379 | static inline int PageTail(struct page *page) | 379 | static inline int PageTail(struct page *page) |
380 | { | 380 | { |
381 | return ((page->flags & PG_head_tail_mask) == PG_head_tail_mask); | 381 | return ((page->flags & PG_head_tail_mask) == PG_head_tail_mask); |
382 | } | 382 | } |
383 | 383 | ||
384 | static inline void __SetPageTail(struct page *page) | 384 | static inline void __SetPageTail(struct page *page) |
385 | { | 385 | { |
386 | page->flags |= PG_head_tail_mask; | 386 | page->flags |= PG_head_tail_mask; |
387 | } | 387 | } |
388 | 388 | ||
389 | static inline void __ClearPageTail(struct page *page) | 389 | static inline void __ClearPageTail(struct page *page) |
390 | { | 390 | { |
391 | page->flags &= ~PG_head_tail_mask; | 391 | page->flags &= ~PG_head_tail_mask; |
392 | } | 392 | } |
393 | 393 | ||
394 | #ifdef CONFIG_TRANSPARENT_HUGEPAGE | 394 | #ifdef CONFIG_TRANSPARENT_HUGEPAGE |
395 | static inline void ClearPageCompound(struct page *page) | 395 | static inline void ClearPageCompound(struct page *page) |
396 | { | 396 | { |
397 | BUG_ON((page->flags & PG_head_tail_mask) != (1 << PG_compound)); | 397 | BUG_ON((page->flags & PG_head_tail_mask) != (1 << PG_compound)); |
398 | clear_bit(PG_compound, &page->flags); | 398 | clear_bit(PG_compound, &page->flags); |
399 | } | 399 | } |
400 | #endif | 400 | #endif |
401 | 401 | ||
402 | #endif /* !PAGEFLAGS_EXTENDED */ | 402 | #endif /* !PAGEFLAGS_EXTENDED */ |
403 | 403 | ||
404 | #ifdef CONFIG_TRANSPARENT_HUGEPAGE | 404 | #ifdef CONFIG_TRANSPARENT_HUGEPAGE |
405 | /* | 405 | /* |
406 | * PageHuge() only returns true for hugetlbfs pages, but not for | 406 | * PageHuge() only returns true for hugetlbfs pages, but not for |
407 | * normal or transparent huge pages. | 407 | * normal or transparent huge pages. |
408 | * | 408 | * |
409 | * PageTransHuge() returns true for both transparent huge and | 409 | * PageTransHuge() returns true for both transparent huge and |
410 | * hugetlbfs pages, but not normal pages. PageTransHuge() can only be | 410 | * hugetlbfs pages, but not normal pages. PageTransHuge() can only be |
411 | * called only in the core VM paths where hugetlbfs pages can't exist. | 411 | * called only in the core VM paths where hugetlbfs pages can't exist. |
412 | */ | 412 | */ |
413 | static inline int PageTransHuge(struct page *page) | 413 | static inline int PageTransHuge(struct page *page) |
414 | { | 414 | { |
415 | VM_BUG_ON_PAGE(PageTail(page), page); | 415 | VM_BUG_ON_PAGE(PageTail(page), page); |
416 | return PageHead(page); | 416 | return PageHead(page); |
417 | } | 417 | } |
418 | 418 | ||
419 | /* | 419 | /* |
420 | * PageTransCompound returns true for both transparent huge pages | 420 | * PageTransCompound returns true for both transparent huge pages |
421 | * and hugetlbfs pages, so it should only be called when it's known | 421 | * and hugetlbfs pages, so it should only be called when it's known |
422 | * that hugetlbfs pages aren't involved. | 422 | * that hugetlbfs pages aren't involved. |
423 | */ | 423 | */ |
424 | static inline int PageTransCompound(struct page *page) | 424 | static inline int PageTransCompound(struct page *page) |
425 | { | 425 | { |
426 | return PageCompound(page); | 426 | return PageCompound(page); |
427 | } | 427 | } |
428 | 428 | ||
429 | /* | 429 | /* |
430 | * PageTransTail returns true for both transparent huge pages | 430 | * PageTransTail returns true for both transparent huge pages |
431 | * and hugetlbfs pages, so it should only be called when it's known | 431 | * and hugetlbfs pages, so it should only be called when it's known |
432 | * that hugetlbfs pages aren't involved. | 432 | * that hugetlbfs pages aren't involved. |
433 | */ | 433 | */ |
434 | static inline int PageTransTail(struct page *page) | 434 | static inline int PageTransTail(struct page *page) |
435 | { | 435 | { |
436 | return PageTail(page); | 436 | return PageTail(page); |
437 | } | 437 | } |
438 | 438 | ||
439 | #else | 439 | #else |
440 | 440 | ||
441 | static inline int PageTransHuge(struct page *page) | 441 | static inline int PageTransHuge(struct page *page) |
442 | { | 442 | { |
443 | return 0; | 443 | return 0; |
444 | } | 444 | } |
445 | 445 | ||
446 | static inline int PageTransCompound(struct page *page) | 446 | static inline int PageTransCompound(struct page *page) |
447 | { | 447 | { |
448 | return 0; | 448 | return 0; |
449 | } | 449 | } |
450 | 450 | ||
451 | static inline int PageTransTail(struct page *page) | 451 | static inline int PageTransTail(struct page *page) |
452 | { | 452 | { |
453 | return 0; | 453 | return 0; |
454 | } | 454 | } |
455 | #endif | 455 | #endif |
456 | 456 | ||
457 | /* | 457 | /* |
458 | * If network-based swap is enabled, sl*b must keep track of whether pages | 458 | * If network-based swap is enabled, sl*b must keep track of whether pages |
459 | * were allocated from pfmemalloc reserves. | 459 | * were allocated from pfmemalloc reserves. |
460 | */ | 460 | */ |
461 | static inline int PageSlabPfmemalloc(struct page *page) | 461 | static inline int PageSlabPfmemalloc(struct page *page) |
462 | { | 462 | { |
463 | VM_BUG_ON_PAGE(!PageSlab(page), page); | 463 | VM_BUG_ON_PAGE(!PageSlab(page), page); |
464 | return PageActive(page); | 464 | return PageActive(page); |
465 | } | 465 | } |
466 | 466 | ||
467 | static inline void SetPageSlabPfmemalloc(struct page *page) | 467 | static inline void SetPageSlabPfmemalloc(struct page *page) |
468 | { | 468 | { |
469 | VM_BUG_ON_PAGE(!PageSlab(page), page); | 469 | VM_BUG_ON_PAGE(!PageSlab(page), page); |
470 | SetPageActive(page); | 470 | SetPageActive(page); |
471 | } | 471 | } |
472 | 472 | ||
473 | static inline void __ClearPageSlabPfmemalloc(struct page *page) | 473 | static inline void __ClearPageSlabPfmemalloc(struct page *page) |
474 | { | 474 | { |
475 | VM_BUG_ON_PAGE(!PageSlab(page), page); | 475 | VM_BUG_ON_PAGE(!PageSlab(page), page); |
476 | __ClearPageActive(page); | 476 | __ClearPageActive(page); |
477 | } | 477 | } |
478 | 478 | ||
479 | static inline void ClearPageSlabPfmemalloc(struct page *page) | 479 | static inline void ClearPageSlabPfmemalloc(struct page *page) |
480 | { | 480 | { |
481 | VM_BUG_ON_PAGE(!PageSlab(page), page); | 481 | VM_BUG_ON_PAGE(!PageSlab(page), page); |
482 | ClearPageActive(page); | 482 | ClearPageActive(page); |
483 | } | 483 | } |
484 | 484 | ||
485 | #ifdef CONFIG_MMU | 485 | #ifdef CONFIG_MMU |
486 | #define __PG_MLOCKED (1 << PG_mlocked) | 486 | #define __PG_MLOCKED (1 << PG_mlocked) |
487 | #else | 487 | #else |
488 | #define __PG_MLOCKED 0 | 488 | #define __PG_MLOCKED 0 |
489 | #endif | 489 | #endif |
490 | 490 | ||
491 | #ifdef CONFIG_TRANSPARENT_HUGEPAGE | 491 | #ifdef CONFIG_TRANSPARENT_HUGEPAGE |
492 | #define __PG_COMPOUND_LOCK (1 << PG_compound_lock) | 492 | #define __PG_COMPOUND_LOCK (1 << PG_compound_lock) |
493 | #else | 493 | #else |
494 | #define __PG_COMPOUND_LOCK 0 | 494 | #define __PG_COMPOUND_LOCK 0 |
495 | #endif | 495 | #endif |
496 | 496 | ||
497 | /* | 497 | /* |
498 | * Flags checked when a page is freed. Pages being freed should not have | 498 | * Flags checked when a page is freed. Pages being freed should not have |
499 | * these flags set. It they are, there is a problem. | 499 | * these flags set. It they are, there is a problem. |
500 | */ | 500 | */ |
501 | #define PAGE_FLAGS_CHECK_AT_FREE \ | 501 | #define PAGE_FLAGS_CHECK_AT_FREE \ |
502 | (1 << PG_lru | 1 << PG_locked | \ | 502 | (1 << PG_lru | 1 << PG_locked | \ |
503 | 1 << PG_private | 1 << PG_private_2 | \ | 503 | 1 << PG_private | 1 << PG_private_2 | \ |
504 | 1 << PG_writeback | 1 << PG_reserved | \ | 504 | 1 << PG_writeback | 1 << PG_reserved | \ |
505 | 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \ | 505 | 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \ |
506 | 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \ | 506 | 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \ |
507 | __PG_COMPOUND_LOCK) | 507 | __PG_COMPOUND_LOCK) |
508 | 508 | ||
509 | /* | 509 | /* |
510 | * Flags checked when a page is prepped for return by the page allocator. | 510 | * Flags checked when a page is prepped for return by the page allocator. |
511 | * Pages being prepped should not have any flags set. It they are set, | 511 | * Pages being prepped should not have any flags set. It they are set, |
512 | * there has been a kernel bug or struct page corruption. | 512 | * there has been a kernel bug or struct page corruption. |
513 | */ | 513 | */ |
514 | #define PAGE_FLAGS_CHECK_AT_PREP ((1 << NR_PAGEFLAGS) - 1) | 514 | #define PAGE_FLAGS_CHECK_AT_PREP ((1 << NR_PAGEFLAGS) - 1) |
515 | 515 | ||
516 | #define PAGE_FLAGS_PRIVATE \ | 516 | #define PAGE_FLAGS_PRIVATE \ |
517 | (1 << PG_private | 1 << PG_private_2) | 517 | (1 << PG_private | 1 << PG_private_2) |
518 | /** | 518 | /** |
519 | * page_has_private - Determine if page has private stuff | 519 | * page_has_private - Determine if page has private stuff |
520 | * @page: The page to be checked | 520 | * @page: The page to be checked |
521 | * | 521 | * |
522 | * Determine if a page has private stuff, indicating that release routines | 522 | * Determine if a page has private stuff, indicating that release routines |
523 | * should be invoked upon it. | 523 | * should be invoked upon it. |
524 | */ | 524 | */ |
525 | static inline int page_has_private(struct page *page) | 525 | static inline int page_has_private(struct page *page) |
526 | { | 526 | { |
527 | return !!(page->flags & PAGE_FLAGS_PRIVATE); | 527 | return !!(page->flags & PAGE_FLAGS_PRIVATE); |
528 | } | 528 | } |
529 | 529 | ||
530 | #endif /* !__GENERATING_BOUNDS_H */ | 530 | #endif /* !__GENERATING_BOUNDS_H */ |
531 | 531 | ||
532 | #endif /* PAGE_FLAGS_H */ | 532 | #endif /* PAGE_FLAGS_H */ |
533 | 533 |
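Worth noting before the second file: the only functional change in
page-flags.h is at line 233. PG_readahead has no bit of its own - it
aliases PG_reclaim, which is safe because a page being read in is never
simultaneously under reclaim writeback - and the added
TESTCLEARFLAG(Readahead, reclaim) generates TestClearPageReadahead(), so
the marker can be consumed exactly once per hit. A hedged sketch of the
usage pattern follows; only the Set/TestClear accessors come from this
header, the two wrapper functions are invented for illustration and are
not code from this commit.

/* When submitting a readahead window, tag every page except the one
 * actually faulted on. */
static void swap_mark_readahead(struct page *page, bool is_fault_target)
{
        if (!is_fault_target)
                SetPageReadahead(page); /* sets the shared PG_reclaim bit */
}

/* On a later fault that finds the page in swap cache: if the tag is
 * still set, readahead guessed right, so count a hit - and clear the
 * tag so the same page is only counted once. */
static void swap_note_readahead_hit(struct page *page)
{
        if (TestClearPageReadahead(page))
                atomic_inc(&swapin_readahead_hits);
}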
mm/swap_state.c
1 | /* | 1 | /* |
2 | * linux/mm/swap_state.c | 2 | * linux/mm/swap_state.c |
3 | * | 3 | * |
4 | * Copyright (C) 1991, 1992, 1993, 1994 Linus Torvalds | 4 | * Copyright (C) 1991, 1992, 1993, 1994 Linus Torvalds |
5 | * Swap reorganised 29.12.95, Stephen Tweedie | 5 | * Swap reorganised 29.12.95, Stephen Tweedie |
6 | * | 6 | * |
7 | * Rewritten to use page cache, (C) 1998 Stephen Tweedie | 7 | * Rewritten to use page cache, (C) 1998 Stephen Tweedie |
8 | */ | 8 | */ |
9 | #include <linux/mm.h> | 9 | #include <linux/mm.h> |
10 | #include <linux/gfp.h> | 10 | #include <linux/gfp.h> |
11 | #include <linux/kernel_stat.h> | 11 | #include <linux/kernel_stat.h> |
12 | #include <linux/swap.h> | 12 | #include <linux/swap.h> |
13 | #include <linux/swapops.h> | 13 | #include <linux/swapops.h> |
14 | #include <linux/init.h> | 14 | #include <linux/init.h> |
15 | #include <linux/pagemap.h> | 15 | #include <linux/pagemap.h> |
16 | #include <linux/backing-dev.h> | 16 | #include <linux/backing-dev.h> |
17 | #include <linux/blkdev.h> | 17 | #include <linux/blkdev.h> |
18 | #include <linux/pagevec.h> | 18 | #include <linux/pagevec.h> |
19 | #include <linux/migrate.h> | 19 | #include <linux/migrate.h> |
20 | #include <linux/page_cgroup.h> | 20 | #include <linux/page_cgroup.h> |
21 | 21 | ||
22 | #include <asm/pgtable.h> | 22 | #include <asm/pgtable.h> |
23 | 23 | ||
24 | /* | 24 | /* |
25 | * swapper_space is a fiction, retained to simplify the path through | 25 | * swapper_space is a fiction, retained to simplify the path through |
26 | * vmscan's shrink_page_list. | 26 | * vmscan's shrink_page_list. |
27 | */ | 27 | */ |
28 | static const struct address_space_operations swap_aops = { | 28 | static const struct address_space_operations swap_aops = { |
29 | .writepage = swap_writepage, | 29 | .writepage = swap_writepage, |
30 | .set_page_dirty = swap_set_page_dirty, | 30 | .set_page_dirty = swap_set_page_dirty, |
31 | .migratepage = migrate_page, | 31 | .migratepage = migrate_page, |
32 | }; | 32 | }; |
33 | 33 | ||
34 | static struct backing_dev_info swap_backing_dev_info = { | 34 | static struct backing_dev_info swap_backing_dev_info = { |
35 | .name = "swap", | 35 | .name = "swap", |
36 | .capabilities = BDI_CAP_NO_ACCT_AND_WRITEBACK | BDI_CAP_SWAP_BACKED, | 36 | .capabilities = BDI_CAP_NO_ACCT_AND_WRITEBACK | BDI_CAP_SWAP_BACKED, |
37 | }; | 37 | }; |
38 | 38 | ||
39 | struct address_space swapper_spaces[MAX_SWAPFILES] = { | 39 | struct address_space swapper_spaces[MAX_SWAPFILES] = { |
40 | [0 ... MAX_SWAPFILES - 1] = { | 40 | [0 ... MAX_SWAPFILES - 1] = { |
41 | .page_tree = RADIX_TREE_INIT(GFP_ATOMIC|__GFP_NOWARN), | 41 | .page_tree = RADIX_TREE_INIT(GFP_ATOMIC|__GFP_NOWARN), |
42 | .a_ops = &swap_aops, | 42 | .a_ops = &swap_aops, |
43 | .backing_dev_info = &swap_backing_dev_info, | 43 | .backing_dev_info = &swap_backing_dev_info, |
44 | } | 44 | } |
45 | }; | 45 | }; |
46 | 46 | ||
47 | #define INC_CACHE_INFO(x) do { swap_cache_info.x++; } while (0) | 47 | #define INC_CACHE_INFO(x) do { swap_cache_info.x++; } while (0) |
48 | 48 | ||
49 | static struct { | 49 | static struct { |
50 | unsigned long add_total; | 50 | unsigned long add_total; |
51 | unsigned long del_total; | 51 | unsigned long del_total; |
52 | unsigned long find_success; | 52 | unsigned long find_success; |
53 | unsigned long find_total; | 53 | unsigned long find_total; |
54 | } swap_cache_info; | 54 | } swap_cache_info; |
55 | 55 | ||
56 | unsigned long total_swapcache_pages(void) | 56 | unsigned long total_swapcache_pages(void) |
57 | { | 57 | { |
58 | int i; | 58 | int i; |
59 | unsigned long ret = 0; | 59 | unsigned long ret = 0; |
60 | 60 | ||
61 | for (i = 0; i < MAX_SWAPFILES; i++) | 61 | for (i = 0; i < MAX_SWAPFILES; i++) |
62 | ret += swapper_spaces[i].nrpages; | 62 | ret += swapper_spaces[i].nrpages; |
63 | return ret; | 63 | return ret; |
64 | } | 64 | } |
65 | 65 | ||
| | 66 | static atomic_t swapin_readahead_hits = ATOMIC_INIT(4); |
| | 67 | |
66 | void show_swap_cache_info(void) | 68 | void show_swap_cache_info(void) |
67 | { | 69 | { |
68 | printk("%lu pages in swap cache\n", total_swapcache_pages()); | 70 | printk("%lu pages in swap cache\n", total_swapcache_pages()); |
69 | printk("Swap cache stats: add %lu, delete %lu, find %lu/%lu\n", | 71 | printk("Swap cache stats: add %lu, delete %lu, find %lu/%lu\n", |
70 | swap_cache_info.add_total, swap_cache_info.del_total, | 72 | swap_cache_info.add_total, swap_cache_info.del_total, |
71 | swap_cache_info.find_success, swap_cache_info.find_total); | 73 | swap_cache_info.find_success, swap_cache_info.find_total); |
72 | printk("Free swap = %ldkB\n", | 74 | printk("Free swap = %ldkB\n", |
73 | get_nr_swap_pages() << (PAGE_SHIFT - 10)); | 75 | get_nr_swap_pages() << (PAGE_SHIFT - 10)); |
74 | printk("Total swap = %lukB\n", total_swap_pages << (PAGE_SHIFT - 10)); | 76 | printk("Total swap = %lukB\n", total_swap_pages << (PAGE_SHIFT - 10)); |
75 | } | 77 | } |
76 | 78 | ||
77 | /* | 79 | /* |
78 | * __add_to_swap_cache resembles add_to_page_cache_locked on swapper_space, | 80 | * __add_to_swap_cache resembles add_to_page_cache_locked on swapper_space, |
79 | * but sets SwapCache flag and private instead of mapping and index. | 81 | * but sets SwapCache flag and private instead of mapping and index. |
80 | */ | 82 | */ |
81 | int __add_to_swap_cache(struct page *page, swp_entry_t entry) | 83 | int __add_to_swap_cache(struct page *page, swp_entry_t entry) |
82 | { | 84 | { |
83 | int error; | 85 | int error; |
84 | struct address_space *address_space; | 86 | struct address_space *address_space; |
85 | 87 | ||
86 | VM_BUG_ON_PAGE(!PageLocked(page), page); | 88 | VM_BUG_ON_PAGE(!PageLocked(page), page); |
87 | VM_BUG_ON_PAGE(PageSwapCache(page), page); | 89 | VM_BUG_ON_PAGE(PageSwapCache(page), page); |
88 | VM_BUG_ON_PAGE(!PageSwapBacked(page), page); | 90 | VM_BUG_ON_PAGE(!PageSwapBacked(page), page); |
89 | 91 | ||
90 | page_cache_get(page); | 92 | page_cache_get(page); |
91 | SetPageSwapCache(page); | 93 | SetPageSwapCache(page); |
92 | set_page_private(page, entry.val); | 94 | set_page_private(page, entry.val); |
93 | 95 | ||
94 | address_space = swap_address_space(entry); | 96 | address_space = swap_address_space(entry); |
95 | spin_lock_irq(&address_space->tree_lock); | 97 | spin_lock_irq(&address_space->tree_lock); |
96 | error = radix_tree_insert(&address_space->page_tree, | 98 | error = radix_tree_insert(&address_space->page_tree, |
97 | entry.val, page); | 99 | entry.val, page); |
98 | if (likely(!error)) { | 100 | if (likely(!error)) { |
99 | address_space->nrpages++; | 101 | address_space->nrpages++; |
100 | __inc_zone_page_state(page, NR_FILE_PAGES); | 102 | __inc_zone_page_state(page, NR_FILE_PAGES); |
101 | INC_CACHE_INFO(add_total); | 103 | INC_CACHE_INFO(add_total); |
102 | } | 104 | } |
103 | spin_unlock_irq(&address_space->tree_lock); | 105 | spin_unlock_irq(&address_space->tree_lock); |
104 | 106 | ||
105 | if (unlikely(error)) { | 107 | if (unlikely(error)) { |
106 | /* | 108 | /* |
107 | * Only the context which have set SWAP_HAS_CACHE flag | 109 | * Only the context which have set SWAP_HAS_CACHE flag |
108 | * would call add_to_swap_cache(). | 110 | * would call add_to_swap_cache(). |
109 | * So add_to_swap_cache() doesn't returns -EEXIST. | 111 | * So add_to_swap_cache() doesn't returns -EEXIST. |
110 | */ | 112 | */ |
111 | VM_BUG_ON(error == -EEXIST); | 113 | VM_BUG_ON(error == -EEXIST); |
112 | set_page_private(page, 0UL); | 114 | set_page_private(page, 0UL); |
113 | ClearPageSwapCache(page); | 115 | ClearPageSwapCache(page); |
114 | page_cache_release(page); | 116 | page_cache_release(page); |
115 | } | 117 | } |
116 | 118 | ||
117 | return error; | 119 | return error; |
118 | } | 120 | } |
119 | 121 | ||
120 | 122 | ||
121 | int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp_mask) | 123 | int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp_mask) |
122 | { | 124 | { |
123 | int error; | 125 | int error; |
124 | 126 | ||
125 | error = radix_tree_maybe_preload(gfp_mask); | 127 | error = radix_tree_maybe_preload(gfp_mask); |
126 | if (!error) { | 128 | if (!error) { |
127 | error = __add_to_swap_cache(page, entry); | 129 | error = __add_to_swap_cache(page, entry); |
128 | radix_tree_preload_end(); | 130 | radix_tree_preload_end(); |
129 | } | 131 | } |
130 | return error; | 132 | return error; |
131 | } | 133 | } |
132 | 134 | ||
133 | /* | 135 | /* |
134 | * This must be called only on pages that have | 136 | * This must be called only on pages that have |
135 | * been verified to be in the swap cache. | 137 | * been verified to be in the swap cache. |
136 | */ | 138 | */ |
137 | void __delete_from_swap_cache(struct page *page) | 139 | void __delete_from_swap_cache(struct page *page) |
138 | { | 140 | { |
139 | swp_entry_t entry; | 141 | swp_entry_t entry; |
140 | struct address_space *address_space; | 142 | struct address_space *address_space; |
141 | 143 | ||
142 | VM_BUG_ON_PAGE(!PageLocked(page), page); | 144 | VM_BUG_ON_PAGE(!PageLocked(page), page); |
143 | VM_BUG_ON_PAGE(!PageSwapCache(page), page); | 145 | VM_BUG_ON_PAGE(!PageSwapCache(page), page); |
144 | VM_BUG_ON_PAGE(PageWriteback(page), page); | 146 | VM_BUG_ON_PAGE(PageWriteback(page), page); |
145 | 147 | ||
146 | entry.val = page_private(page); | 148 | entry.val = page_private(page); |
@@ -147,294 +149,349 @@
 	address_space = swap_address_space(entry);
 	radix_tree_delete(&address_space->page_tree, page_private(page));
 	set_page_private(page, 0);
 	ClearPageSwapCache(page);
 	address_space->nrpages--;
 	__dec_zone_page_state(page, NR_FILE_PAGES);
 	INC_CACHE_INFO(del_total);
 }
 
 /**
  * add_to_swap - allocate swap space for a page
  * @page: page we want to move to swap
  *
  * Allocate swap space for the page and add the page to the
  * swap cache. Caller needs to hold the page lock.
  */
 int add_to_swap(struct page *page, struct list_head *list)
 {
 	swp_entry_t entry;
 	int err;
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(!PageUptodate(page), page);
 
 	entry = get_swap_page();
 	if (!entry.val)
 		return 0;
 
 	if (unlikely(PageTransHuge(page)))
 		if (unlikely(split_huge_page_to_list(page, list))) {
 			swapcache_free(entry, NULL);
 			return 0;
 		}
 
 	/*
 	 * Radix-tree node allocations from PF_MEMALLOC contexts could
 	 * completely exhaust the page allocator. __GFP_NOMEMALLOC
 	 * stops emergency reserves from being allocated.
 	 *
 	 * TODO: this could cause a theoretical memory reclaim
 	 * deadlock in the swap out path.
 	 */
 	/*
 	 * Add it to the swap cache and mark it dirty
 	 */
 	err = add_to_swap_cache(page, entry,
 			__GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN);
 
 	if (!err) {	/* Success */
 		SetPageDirty(page);
 		return 1;
 	} else {	/* -ENOMEM radix-tree allocation failure */
 		/*
 		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
 		 * clear SWAP_HAS_CACHE flag.
 		 */
 		swapcache_free(entry, NULL);
 		return 0;
 	}
 }
 
 /*
  * This must be called only on pages that have
  * been verified to be in the swap cache and locked.
  * It will never put the page into the free list,
  * the caller has a reference on the page.
  */
 void delete_from_swap_cache(struct page *page)
 {
 	swp_entry_t entry;
 	struct address_space *address_space;
 
 	entry.val = page_private(page);
 
 	address_space = swap_address_space(entry);
 	spin_lock_irq(&address_space->tree_lock);
 	__delete_from_swap_cache(page);
 	spin_unlock_irq(&address_space->tree_lock);
 
 	swapcache_free(entry, page);
 	page_cache_release(page);
 }
 
 /*
  * If we are the only user, then try to free up the swap cache.
  *
  * Its ok to check for PageSwapCache without the page lock
  * here because we are going to recheck again inside
  * try_to_free_swap() _with_ the lock.
  * 				- Marcelo
  */
 static inline void free_swap_cache(struct page *page)
 {
 	if (PageSwapCache(page) && !page_mapped(page) && trylock_page(page)) {
 		try_to_free_swap(page);
 		unlock_page(page);
 	}
 }
 
 /*
  * Perform a free_page(), also freeing any swap cache associated with
  * this page if it is the last user of the page.
  */
 void free_page_and_swap_cache(struct page *page)
 {
 	free_swap_cache(page);
 	page_cache_release(page);
 }
 
 /*
  * Passed an array of pages, drop them all from swapcache and then release
  * them. They are removed from the LRU and freed if this is their last use.
  */
 void free_pages_and_swap_cache(struct page **pages, int nr)
 {
 	struct page **pagep = pages;
 
 	lru_add_drain();
 	while (nr) {
 		int todo = min(nr, PAGEVEC_SIZE);
 		int i;
 
 		for (i = 0; i < todo; i++)
 			free_swap_cache(pagep[i]);
 		release_pages(pagep, todo, 0);
 		pagep += todo;
 		nr -= todo;
 	}
 }
 
 /*
  * Lookup a swap entry in the swap cache. A found page will be returned
  * unlocked and with its refcount incremented - we rely on the kernel
  * lock getting page table operations atomic even if we drop the page
  * lock before returning.
  */
 struct page * lookup_swap_cache(swp_entry_t entry)
 {
 	struct page *page;
 
 	page = find_get_page(swap_address_space(entry), entry.val);
 
-	if (page)
+	if (page) {
 		INC_CACHE_INFO(find_success);
+		if (TestClearPageReadahead(page))
+			atomic_inc(&swapin_readahead_hits);
+	}
 
 	INC_CACHE_INFO(find_total);
 	return page;
 }
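
This hunk is the detector's input side. Pages brought in speculatively are tagged with PG_readahead by swapin_readahead() below; the first time such a page is actually looked up, the flag is test-and-cleared and swapin_readahead_hits is bumped, so the counter records how many pages of the last window were genuinely wanted, each at most once. A minimal userspace model of that accounting, with plain variables standing in for the atomic counter and the page flag (all names here are illustrative, not kernel API):

    #include <stdbool.h>
    #include <stdio.h>

    static int swapin_readahead_hits_model;  /* stands in for the atomic_t */

    struct fake_page {
        bool readahead;                      /* stands in for PG_readahead */
    };

    /* Models TestClearPageReadahead(): return the old flag, clear it. */
    static bool test_clear_readahead(struct fake_page *page)
    {
        bool was_set = page->readahead;
        page->readahead = false;
        return was_set;
    }

    int main(void)
    {
        struct fake_page pages[4] = {
            { .readahead = true }, { .readahead = true },
            { .readahead = false }, { .readahead = true },
        };

        /* "Fault" each page twice: only readahead pages count, and only
         * on first use, because the flag is cleared by the test. */
        for (int pass = 0; pass < 2; pass++)
            for (int i = 0; i < 4; i++)
                if (test_clear_readahead(&pages[i]))
                    swapin_readahead_hits_model++;

        printf("hits = %d\n", swapin_readahead_hits_model);  /* prints 3 */
        return 0;
    }

Because the flag is cleared on first use, the hit count between two faults is naturally bounded by the size of the previous window.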
 
 /*
  * Locate a page of swap in physical memory, reserving swap cache space
  * and reading the disk if it is not already cached.
  * A failure return means that either the page allocation failed or that
  * the swap entry is no longer in use.
  */
 struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 			struct vm_area_struct *vma, unsigned long addr)
 {
 	struct page *found_page, *new_page = NULL;
 	int err;
 
 	do {
 		/*
 		 * First check the swap cache. Since this is normally
 		 * called after lookup_swap_cache() failed, re-calling
 		 * that would confuse statistics.
 		 */
 		found_page = find_get_page(swap_address_space(entry),
 					entry.val);
 		if (found_page)
 			break;
 
 		/*
 		 * Get a new page to read into from swap.
 		 */
 		if (!new_page) {
 			new_page = alloc_page_vma(gfp_mask, vma, addr);
 			if (!new_page)
 				break;		/* Out of memory */
 		}
 
 		/*
 		 * call radix_tree_preload() while we can wait.
 		 */
 		err = radix_tree_maybe_preload(gfp_mask & GFP_KERNEL);
 		if (err)
 			break;
 
 		/*
 		 * Swap entry may have been freed since our caller observed it.
 		 */
 		err = swapcache_prepare(entry);
 		if (err == -EEXIST) {
 			radix_tree_preload_end();
 			/*
 			 * We might race against get_swap_page() and stumble
 			 * across a SWAP_HAS_CACHE swap_map entry whose page
 			 * has not been brought into the swapcache yet, while
 			 * the other end is scheduled away waiting on discard
 			 * I/O completion at scan_swap_map().
 			 *
 			 * In order to avoid turning this transitory state
 			 * into a permanent loop around this -EEXIST case
 			 * if !CONFIG_PREEMPT and the I/O completion happens
 			 * to be waiting on the CPU waitqueue where we are now
 			 * busy looping, we just conditionally invoke the
 			 * scheduler here, if there are some more important
 			 * tasks to run.
 			 */
 			cond_resched();
 			continue;
 		}
 		if (err) {		/* swp entry is obsolete ? */
 			radix_tree_preload_end();
 			break;
 		}
 
 		/* May fail (-ENOMEM) if radix-tree node allocation failed. */
 		__set_page_locked(new_page);
 		SetPageSwapBacked(new_page);
 		err = __add_to_swap_cache(new_page, entry);
 		if (likely(!err)) {
 			radix_tree_preload_end();
 			/*
 			 * Initiate read into locked page and return.
 			 */
 			lru_cache_add_anon(new_page);
 			swap_readpage(new_page);
 			return new_page;
 		}
 		radix_tree_preload_end();
 		ClearPageSwapBacked(new_page);
 		__clear_page_locked(new_page);
 		/*
 		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
 		 * clear SWAP_HAS_CACHE flag.
 		 */
 		swapcache_free(entry, NULL);
 	} while (err != -ENOMEM);
 
 	if (new_page)
 		page_cache_release(new_page);
 	return found_page;
 }
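
read_swap_cache_async() is unchanged here, but its -EEXIST handling matters for the patch because swapin_readahead() calls it once per page in the window: another task may have claimed the swap slot (SWAP_HAS_CACHE set) without yet inserting its page, and the only safe response is to yield and look again. A userspace schematic of that retry pattern, with a fake claim function standing in for swapcache_prepare() and sched_yield() playing the role of cond_resched() (hypothetical names, illustration only):

    #include <errno.h>
    #include <sched.h>
    #include <stdio.h>

    /* Stand-in for swapcache_prepare(): returns -EEXIST while a
     * (simulated) other task still holds the slot, then succeeds. */
    static int other_task_ticks = 3;

    static int try_claim(void)
    {
        if (other_task_ticks > 0) {
            other_task_ticks--;
            return -EEXIST;     /* transiently claimed elsewhere */
        }
        return 0;               /* claim succeeded */
    }

    int main(void)
    {
        int err;

        do {
            err = try_claim();
            if (err == -EEXIST)
                sched_yield();  /* yield rather than busy-spin while
                                 * the claimant waits for this CPU */
        } while (err == -EEXIST);

        printf("claimed after yielding, err = %d\n", err);
        return 0;
    }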
 
+static unsigned long swapin_nr_pages(unsigned long offset)
+{
+	static unsigned long prev_offset;
+	unsigned int pages, max_pages, last_ra;
+	static atomic_t last_readahead_pages;
+
+	max_pages = 1 << ACCESS_ONCE(page_cluster);
+	if (max_pages <= 1)
+		return 1;
+
+	/*
+	 * This heuristic has been found to work well on both sequential and
+	 * random loads, swapping to hard disk or to SSD: please don't ask
+	 * what the "+ 2" means, it just happens to work well, that's all.
+	 */
+	pages = atomic_xchg(&swapin_readahead_hits, 0) + 2;
+	if (pages == 2) {
+		/*
+		 * We can have no readahead hits to judge by: but must not get
+		 * stuck here forever, so check for an adjacent offset instead
+		 * (and don't even bother to check whether swap type is same).
+		 */
+		if (offset != prev_offset + 1 && offset != prev_offset - 1)
+			pages = 1;
+		prev_offset = offset;
+	} else {
+		unsigned int roundup = 4;
+		while (roundup < pages)
+			roundup <<= 1;
+		pages = roundup;
+	}
+
+	if (pages > max_pages)
+		pages = max_pages;
+
+	/* Don't shrink readahead too fast */
+	last_ra = atomic_read(&last_readahead_pages) / 2;
+	if (pages < last_ra)
+		pages = last_ra;
+	atomic_set(&last_readahead_pages, pages);
+
+	return pages;
+}
+
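swapin_nr_pages() is the output side of the detector: each fault converts the accumulated hit count into the next window size, rounds it up to a power of two between 1 and 1 << page_cluster, and (Shaohua's refinement) never lets the window shrink by more than half per step. The simulation below reproduces the same arithmetic in userspace, assuming page_cluster == 3 and replacing the atomics with plain variables; the hit count is passed in per call rather than read from the global counter. Feeding it two fully-hit windows followed by a run of non-adjacent misses prints windows of 8, 8, 4, 2, 1: a gradual decay instead of the old 8-to-1 collapse.

    #include <stdio.h>

    static unsigned int last_readahead_pages;  /* models the static atomic_t */

    static unsigned long nr_pages(unsigned int hits, unsigned long offset,
                                  unsigned long prev_offset)
    {
        unsigned int pages, last_ra;
        unsigned int max_pages = 1 << 3;       /* page_cluster == 3 */

        pages = hits + 2;
        if (pages == 2) {
            /* no hits to judge by: read one page unless this fault
             * is adjacent to the previous one */
            if (offset != prev_offset + 1 && offset != prev_offset - 1)
                pages = 1;
        } else {
            /* round up to the next power of two, minimum 4 */
            unsigned int roundup = 4;
            while (roundup < pages)
                roundup <<= 1;
            pages = roundup;
        }

        if (pages > max_pages)
            pages = max_pages;

        /* don't shrink the window faster than halving it */
        last_ra = last_readahead_pages / 2;
        if (pages < last_ra)
            pages = last_ra;
        last_readahead_pages = pages;
        return pages;
    }

    int main(void)
    {
        /* hits seen since the previous fault; offsets spaced so the
         * no-hit faults are never adjacent (a purely random load) */
        unsigned int hits[] = { 6, 6, 0, 0, 0 };
        unsigned long offs[] = { 100, 300, 500, 700, 900 };

        for (int i = 0; i < 5; i++)
            printf("fault %d: window = %lu pages\n", i,
                   nr_pages(hits[i], offs[i], i ? offs[i - 1] : 0));
        return 0;       /* prints windows 8, 8, 4, 2, 1 */
    }
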
 /**
  * swapin_readahead - swap in pages in hope we need them soon
  * @entry: swap entry of this memory
  * @gfp_mask: memory allocation flags
  * @vma: user vma this address belongs to
  * @addr: target address for mempolicy
  *
  * Returns the struct page for entry and addr, after queueing swapin.
  *
  * Primitive swap readahead code. We simply read an aligned block of
  * (1 << page_cluster) entries in the swap area. This method is chosen
  * because it doesn't cost us any seek time. We also make sure to queue
  * the 'original' request together with the readahead ones...
  *
  * This has been extended to use the NUMA policies from the mm triggering
  * the readahead.
  *
  * Caller must hold down_read on the vma->vm_mm if vma is not NULL.
  */
 struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 			struct vm_area_struct *vma, unsigned long addr)
 {
 	struct page *page;
-	unsigned long offset = swp_offset(entry);
+	unsigned long entry_offset = swp_offset(entry);
+	unsigned long offset = entry_offset;
 	unsigned long start_offset, end_offset;
-	unsigned long mask = (1UL << page_cluster) - 1;
+	unsigned long mask;
 	struct blk_plug plug;
 
+	mask = swapin_nr_pages(offset) - 1;
+	if (!mask)
+		goto skip;
+
 	/* Read a page_cluster sized and aligned cluster around offset. */
 	start_offset = offset & ~mask;
 	end_offset = offset | mask;
 	if (!start_offset)	/* First page is swap header. */
 		start_offset++;
 
 	blk_start_plug(&plug);
 	for (offset = start_offset; offset <= end_offset ; offset++) {
 		/* Ok, do the async read-ahead now */
 		page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
 						gfp_mask, vma, addr);
 		if (!page)
 			continue;
+		if (offset != entry_offset)
+			SetPageReadahead(page);
 		page_cache_release(page);
 	}
 	blk_finish_plug(&plug);
 
 	lru_add_drain();	/* Push any new pages onto the LRU now */
+skip:
 	return read_swap_cache_async(entry, gfp_mask, vma, addr);
 }
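
Note that the window computed by swapin_nr_pages() is applied as an aligned block containing the faulting offset, not as that many pages starting at it: with mask = pages - 1, "offset & ~mask" rounds down to the block start and "offset | mask" rounds up to its end, and a one-page window (mask == 0) skips the loop entirely via the new skip: label. A tiny demonstration of the mask arithmetic, assuming an 8-page window:

    #include <stdio.h>

    int main(void)
    {
        unsigned long offset = 21;              /* faulting swap offset */
        unsigned long mask = 8 - 1;             /* 8-page window */
        unsigned long start = offset & ~mask;   /* round down: 16 */
        unsigned long end = offset | mask;      /* round up: 23 */

        if (!start)     /* offset 0 holds the swap header: skip it */
            start++;

        printf("read offsets %lu..%lu around fault at %lu\n",
               start, end, offset);             /* 16..23 around 21 */
        return 0;
    }

Readahead pages other than the target are tagged with SetPageReadahead(), which is what closes the feedback loop back in lookup_swap_cache().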