Commit a41f24ea9fd6169b147c53c2392e2887cc1d9247

Authored by Nishanth Aravamudan
Committed by Linus Torvalds
1 parent ab857d0938

page allocator: smarter retry of costly-order allocations

Because of page order checks in __alloc_pages(), hugepage (and similarly
large order) allocations will not retry unless explicitly marked
__GFP_REPEAT. However, the current retry logic is nearly an infinite
loop (or until reclaim makes no progress whatsoever). For these costly
allocations, that seems like overkill and could potentially never
terminate. Mel observed that allowing the current __GFP_REPEAT semantics for
hugepage allocations essentially killed the system. I believe this is
because we may keep reclaiming small orders of pages all over memory, but
never have enough to satisfy the hugepage allocation request. This is
clearly only a problem for large order allocations, of which hugepages
are the most obvious (to me).

Modify try_to_free_pages() to indicate how many pages were reclaimed.
Use that information in __alloc_pages() to eventually fail a large
__GFP_REPEAT allocation once we've reclaimed at least as many pages as the
allocation requests (1<<order). This relies on lumpy reclaim
functioning as advertised. Due to fragmentation, lumpy reclaim may not
be able to free up the order needed in one invocation, so multiple
iterations may be required. In other words, the more fragmented memory
is, the more retry attempts __GFP_REPEAT will make (particularly for
higher order allocations).
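
The resulting retry decision reduces to a small predicate. The following is a
minimal, self-contained sketch of that logic in userspace C; the flag values,
the PAGE_ALLOC_COSTLY_ORDER value, and the name should_retry_alloc() are
illustrative assumptions used to model the behavior described above, not the
literal kernel hunk:

#include <stdbool.h>

/* Placeholder flag values and threshold for the sketch; the real
 * definitions live in include/linux/gfp.h and mm/internal.h. */
#define __GFP_NORETRY           0x1000u
#define __GFP_REPEAT            0x0400u
#define __GFP_NOFAIL            0x0800u
#define PAGE_ALLOC_COSTLY_ORDER 3

/*
 * Model of the retry decision:
 *  - orders <= PAGE_ALLOC_COSTLY_ORDER keep retrying as before;
 *  - costly __GFP_REPEAT orders retry only while the cumulative number of
 *    reclaimed pages is still below the size of the request (1 << order);
 *  - __GFP_NOFAIL always retries, __GFP_NORETRY never does.
 */
static bool should_retry_alloc(unsigned int gfp_mask, unsigned int order,
                               unsigned long pages_reclaimed)
{
        if (gfp_mask & __GFP_NORETRY)
                return false;
        if (order <= PAGE_ALLOC_COSTLY_ORDER)
                return true;
        if ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1UL << order))
                return true;
        if (gfp_mask & __GFP_NOFAIL)
                return true;
        return false;
}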

This changes the semantics of __GFP_REPEAT subtly, but *only* for
allocations > PAGE_ALLOC_COSTLY_ORDER. With this patch, for allocations of
that size, we will retry only until at least 1<<order pages have been
reclaimed, rather than forever (which remains the behavior for allocations
<= PAGE_ALLOC_COSTLY_ORDER).
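
As a concrete illustration, building on the sketch above and assuming a
hugepage is order 9 (512 base pages), the predicate asks for another reclaim
pass only until 512 pages have been reclaimed in total:

#include <stdio.h>

/* Hypothetical driver loop for the should_retry_alloc() sketch above.
 * Each simulated reclaim pass pretends to free 64 pages. */
int main(void)
{
        unsigned long reclaimed = 0;
        int passes = 0;

        while (should_retry_alloc(__GFP_REPEAT, 9, reclaimed)) {
                reclaimed += 64;
                passes++;
        }
        /* Prints: gave up after 8 passes, 512 pages reclaimed */
        printf("gave up after %d passes, %lu pages reclaimed\n",
               passes, reclaimed);
        return 0;
}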

Combined with a follow-on patch that makes hugepage pool allocations use
__GFP_REPEAT, this change improves the /proc/sys/vm/nr_hugepages interface.
Rather than administrators repeatedly echoing a particular value into the
sysctl and forcing reclaim into action manually, the sysctl can now make a
reasonable effort itself. Similarly, dynamic pool growth should be more
successful under load, as lumpy reclaim can try to free up pages rather than
failing right away.

Choosing to reclaim only up to the order of the requested allocation
strikes a balance between not failing hugepage allocations and returning
to the caller when it's unlikely to ever succeed. Because of lumpy
reclaim, if we have freed the order requested, hopefully it has been in
big chunks and those chunks will allow our allocation to succeed. If
that isn't the case after freeing up the current order, I don't think it
is likely to succeed in the future, although it is possible given a
particular fragmentation pattern.

Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Tested-by: Mel Gorman <mel@csn.ul.ie>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Showing 2 changed files with 22 additions and 7 deletions

mm/page_alloc.c

/*
 * linux/mm/page_alloc.c
 *
 * Manages the free list, the system allocates free pages here.
 * Note that kmalloc() lives in slab.c
 *
 * Copyright (C) 1991, 1992, 1993, 1994 Linus Torvalds
 * Swap reorganised 29.12.95, Stephen Tweedie
 * Support of BIGMEM added by Gerhard Wichert, Siemens AG, July 1999
 * Reshaped it to be a zoned allocator, Ingo Molnar, Red Hat, 1999
 * Discontiguous memory support, Kanoj Sarcar, SGI, Nov 1999
 * Zone balancing, Kanoj Sarcar, SGI, Jan 2000
 * Per cpu hot/cold page lists, bulk allocation, Martin J. Bligh, Sept 2002
 * (lots of bits borrowed from Ingo Molnar & Andrew Morton)
 */

#include <linux/stddef.h>
#include <linux/mm.h>
#include <linux/swap.h>
#include <linux/interrupt.h>
#include <linux/pagemap.h>
#include <linux/jiffies.h>
#include <linux/bootmem.h>
#include <linux/compiler.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/suspend.h>
#include <linux/pagevec.h>
#include <linux/blkdev.h>
#include <linux/slab.h>
#include <linux/oom.h>
#include <linux/notifier.h>
#include <linux/topology.h>
#include <linux/sysctl.h>
#include <linux/cpu.h>
#include <linux/cpuset.h>
#include <linux/memory_hotplug.h>
#include <linux/nodemask.h>
#include <linux/vmalloc.h>
#include <linux/mempolicy.h>
#include <linux/stop_machine.h>
#include <linux/sort.h>
#include <linux/pfn.h>
#include <linux/backing-dev.h>
#include <linux/fault-inject.h>
#include <linux/page-isolation.h>
#include <linux/memcontrol.h>

#include <asm/tlbflush.h>
#include <asm/div64.h>
#include "internal.h"

/*
 * Array of node states.
 */
nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
	[N_POSSIBLE] = NODE_MASK_ALL,
	[N_ONLINE] = { { [0] = 1UL } },
#ifndef CONFIG_NUMA
	[N_NORMAL_MEMORY] = { { [0] = 1UL } },
#ifdef CONFIG_HIGHMEM
	[N_HIGH_MEMORY] = { { [0] = 1UL } },
#endif
	[N_CPU] = { { [0] = 1UL } },
#endif /* NUMA */
};
EXPORT_SYMBOL(node_states);

unsigned long totalram_pages __read_mostly;
unsigned long totalreserve_pages __read_mostly;
long nr_swap_pages;
int percpu_pagelist_fraction;

#ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
int pageblock_order __read_mostly;
#endif

static void __free_pages_ok(struct page *page, unsigned int order);

/*
 * results with 256, 32 in the lowmem_reserve sysctl:
 * 1G machine -> (16M dma, 800M-16M normal, 1G-800M high)
 * 1G machine -> (16M dma, 784M normal, 224M high)
 * NORMAL allocation will leave 784M/256 of ram reserved in the ZONE_DMA
 * HIGHMEM allocation will leave 224M/32 of ram reserved in ZONE_NORMAL
 * HIGHMEM allocation will (224M+784M)/256 of ram reserved in ZONE_DMA
 *
 * TBD: should special case ZONE_DMA32 machines here - in those we normally
 * don't need any ZONE_NORMAL reservation
 */
int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1] = {
#ifdef CONFIG_ZONE_DMA
	256,
#endif
#ifdef CONFIG_ZONE_DMA32
	256,
#endif
#ifdef CONFIG_HIGHMEM
	32,
#endif
	32,
};

EXPORT_SYMBOL(totalram_pages);

static char * const zone_names[MAX_NR_ZONES] = {
#ifdef CONFIG_ZONE_DMA
	"DMA",
#endif
#ifdef CONFIG_ZONE_DMA32
	"DMA32",
#endif
	"Normal",
#ifdef CONFIG_HIGHMEM
	"HighMem",
#endif
	"Movable",
};

int min_free_kbytes = 1024;

unsigned long __meminitdata nr_kernel_pages;
unsigned long __meminitdata nr_all_pages;
static unsigned long __meminitdata dma_reserve;

#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
/*
 * MAX_ACTIVE_REGIONS determines the maximum number of distinct
 * ranges of memory (RAM) that may be registered with add_active_range().
 * Ranges passed to add_active_range() will be merged if possible
 * so the number of times add_active_range() can be called is
 * related to the number of nodes and the number of holes
 */
#ifdef CONFIG_MAX_ACTIVE_REGIONS
/* Allow an architecture to set MAX_ACTIVE_REGIONS to save memory */
#define MAX_ACTIVE_REGIONS CONFIG_MAX_ACTIVE_REGIONS
#else
#if MAX_NUMNODES >= 32
/* If there can be many nodes, allow up to 50 holes per node */
#define MAX_ACTIVE_REGIONS (MAX_NUMNODES*50)
#else
/* By default, allow up to 256 distinct regions */
#define MAX_ACTIVE_REGIONS 256
#endif
#endif

static struct node_active_region __meminitdata early_node_map[MAX_ACTIVE_REGIONS];
static int __meminitdata nr_nodemap_entries;
static unsigned long __meminitdata arch_zone_lowest_possible_pfn[MAX_NR_ZONES];
static unsigned long __meminitdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
#ifdef CONFIG_MEMORY_HOTPLUG_RESERVE
static unsigned long __meminitdata node_boundary_start_pfn[MAX_NUMNODES];
static unsigned long __meminitdata node_boundary_end_pfn[MAX_NUMNODES];
#endif /* CONFIG_MEMORY_HOTPLUG_RESERVE */
unsigned long __initdata required_kernelcore;
static unsigned long __initdata required_movablecore;
unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];

/* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
int movable_zone;
EXPORT_SYMBOL(movable_zone);
#endif /* CONFIG_ARCH_POPULATES_NODE_MAP */

#if MAX_NUMNODES > 1
int nr_node_ids __read_mostly = MAX_NUMNODES;
EXPORT_SYMBOL(nr_node_ids);
#endif

int page_group_by_mobility_disabled __read_mostly;

static void set_pageblock_migratetype(struct page *page, int migratetype)
{
	set_pageblock_flags_group(page, (unsigned long)migratetype,
					PB_migrate, PB_migrate_end);
}

#ifdef CONFIG_DEBUG_VM
static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
{
	int ret = 0;
	unsigned seq;
	unsigned long pfn = page_to_pfn(page);

	do {
		seq = zone_span_seqbegin(zone);
		if (pfn >= zone->zone_start_pfn + zone->spanned_pages)
			ret = 1;
		else if (pfn < zone->zone_start_pfn)
			ret = 1;
	} while (zone_span_seqretry(zone, seq));

	return ret;
}

static int page_is_consistent(struct zone *zone, struct page *page)
{
	if (!pfn_valid_within(page_to_pfn(page)))
		return 0;
	if (zone != page_zone(page))
		return 0;

	return 1;
}
/*
 * Temporary debugging check for pages not lying within a given zone.
 */
static int bad_range(struct zone *zone, struct page *page)
{
	if (page_outside_zone_boundaries(zone, page))
		return 1;
	if (!page_is_consistent(zone, page))
		return 1;

	return 0;
}
#else
static inline int bad_range(struct zone *zone, struct page *page)
{
	return 0;
}
#endif

static void bad_page(struct page *page)
{
	void *pc = page_get_page_cgroup(page);

	printk(KERN_EMERG "Bad page state in process '%s'\n" KERN_EMERG
		"page:%p flags:0x%0*lx mapping:%p mapcount:%d count:%d\n",
		current->comm, page, (int)(2*sizeof(unsigned long)),
		(unsigned long)page->flags, page->mapping,
		page_mapcount(page), page_count(page));
	if (pc) {
		printk(KERN_EMERG "cgroup:%p\n", pc);
		page_reset_bad_cgroup(page);
	}
	printk(KERN_EMERG "Trying to fix it up, but a reboot is needed\n"
		KERN_EMERG "Backtrace:\n");
	dump_stack();
	page->flags &= ~(1 << PG_lru |
			1 << PG_private |
			1 << PG_locked |
			1 << PG_active |
			1 << PG_dirty |
			1 << PG_reclaim |
			1 << PG_slab |
			1 << PG_swapcache |
			1 << PG_writeback |
			1 << PG_buddy );
	set_page_count(page, 0);
	reset_page_mapcount(page);
	page->mapping = NULL;
	add_taint(TAINT_BAD_PAGE);
}

/*
 * Higher-order pages are called "compound pages". They are structured thusly:
 *
 * The first PAGE_SIZE page is called the "head page".
 *
 * The remaining PAGE_SIZE pages are called "tail pages".
 *
 * All pages have PG_compound set. All pages have their ->private pointing at
 * the head page (even the head page has this).
 *
 * The first tail page's ->lru.next holds the address of the compound page's
 * put_page() function. Its ->lru.prev holds the order of allocation.
 * This usage means that zero-order pages may not be compound.
 */

static void free_compound_page(struct page *page)
{
	__free_pages_ok(page, compound_order(page));
}

static void prep_compound_page(struct page *page, unsigned long order)
{
	int i;
	int nr_pages = 1 << order;

	set_compound_page_dtor(page, free_compound_page);
	set_compound_order(page, order);
	__SetPageHead(page);
	for (i = 1; i < nr_pages; i++) {
		struct page *p = page + i;

		__SetPageTail(p);
		p->first_page = page;
	}
}

static void destroy_compound_page(struct page *page, unsigned long order)
{
	int i;
	int nr_pages = 1 << order;

	if (unlikely(compound_order(page) != order))
		bad_page(page);

	if (unlikely(!PageHead(page)))
		bad_page(page);
	__ClearPageHead(page);
	for (i = 1; i < nr_pages; i++) {
		struct page *p = page + i;

		if (unlikely(!PageTail(p) |
				(p->first_page != page)))
			bad_page(page);
		__ClearPageTail(p);
	}
}

static inline void prep_zero_page(struct page *page, int order, gfp_t gfp_flags)
{
	int i;

	/*
	 * clear_highpage() will use KM_USER0, so it's a bug to use __GFP_ZERO
	 * and __GFP_HIGHMEM from hard or soft interrupt context.
	 */
	VM_BUG_ON((gfp_flags & __GFP_HIGHMEM) && in_interrupt());
	for (i = 0; i < (1 << order); i++)
		clear_highpage(page + i);
}

static inline void set_page_order(struct page *page, int order)
{
	set_page_private(page, order);
	__SetPageBuddy(page);
}

static inline void rmv_page_order(struct page *page)
{
	__ClearPageBuddy(page);
	set_page_private(page, 0);
}

/*
 * Locate the struct page for both the matching buddy in our
 * pair (buddy1) and the combined O(n+1) page they form (page).
 *
 * 1) Any buddy B1 will have an order O twin B2 which satisfies
 * the following equation:
 *     B2 = B1 ^ (1 << O)
 * For example, if the starting buddy (buddy2) is #8 its order
 * 1 buddy is #10:
 *     B2 = 8 ^ (1 << 1) = 8 ^ 2 = 10
 *
 * 2) Any buddy B will have an order O+1 parent P which
 * satisfies the following equation:
 *     P = B & ~(1 << O)
 *
 * Assumption: *_mem_map is contiguous at least up to MAX_ORDER
 */
static inline struct page *
__page_find_buddy(struct page *page, unsigned long page_idx, unsigned int order)
{
	unsigned long buddy_idx = page_idx ^ (1 << order);

	return page + (buddy_idx - page_idx);
}

static inline unsigned long
__find_combined_index(unsigned long page_idx, unsigned int order)
{
	return (page_idx & ~(1 << order));
}

/*
 * This function checks whether a page is free && is the buddy
 * we can do coalesce a page and its buddy if
 * (a) the buddy is not in a hole &&
 * (b) the buddy is in the buddy system &&
 * (c) a page and its buddy have the same order &&
 * (d) a page and its buddy are in the same zone.
 *
 * For recording whether a page is in the buddy system, we use PG_buddy.
 * Setting, clearing, and testing PG_buddy is serialized by zone->lock.
 *
 * For recording page's order, we use page_private(page).
 */
static inline int page_is_buddy(struct page *page, struct page *buddy,
								int order)
{
	if (!pfn_valid_within(page_to_pfn(buddy)))
		return 0;

	if (page_zone_id(page) != page_zone_id(buddy))
		return 0;

	if (PageBuddy(buddy) && page_order(buddy) == order) {
		BUG_ON(page_count(buddy) != 0);
		return 1;
	}
	return 0;
}

/*
 * Freeing function for a buddy system allocator.
 *
 * The concept of a buddy system is to maintain direct-mapped table
 * (containing bit values) for memory blocks of various "orders".
 * The bottom level table contains the map for the smallest allocatable
 * units of memory (here, pages), and each level above it describes
 * pairs of units from the levels below, hence, "buddies".
 * At a high level, all that happens here is marking the table entry
 * at the bottom level available, and propagating the changes upward
 * as necessary, plus some accounting needed to play nicely with other
 * parts of the VM system.
 * At each level, we keep a list of pages, which are heads of continuous
 * free pages of length of (1 << order) and marked with PG_buddy. Page's
 * order is recorded in page_private(page) field.
 * So when we are allocating or freeing one, we can derive the state of the
 * other. That is, if we allocate a small block, and both were
 * free, the remainder of the region must be split into blocks.
 * If a block is freed, and its buddy is also free, then this
 * triggers coalescing into a block of larger size.
 *
 * -- wli
 */

static inline void __free_one_page(struct page *page,
		struct zone *zone, unsigned int order)
{
	unsigned long page_idx;
	int order_size = 1 << order;
	int migratetype = get_pageblock_migratetype(page);

	if (unlikely(PageCompound(page)))
		destroy_compound_page(page, order);

	page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);

	VM_BUG_ON(page_idx & (order_size - 1));
	VM_BUG_ON(bad_range(zone, page));

	__mod_zone_page_state(zone, NR_FREE_PAGES, order_size);
	while (order < MAX_ORDER-1) {
		unsigned long combined_idx;
		struct page *buddy;

		buddy = __page_find_buddy(page, page_idx, order);
		if (!page_is_buddy(page, buddy, order))
			break;		/* Move the buddy up one level. */

		list_del(&buddy->lru);
		zone->free_area[order].nr_free--;
		rmv_page_order(buddy);
		combined_idx = __find_combined_index(page_idx, order);
		page = page + (combined_idx - page_idx);
		page_idx = combined_idx;
		order++;
	}
	set_page_order(page, order);
	list_add(&page->lru,
		&zone->free_area[order].free_list[migratetype]);
	zone->free_area[order].nr_free++;
}

static inline int free_pages_check(struct page *page)
{
	if (unlikely(page_mapcount(page) |
		(page->mapping != NULL) |
		(page_get_page_cgroup(page) != NULL) |
		(page_count(page) != 0) |
		(page->flags & (
			1 << PG_lru |
			1 << PG_private |
			1 << PG_locked |
			1 << PG_active |
			1 << PG_slab |
			1 << PG_swapcache |
			1 << PG_writeback |
			1 << PG_reserved |
			1 << PG_buddy ))))
		bad_page(page);
	if (PageDirty(page))
		__ClearPageDirty(page);
	/*
	 * For now, we report if PG_reserved was found set, but do not
	 * clear it, and do not free the page. But we shall soon need
	 * to do more, for when the ZERO_PAGE count wraps negative.
	 */
	return PageReserved(page);
}

/*
 * Frees a list of pages.
 * Assumes all pages on list are in same zone, and of same order.
 * count is the number of pages to free.
 *
 * If the zone was previously in an "all pages pinned" state then look to
 * see if this freeing clears that state.
 *
 * And clear the zone's pages_scanned counter, to hold off the "all pages are
 * pinned" detection logic.
 */
static void free_pages_bulk(struct zone *zone, int count,
					struct list_head *list, int order)
{
	spin_lock(&zone->lock);
	zone_clear_flag(zone, ZONE_ALL_UNRECLAIMABLE);
	zone->pages_scanned = 0;
	while (count--) {
		struct page *page;

		VM_BUG_ON(list_empty(list));
		page = list_entry(list->prev, struct page, lru);
		/* have to delete it as __free_one_page list manipulates */
		list_del(&page->lru);
		__free_one_page(page, zone, order);
	}
	spin_unlock(&zone->lock);
}

static void free_one_page(struct zone *zone, struct page *page, int order)
{
	spin_lock(&zone->lock);
	zone_clear_flag(zone, ZONE_ALL_UNRECLAIMABLE);
	zone->pages_scanned = 0;
	__free_one_page(page, zone, order);
	spin_unlock(&zone->lock);
}

static void __free_pages_ok(struct page *page, unsigned int order)
{
	unsigned long flags;
	int i;
	int reserved = 0;

	for (i = 0 ; i < (1 << order) ; ++i)
		reserved += free_pages_check(page + i);
	if (reserved)
		return;

	if (!PageHighMem(page))
		debug_check_no_locks_freed(page_address(page),PAGE_SIZE<<order);
	arch_free_page(page, order);
	kernel_map_pages(page, 1 << order, 0);

	local_irq_save(flags);
	__count_vm_events(PGFREE, 1 << order);
	free_one_page(page_zone(page), page, order);
	local_irq_restore(flags);
}

/*
 * permit the bootmem allocator to evade page validation on high-order frees
 */
void __free_pages_bootmem(struct page *page, unsigned int order)
{
	if (order == 0) {
		__ClearPageReserved(page);
		set_page_count(page, 0);
		set_page_refcounted(page);
		__free_page(page);
	} else {
		int loop;

		prefetchw(page);
		for (loop = 0; loop < BITS_PER_LONG; loop++) {
			struct page *p = &page[loop];

			if (loop + 1 < BITS_PER_LONG)
				prefetchw(p + 1);
			__ClearPageReserved(p);
			set_page_count(p, 0);
		}

		set_page_refcounted(page);
		__free_pages(page, order);
	}
}


/*
 * The order of subdivision here is critical for the IO subsystem.
 * Please do not alter this order without good reasons and regression
 * testing. Specifically, as large blocks of memory are subdivided,
 * the order in which smaller blocks are delivered depends on the order
 * they're subdivided in this function. This is the primary factor
 * influencing the order in which pages are delivered to the IO
 * subsystem according to empirical testing, and this is also justified
 * by considering the behavior of a buddy system containing a single
 * large block of memory acted on by a series of small allocations.
 * This behavior is a critical factor in sglist merging's success.
 *
 * -- wli
 */
static inline void expand(struct zone *zone, struct page *page,
	int low, int high, struct free_area *area,
	int migratetype)
{
	unsigned long size = 1 << high;

	while (high > low) {
		area--;
		high--;
		size >>= 1;
		VM_BUG_ON(bad_range(zone, &page[size]));
		list_add(&page[size].lru, &area->free_list[migratetype]);
		area->nr_free++;
		set_page_order(&page[size], high);
	}
}

/*
 * This page is about to be returned from the page allocator
 */
static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
{
	if (unlikely(page_mapcount(page) |
		(page->mapping != NULL) |
		(page_get_page_cgroup(page) != NULL) |
		(page_count(page) != 0) |
		(page->flags & (
			1 << PG_lru |
			1 << PG_private |
			1 << PG_locked |
			1 << PG_active |
			1 << PG_dirty |
			1 << PG_slab |
			1 << PG_swapcache |
			1 << PG_writeback |
			1 << PG_reserved |
			1 << PG_buddy ))))
		bad_page(page);

	/*
	 * For now, we report if PG_reserved was found set, but do not
	 * clear it, and do not allocate the page: as a safety net.
	 */
	if (PageReserved(page))
		return 1;

	page->flags &= ~(1 << PG_uptodate | 1 << PG_error | 1 << PG_reclaim |
			1 << PG_referenced | 1 << PG_arch_1 |
			1 << PG_owner_priv_1 | 1 << PG_mappedtodisk);
	set_page_private(page, 0);
	set_page_refcounted(page);

	arch_alloc_page(page, order);
	kernel_map_pages(page, 1 << order, 1);

	if (gfp_flags & __GFP_ZERO)
		prep_zero_page(page, order, gfp_flags);

	if (order && (gfp_flags & __GFP_COMP))
		prep_compound_page(page, order);

	return 0;
}

/*
 * Go through the free lists for the given migratetype and remove
 * the smallest available page from the freelists
 */
static struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
						int migratetype)
{
	unsigned int current_order;
	struct free_area * area;
	struct page *page;

	/* Find a page of the appropriate size in the preferred list */
	for (current_order = order; current_order < MAX_ORDER; ++current_order) {
		area = &(zone->free_area[current_order]);
		if (list_empty(&area->free_list[migratetype]))
			continue;

		page = list_entry(area->free_list[migratetype].next,
							struct page, lru);
		list_del(&page->lru);
		rmv_page_order(page);
		area->nr_free--;
		__mod_zone_page_state(zone, NR_FREE_PAGES, - (1UL << order));
		expand(zone, page, order, current_order, area, migratetype);
		return page;
	}

	return NULL;
}


/*
 * This array describes the order lists are fallen back to when
 * the free lists for the desirable migrate type are depleted
 */
static int fallbacks[MIGRATE_TYPES][MIGRATE_TYPES-1] = {
	[MIGRATE_UNMOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE, MIGRATE_RESERVE },
	[MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, MIGRATE_RESERVE },
	[MIGRATE_MOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_RESERVE },
	[MIGRATE_RESERVE] = { MIGRATE_RESERVE, MIGRATE_RESERVE, MIGRATE_RESERVE }, /* Never used */
};

/*
 * Move the free pages in a range to the free lists of the requested type.
 * Note that start_page and end_pages are not aligned on a pageblock
 * boundary. If alignment is required, use move_freepages_block()
 */
int move_freepages(struct zone *zone,
			struct page *start_page, struct page *end_page,
			int migratetype)
{
	struct page *page;
	unsigned long order;
	int pages_moved = 0;

#ifndef CONFIG_HOLES_IN_ZONE
	/*
	 * page_zone is not safe to call in this context when
	 * CONFIG_HOLES_IN_ZONE is set. This bug check is probably redundant
	 * anyway as we check zone boundaries in move_freepages_block().
	 * Remove at a later date when no bug reports exist related to
	 * grouping pages by mobility
	 */
	BUG_ON(page_zone(start_page) != page_zone(end_page));
#endif

	for (page = start_page; page <= end_page;) {
		if (!pfn_valid_within(page_to_pfn(page))) {
			page++;
			continue;
		}

		if (!PageBuddy(page)) {
			page++;
			continue;
		}

		order = page_order(page);
		list_del(&page->lru);
		list_add(&page->lru,
			&zone->free_area[order].free_list[migratetype]);
		page += 1 << order;
		pages_moved += 1 << order;
	}

	return pages_moved;
}

int move_freepages_block(struct zone *zone, struct page *page, int migratetype)
{
	unsigned long start_pfn, end_pfn;
	struct page *start_page, *end_page;

	start_pfn = page_to_pfn(page);
	start_pfn = start_pfn & ~(pageblock_nr_pages-1);
	start_page = pfn_to_page(start_pfn);
	end_page = start_page + pageblock_nr_pages - 1;
	end_pfn = start_pfn + pageblock_nr_pages - 1;

	/* Do not cross zone boundaries */
	if (start_pfn < zone->zone_start_pfn)
		start_page = page;
	if (end_pfn >= zone->zone_start_pfn + zone->spanned_pages)
		return 0;

	return move_freepages(zone, start_page, end_page, migratetype);
}

/* Remove an element from the buddy allocator from the fallback list */
static struct page *__rmqueue_fallback(struct zone *zone, int order,
						int start_migratetype)
{
	struct free_area * area;
	int current_order;
	struct page *page;
	int migratetype, i;

	/* Find the largest possible block of pages in the other list */
	for (current_order = MAX_ORDER-1; current_order >= order;
						--current_order) {
		for (i = 0; i < MIGRATE_TYPES - 1; i++) {
			migratetype = fallbacks[start_migratetype][i];

			/* MIGRATE_RESERVE handled later if necessary */
			if (migratetype == MIGRATE_RESERVE)
				continue;

			area = &(zone->free_area[current_order]);
			if (list_empty(&area->free_list[migratetype]))
				continue;

			page = list_entry(area->free_list[migratetype].next,
					struct page, lru);
			area->nr_free--;

			/*
			 * If breaking a large block of pages, move all free
			 * pages to the preferred allocation list. If falling
			 * back for a reclaimable kernel allocation, be more
			 * agressive about taking ownership of free pages
			 */
			if (unlikely(current_order >= (pageblock_order >> 1)) ||
					start_migratetype == MIGRATE_RECLAIMABLE) {
				unsigned long pages;
				pages = move_freepages_block(zone, page,
							start_migratetype);

				/* Claim the whole block if over half of it is free */
				if (pages >= (1 << (pageblock_order-1)))
					set_pageblock_migratetype(page,
							start_migratetype);

				migratetype = start_migratetype;
			}

			/* Remove the page from the freelists */
			list_del(&page->lru);
			rmv_page_order(page);
			__mod_zone_page_state(zone, NR_FREE_PAGES,
							-(1UL << order));

			if (current_order == pageblock_order)
				set_pageblock_migratetype(page,
							start_migratetype);

			expand(zone, page, order, current_order, area, migratetype);
			return page;
		}
	}

	/* Use MIGRATE_RESERVE rather than fail an allocation */
	return __rmqueue_smallest(zone, order, MIGRATE_RESERVE);
}

/*
 * Do the hard work of removing an element from the buddy allocator.
 * Call me with the zone->lock already held.
 */
static struct page *__rmqueue(struct zone *zone, unsigned int order,
						int migratetype)
{
	struct page *page;

	page = __rmqueue_smallest(zone, order, migratetype);

	if (unlikely(!page))
		page = __rmqueue_fallback(zone, order, migratetype);

	return page;
}

/*
 * Obtain a specified number of elements from the buddy allocator, all under
 * a single hold of the lock, for efficiency. Add them to the supplied list.
 * Returns the number of new pages which were placed at *list.
 */
static int rmqueue_bulk(struct zone *zone, unsigned int order,
			unsigned long count, struct list_head *list,
			int migratetype)
{
	int i;

	spin_lock(&zone->lock);
	for (i = 0; i < count; ++i) {
		struct page *page = __rmqueue(zone, order, migratetype);
		if (unlikely(page == NULL))
			break;

		/*
		 * Split buddy pages returned by expand() are received here
		 * in physical page order. The page is added to the callers and
		 * list and the list head then moves forward. From the callers
		 * perspective, the linked list is ordered by page number in
		 * some conditions. This is useful for IO devices that can
		 * merge IO requests if the physical pages are ordered
		 * properly.
		 */
		list_add(&page->lru, list);
		set_page_private(page, migratetype);
		list = &page->lru;
	}
	spin_unlock(&zone->lock);
	return i;
}

#ifdef CONFIG_NUMA
/*
 * Called from the vmstat counter updater to drain pagesets of this
 * currently executing processor on remote nodes after they have
 * expired.
 *
 * Note that this function must be called with the thread pinned to
 * a single processor.
 */
void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
{
	unsigned long flags;
	int to_drain;

	local_irq_save(flags);
	if (pcp->count >= pcp->batch)
		to_drain = pcp->batch;
	else
		to_drain = pcp->count;
	free_pages_bulk(zone, to_drain, &pcp->list, 0);
	pcp->count -= to_drain;
	local_irq_restore(flags);
}
#endif

/*
 * Drain pages of the indicated processor.
 *
 * The processor must either be the current processor and the
 * thread pinned to the current processor or a processor that
 * is not online.
 */
static void drain_pages(unsigned int cpu)
{
	unsigned long flags;
	struct zone *zone;

	for_each_zone(zone) {
		struct per_cpu_pageset *pset;
		struct per_cpu_pages *pcp;

		if (!populated_zone(zone))
			continue;

		pset = zone_pcp(zone, cpu);

		pcp = &pset->pcp;
		local_irq_save(flags);
		free_pages_bulk(zone, pcp->count, &pcp->list, 0);
		pcp->count = 0;
		local_irq_restore(flags);
	}
}

/*
 * Spill all of this CPU's per-cpu pages back into the buddy allocator.
 */
void drain_local_pages(void *arg)
{
	drain_pages(smp_processor_id());
}

/*
 * Spill all the per-cpu pages from all CPUs back into the buddy allocator
 */
void drain_all_pages(void)
{
	on_each_cpu(drain_local_pages, NULL, 0, 1);
946 } 946 }
947 947
948 #ifdef CONFIG_HIBERNATION 948 #ifdef CONFIG_HIBERNATION
949 949
950 void mark_free_pages(struct zone *zone) 950 void mark_free_pages(struct zone *zone)
951 { 951 {
952 unsigned long pfn, max_zone_pfn; 952 unsigned long pfn, max_zone_pfn;
953 unsigned long flags; 953 unsigned long flags;
954 int order, t; 954 int order, t;
955 struct list_head *curr; 955 struct list_head *curr;
956 956
957 if (!zone->spanned_pages) 957 if (!zone->spanned_pages)
958 return; 958 return;
959 959
960 spin_lock_irqsave(&zone->lock, flags); 960 spin_lock_irqsave(&zone->lock, flags);
961 961
962 max_zone_pfn = zone->zone_start_pfn + zone->spanned_pages; 962 max_zone_pfn = zone->zone_start_pfn + zone->spanned_pages;
963 for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) 963 for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++)
964 if (pfn_valid(pfn)) { 964 if (pfn_valid(pfn)) {
965 struct page *page = pfn_to_page(pfn); 965 struct page *page = pfn_to_page(pfn);
966 966
967 if (!swsusp_page_is_forbidden(page)) 967 if (!swsusp_page_is_forbidden(page))
968 swsusp_unset_page_free(page); 968 swsusp_unset_page_free(page);
969 } 969 }
970 970
971 for_each_migratetype_order(order, t) { 971 for_each_migratetype_order(order, t) {
972 list_for_each(curr, &zone->free_area[order].free_list[t]) { 972 list_for_each(curr, &zone->free_area[order].free_list[t]) {
973 unsigned long i; 973 unsigned long i;
974 974
975 pfn = page_to_pfn(list_entry(curr, struct page, lru)); 975 pfn = page_to_pfn(list_entry(curr, struct page, lru));
976 for (i = 0; i < (1UL << order); i++) 976 for (i = 0; i < (1UL << order); i++)
977 swsusp_set_page_free(pfn_to_page(pfn + i)); 977 swsusp_set_page_free(pfn_to_page(pfn + i));
978 } 978 }
979 } 979 }
980 spin_unlock_irqrestore(&zone->lock, flags); 980 spin_unlock_irqrestore(&zone->lock, flags);
981 } 981 }
982 #endif /* CONFIG_HIBERNATION */ 982 #endif /* CONFIG_HIBERNATION */
983 983
984 /* 984 /*
985 * Free a 0-order page 985 * Free a 0-order page
986 */ 986 */
987 static void free_hot_cold_page(struct page *page, int cold) 987 static void free_hot_cold_page(struct page *page, int cold)
988 { 988 {
989 struct zone *zone = page_zone(page); 989 struct zone *zone = page_zone(page);
990 struct per_cpu_pages *pcp; 990 struct per_cpu_pages *pcp;
991 unsigned long flags; 991 unsigned long flags;
992 992
993 if (PageAnon(page)) 993 if (PageAnon(page))
994 page->mapping = NULL; 994 page->mapping = NULL;
995 if (free_pages_check(page)) 995 if (free_pages_check(page))
996 return; 996 return;
997 997
998 if (!PageHighMem(page)) 998 if (!PageHighMem(page))
999 debug_check_no_locks_freed(page_address(page), PAGE_SIZE); 999 debug_check_no_locks_freed(page_address(page), PAGE_SIZE);
1000 arch_free_page(page, 0); 1000 arch_free_page(page, 0);
1001 kernel_map_pages(page, 1, 0); 1001 kernel_map_pages(page, 1, 0);
1002 1002
1003 pcp = &zone_pcp(zone, get_cpu())->pcp; 1003 pcp = &zone_pcp(zone, get_cpu())->pcp;
1004 local_irq_save(flags); 1004 local_irq_save(flags);
1005 __count_vm_event(PGFREE); 1005 __count_vm_event(PGFREE);
1006 if (cold) 1006 if (cold)
1007 list_add_tail(&page->lru, &pcp->list); 1007 list_add_tail(&page->lru, &pcp->list);
1008 else 1008 else
1009 list_add(&page->lru, &pcp->list); 1009 list_add(&page->lru, &pcp->list);
1010 set_page_private(page, get_pageblock_migratetype(page)); 1010 set_page_private(page, get_pageblock_migratetype(page));
1011 pcp->count++; 1011 pcp->count++;
1012 if (pcp->count >= pcp->high) { 1012 if (pcp->count >= pcp->high) {
1013 free_pages_bulk(zone, pcp->batch, &pcp->list, 0); 1013 free_pages_bulk(zone, pcp->batch, &pcp->list, 0);
1014 pcp->count -= pcp->batch; 1014 pcp->count -= pcp->batch;
1015 } 1015 }
1016 local_irq_restore(flags); 1016 local_irq_restore(flags);
1017 put_cpu(); 1017 put_cpu();
1018 } 1018 }
1019 1019
1020 void free_hot_page(struct page *page) 1020 void free_hot_page(struct page *page)
1021 { 1021 {
1022 free_hot_cold_page(page, 0); 1022 free_hot_cold_page(page, 0);
1023 } 1023 }
1024 1024
1025 void free_cold_page(struct page *page) 1025 void free_cold_page(struct page *page)
1026 { 1026 {
1027 free_hot_cold_page(page, 1); 1027 free_hot_cold_page(page, 1);
1028 } 1028 }
1029 1029
1030 /* 1030 /*
1031 * split_page takes a non-compound higher-order page, and splits it into 1031 * split_page takes a non-compound higher-order page, and splits it into
1032 * n (= 1<<order) sub-pages: page[0..n-1] 1032 * n (= 1<<order) sub-pages: page[0..n-1]
1033 * Each sub-page must be freed individually. 1033 * Each sub-page must be freed individually.
1034 * 1034 *
1035 * Note: this is probably too low level an operation for use in drivers. 1035 * Note: this is probably too low level an operation for use in drivers.
1036 * Please consult with lkml before using this in your driver. 1036 * Please consult with lkml before using this in your driver.
1037 */ 1037 */
1038 void split_page(struct page *page, unsigned int order) 1038 void split_page(struct page *page, unsigned int order)
1039 { 1039 {
1040 int i; 1040 int i;
1041 1041
1042 VM_BUG_ON(PageCompound(page)); 1042 VM_BUG_ON(PageCompound(page));
1043 VM_BUG_ON(!page_count(page)); 1043 VM_BUG_ON(!page_count(page));
1044 for (i = 1; i < (1 << order); i++) 1044 for (i = 1; i < (1 << order); i++)
1045 set_page_refcounted(page + i); 1045 set_page_refcounted(page + i);
1046 } 1046 }
1047 1047
1048 /* 1048 /*
1049 * Really, prep_compound_page() should be called from __rmqueue_bulk(). But 1049 * Really, prep_compound_page() should be called from __rmqueue_bulk(). But
1050 * we cheat by calling it from here, in the order > 0 path. Saves a branch 1050 * we cheat by calling it from here, in the order > 0 path. Saves a branch
1051 * or two. 1051 * or two.
1052 */ 1052 */
1053 static struct page *buffered_rmqueue(struct zone *preferred_zone, 1053 static struct page *buffered_rmqueue(struct zone *preferred_zone,
1054 struct zone *zone, int order, gfp_t gfp_flags) 1054 struct zone *zone, int order, gfp_t gfp_flags)
1055 { 1055 {
1056 unsigned long flags; 1056 unsigned long flags;
1057 struct page *page; 1057 struct page *page;
1058 int cold = !!(gfp_flags & __GFP_COLD); 1058 int cold = !!(gfp_flags & __GFP_COLD);
1059 int cpu; 1059 int cpu;
1060 int migratetype = allocflags_to_migratetype(gfp_flags); 1060 int migratetype = allocflags_to_migratetype(gfp_flags);
1061 1061
1062 again: 1062 again:
1063 cpu = get_cpu(); 1063 cpu = get_cpu();
1064 if (likely(order == 0)) { 1064 if (likely(order == 0)) {
1065 struct per_cpu_pages *pcp; 1065 struct per_cpu_pages *pcp;
1066 1066
1067 pcp = &zone_pcp(zone, cpu)->pcp; 1067 pcp = &zone_pcp(zone, cpu)->pcp;
1068 local_irq_save(flags); 1068 local_irq_save(flags);
1069 if (!pcp->count) { 1069 if (!pcp->count) {
1070 pcp->count = rmqueue_bulk(zone, 0, 1070 pcp->count = rmqueue_bulk(zone, 0,
1071 pcp->batch, &pcp->list, migratetype); 1071 pcp->batch, &pcp->list, migratetype);
1072 if (unlikely(!pcp->count)) 1072 if (unlikely(!pcp->count))
1073 goto failed; 1073 goto failed;
1074 } 1074 }
1075 1075
1076 /* Find a page of the appropriate migrate type */ 1076 /* Find a page of the appropriate migrate type */
1077 if (cold) { 1077 if (cold) {
1078 list_for_each_entry_reverse(page, &pcp->list, lru) 1078 list_for_each_entry_reverse(page, &pcp->list, lru)
1079 if (page_private(page) == migratetype) 1079 if (page_private(page) == migratetype)
1080 break; 1080 break;
1081 } else { 1081 } else {
1082 list_for_each_entry(page, &pcp->list, lru) 1082 list_for_each_entry(page, &pcp->list, lru)
1083 if (page_private(page) == migratetype) 1083 if (page_private(page) == migratetype)
1084 break; 1084 break;
1085 } 1085 }
1086 1086
1087 /* Allocate more to the pcp list if necessary */ 1087 /* Allocate more to the pcp list if necessary */
1088 if (unlikely(&page->lru == &pcp->list)) { 1088 if (unlikely(&page->lru == &pcp->list)) {
1089 pcp->count += rmqueue_bulk(zone, 0, 1089 pcp->count += rmqueue_bulk(zone, 0,
1090 pcp->batch, &pcp->list, migratetype); 1090 pcp->batch, &pcp->list, migratetype);
1091 page = list_entry(pcp->list.next, struct page, lru); 1091 page = list_entry(pcp->list.next, struct page, lru);
1092 } 1092 }
1093 1093
1094 list_del(&page->lru); 1094 list_del(&page->lru);
1095 pcp->count--; 1095 pcp->count--;
1096 } else { 1096 } else {
1097 spin_lock_irqsave(&zone->lock, flags); 1097 spin_lock_irqsave(&zone->lock, flags);
1098 page = __rmqueue(zone, order, migratetype); 1098 page = __rmqueue(zone, order, migratetype);
1099 spin_unlock(&zone->lock); 1099 spin_unlock(&zone->lock);
1100 if (!page) 1100 if (!page)
1101 goto failed; 1101 goto failed;
1102 } 1102 }
1103 1103
1104 __count_zone_vm_events(PGALLOC, zone, 1 << order); 1104 __count_zone_vm_events(PGALLOC, zone, 1 << order);
1105 zone_statistics(preferred_zone, zone); 1105 zone_statistics(preferred_zone, zone);
1106 local_irq_restore(flags); 1106 local_irq_restore(flags);
1107 put_cpu(); 1107 put_cpu();
1108 1108
1109 VM_BUG_ON(bad_range(zone, page)); 1109 VM_BUG_ON(bad_range(zone, page));
1110 if (prep_new_page(page, order, gfp_flags)) 1110 if (prep_new_page(page, order, gfp_flags))
1111 goto again; 1111 goto again;
1112 return page; 1112 return page;
1113 1113
1114 failed: 1114 failed:
1115 local_irq_restore(flags); 1115 local_irq_restore(flags);
1116 put_cpu(); 1116 put_cpu();
1117 return NULL; 1117 return NULL;
1118 } 1118 }
1119 1119
1120 #define ALLOC_NO_WATERMARKS 0x01 /* don't check watermarks at all */ 1120 #define ALLOC_NO_WATERMARKS 0x01 /* don't check watermarks at all */
1121 #define ALLOC_WMARK_MIN 0x02 /* use pages_min watermark */ 1121 #define ALLOC_WMARK_MIN 0x02 /* use pages_min watermark */
1122 #define ALLOC_WMARK_LOW 0x04 /* use pages_low watermark */ 1122 #define ALLOC_WMARK_LOW 0x04 /* use pages_low watermark */
1123 #define ALLOC_WMARK_HIGH 0x08 /* use pages_high watermark */ 1123 #define ALLOC_WMARK_HIGH 0x08 /* use pages_high watermark */
1124 #define ALLOC_HARDER 0x10 /* try to alloc harder */ 1124 #define ALLOC_HARDER 0x10 /* try to alloc harder */
1125 #define ALLOC_HIGH 0x20 /* __GFP_HIGH set */ 1125 #define ALLOC_HIGH 0x20 /* __GFP_HIGH set */
1126 #define ALLOC_CPUSET 0x40 /* check for correct cpuset */ 1126 #define ALLOC_CPUSET 0x40 /* check for correct cpuset */
1127 1127
1128 #ifdef CONFIG_FAIL_PAGE_ALLOC 1128 #ifdef CONFIG_FAIL_PAGE_ALLOC
1129 1129
1130 static struct fail_page_alloc_attr { 1130 static struct fail_page_alloc_attr {
1131 struct fault_attr attr; 1131 struct fault_attr attr;
1132 1132
1133 u32 ignore_gfp_highmem; 1133 u32 ignore_gfp_highmem;
1134 u32 ignore_gfp_wait; 1134 u32 ignore_gfp_wait;
1135 u32 min_order; 1135 u32 min_order;
1136 1136
1137 #ifdef CONFIG_FAULT_INJECTION_DEBUG_FS 1137 #ifdef CONFIG_FAULT_INJECTION_DEBUG_FS
1138 1138
1139 struct dentry *ignore_gfp_highmem_file; 1139 struct dentry *ignore_gfp_highmem_file;
1140 struct dentry *ignore_gfp_wait_file; 1140 struct dentry *ignore_gfp_wait_file;
1141 struct dentry *min_order_file; 1141 struct dentry *min_order_file;
1142 1142
1143 #endif /* CONFIG_FAULT_INJECTION_DEBUG_FS */ 1143 #endif /* CONFIG_FAULT_INJECTION_DEBUG_FS */
1144 1144
1145 } fail_page_alloc = { 1145 } fail_page_alloc = {
1146 .attr = FAULT_ATTR_INITIALIZER, 1146 .attr = FAULT_ATTR_INITIALIZER,
1147 .ignore_gfp_wait = 1, 1147 .ignore_gfp_wait = 1,
1148 .ignore_gfp_highmem = 1, 1148 .ignore_gfp_highmem = 1,
1149 .min_order = 1, 1149 .min_order = 1,
1150 }; 1150 };
1151 1151
1152 static int __init setup_fail_page_alloc(char *str) 1152 static int __init setup_fail_page_alloc(char *str)
1153 { 1153 {
1154 return setup_fault_attr(&fail_page_alloc.attr, str); 1154 return setup_fault_attr(&fail_page_alloc.attr, str);
1155 } 1155 }
1156 __setup("fail_page_alloc=", setup_fail_page_alloc); 1156 __setup("fail_page_alloc=", setup_fail_page_alloc);
1157 1157
1158 static int should_fail_alloc_page(gfp_t gfp_mask, unsigned int order) 1158 static int should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
1159 { 1159 {
1160 if (order < fail_page_alloc.min_order) 1160 if (order < fail_page_alloc.min_order)
1161 return 0; 1161 return 0;
1162 if (gfp_mask & __GFP_NOFAIL) 1162 if (gfp_mask & __GFP_NOFAIL)
1163 return 0; 1163 return 0;
1164 if (fail_page_alloc.ignore_gfp_highmem && (gfp_mask & __GFP_HIGHMEM)) 1164 if (fail_page_alloc.ignore_gfp_highmem && (gfp_mask & __GFP_HIGHMEM))
1165 return 0; 1165 return 0;
1166 if (fail_page_alloc.ignore_gfp_wait && (gfp_mask & __GFP_WAIT)) 1166 if (fail_page_alloc.ignore_gfp_wait && (gfp_mask & __GFP_WAIT))
1167 return 0; 1167 return 0;
1168 1168
1169 return should_fail(&fail_page_alloc.attr, 1 << order); 1169 return should_fail(&fail_page_alloc.attr, 1 << order);
1170 } 1170 }
1171 1171
1172 #ifdef CONFIG_FAULT_INJECTION_DEBUG_FS 1172 #ifdef CONFIG_FAULT_INJECTION_DEBUG_FS
1173 1173
1174 static int __init fail_page_alloc_debugfs(void) 1174 static int __init fail_page_alloc_debugfs(void)
1175 { 1175 {
1176 mode_t mode = S_IFREG | S_IRUSR | S_IWUSR; 1176 mode_t mode = S_IFREG | S_IRUSR | S_IWUSR;
1177 struct dentry *dir; 1177 struct dentry *dir;
1178 int err; 1178 int err;
1179 1179
1180 err = init_fault_attr_dentries(&fail_page_alloc.attr, 1180 err = init_fault_attr_dentries(&fail_page_alloc.attr,
1181 "fail_page_alloc"); 1181 "fail_page_alloc");
1182 if (err) 1182 if (err)
1183 return err; 1183 return err;
1184 dir = fail_page_alloc.attr.dentries.dir; 1184 dir = fail_page_alloc.attr.dentries.dir;
1185 1185
1186 fail_page_alloc.ignore_gfp_wait_file = 1186 fail_page_alloc.ignore_gfp_wait_file =
1187 debugfs_create_bool("ignore-gfp-wait", mode, dir, 1187 debugfs_create_bool("ignore-gfp-wait", mode, dir,
1188 &fail_page_alloc.ignore_gfp_wait); 1188 &fail_page_alloc.ignore_gfp_wait);
1189 1189
1190 fail_page_alloc.ignore_gfp_highmem_file = 1190 fail_page_alloc.ignore_gfp_highmem_file =
1191 debugfs_create_bool("ignore-gfp-highmem", mode, dir, 1191 debugfs_create_bool("ignore-gfp-highmem", mode, dir,
1192 &fail_page_alloc.ignore_gfp_highmem); 1192 &fail_page_alloc.ignore_gfp_highmem);
1193 fail_page_alloc.min_order_file = 1193 fail_page_alloc.min_order_file =
1194 debugfs_create_u32("min-order", mode, dir, 1194 debugfs_create_u32("min-order", mode, dir,
1195 &fail_page_alloc.min_order); 1195 &fail_page_alloc.min_order);
1196 1196
1197 if (!fail_page_alloc.ignore_gfp_wait_file || 1197 if (!fail_page_alloc.ignore_gfp_wait_file ||
1198 !fail_page_alloc.ignore_gfp_highmem_file || 1198 !fail_page_alloc.ignore_gfp_highmem_file ||
1199 !fail_page_alloc.min_order_file) { 1199 !fail_page_alloc.min_order_file) {
1200 err = -ENOMEM; 1200 err = -ENOMEM;
1201 debugfs_remove(fail_page_alloc.ignore_gfp_wait_file); 1201 debugfs_remove(fail_page_alloc.ignore_gfp_wait_file);
1202 debugfs_remove(fail_page_alloc.ignore_gfp_highmem_file); 1202 debugfs_remove(fail_page_alloc.ignore_gfp_highmem_file);
1203 debugfs_remove(fail_page_alloc.min_order_file); 1203 debugfs_remove(fail_page_alloc.min_order_file);
1204 cleanup_fault_attr_dentries(&fail_page_alloc.attr); 1204 cleanup_fault_attr_dentries(&fail_page_alloc.attr);
1205 } 1205 }
1206 1206
1207 return err; 1207 return err;
1208 } 1208 }
1209 1209
1210 late_initcall(fail_page_alloc_debugfs); 1210 late_initcall(fail_page_alloc_debugfs);
1211 1211
1212 #endif /* CONFIG_FAULT_INJECTION_DEBUG_FS */ 1212 #endif /* CONFIG_FAULT_INJECTION_DEBUG_FS */
1213 1213
1214 #else /* CONFIG_FAIL_PAGE_ALLOC */ 1214 #else /* CONFIG_FAIL_PAGE_ALLOC */
1215 1215
1216 static inline int should_fail_alloc_page(gfp_t gfp_mask, unsigned int order) 1216 static inline int should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
1217 { 1217 {
1218 return 0; 1218 return 0;
1219 } 1219 }
1220 1220
1221 #endif /* CONFIG_FAIL_PAGE_ALLOC */ 1221 #endif /* CONFIG_FAIL_PAGE_ALLOC */
1222 1222
1223 /* 1223 /*
1224 * Return 1 if free pages are above 'mark'. This takes into account the order 1224 * Return 1 if free pages are above 'mark'. This takes into account the order
1225 * of the allocation. 1225 * of the allocation.
1226 */ 1226 */
1227 int zone_watermark_ok(struct zone *z, int order, unsigned long mark, 1227 int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
1228 int classzone_idx, int alloc_flags) 1228 int classzone_idx, int alloc_flags)
1229 { 1229 {
1230 /* free_pages may go negative - that's OK */ 1230 /* free_pages may go negative - that's OK */
1231 long min = mark; 1231 long min = mark;
1232 long free_pages = zone_page_state(z, NR_FREE_PAGES) - (1 << order) + 1; 1232 long free_pages = zone_page_state(z, NR_FREE_PAGES) - (1 << order) + 1;
1233 int o; 1233 int o;
1234 1234
1235 if (alloc_flags & ALLOC_HIGH) 1235 if (alloc_flags & ALLOC_HIGH)
1236 min -= min / 2; 1236 min -= min / 2;
1237 if (alloc_flags & ALLOC_HARDER) 1237 if (alloc_flags & ALLOC_HARDER)
1238 min -= min / 4; 1238 min -= min / 4;
1239 1239
1240 if (free_pages <= min + z->lowmem_reserve[classzone_idx]) 1240 if (free_pages <= min + z->lowmem_reserve[classzone_idx])
1241 return 0; 1241 return 0;
1242 for (o = 0; o < order; o++) { 1242 for (o = 0; o < order; o++) {
1243 /* At the next order, this order's pages become unavailable */ 1243 /* At the next order, this order's pages become unavailable */
1244 free_pages -= z->free_area[o].nr_free << o; 1244 free_pages -= z->free_area[o].nr_free << o;
1245 1245
1246 /* Require fewer higher order pages to be free */ 1246 /* Require fewer higher order pages to be free */
1247 min >>= 1; 1247 min >>= 1;
1248 1248
1249 if (free_pages <= min) 1249 if (free_pages <= min)
1250 return 0; 1250 return 0;
1251 } 1251 }
1252 return 1; 1252 return 1;
1253 } 1253 }
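
As a rough illustration of the order loop above, the following toy (invented numbers; the nr_free array and the watermark value are hypothetical, not taken from any real zone) shows why a zone whose free memory sits entirely in order-0 pages can fail an order-2 check even though its total free memory is above the mark.

#include <stdio.h>

int main(void)
{
        long free_pages = 2000;            /* hypothetical total free pages in the zone */
        long nr_free[3] = { 2000, 0, 0 };  /* free pages per order: all of it is order-0 */
        long min = 1024;                   /* hypothetical watermark ("mark") */
        int order = 2, o;

        for (o = 0; o < order; o++) {
                /* at the next order, this order's pages can no longer help */
                free_pages -= nr_free[o] << o;
                /* require fewer higher-order pages to be free */
                min >>= 1;
                if (free_pages <= min) {
                        printf("order-%d request fails at o=%d\n", order, o);
                        return 0;
                }
        }
        printf("order-%d request passes\n", order);
        return 0;
}
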
1254 1254
1255 #ifdef CONFIG_NUMA 1255 #ifdef CONFIG_NUMA
1256 /* 1256 /*
1257 * zlc_setup - Setup for "zonelist cache". Uses cached zone data to 1257 * zlc_setup - Setup for "zonelist cache". Uses cached zone data to
1258 * skip over zones that are not allowed by the cpuset, or that have 1258 * skip over zones that are not allowed by the cpuset, or that have
1259 * been recently (in last second) found to be nearly full. See further 1259 * been recently (in last second) found to be nearly full. See further
1260 * comments in mmzone.h. Reduces cache footprint of zonelist scans 1260 * comments in mmzone.h. Reduces cache footprint of zonelist scans
1261 * that have to skip over a lot of full or unallowed zones. 1261 * that have to skip over a lot of full or unallowed zones.
1262 * 1262 *
1263 * If the zonelist cache is present in the passed in zonelist, then 1263 * If the zonelist cache is present in the passed in zonelist, then
1264 * returns a pointer to the allowed node mask (either the current 1264 * returns a pointer to the allowed node mask (either the current
1265 * task's mems_allowed, or node_states[N_HIGH_MEMORY].) 1265 * task's mems_allowed, or node_states[N_HIGH_MEMORY].)
1266 * 1266 *
1267 * If the zonelist cache is not available for this zonelist, does 1267 * If the zonelist cache is not available for this zonelist, does
1268 * nothing and returns NULL. 1268 * nothing and returns NULL.
1269 * 1269 *
1270 * If the fullzones BITMAP in the zonelist cache is stale (more than 1270 * If the fullzones BITMAP in the zonelist cache is stale (more than
1271 * a second since last zap'd) then we zap it out (clear its bits.) 1271 * a second since last zap'd) then we zap it out (clear its bits.)
1272 * 1272 *
1273 * We hold off even calling zlc_setup, until after we've checked the 1273 * We hold off even calling zlc_setup, until after we've checked the
1274 * first zone in the zonelist, on the theory that most allocations will 1274 * first zone in the zonelist, on the theory that most allocations will
1275 * be satisfied from that first zone, so best to examine that zone as 1275 * be satisfied from that first zone, so best to examine that zone as
1276 * quickly as we can. 1276 * quickly as we can.
1277 */ 1277 */
1278 static nodemask_t *zlc_setup(struct zonelist *zonelist, int alloc_flags) 1278 static nodemask_t *zlc_setup(struct zonelist *zonelist, int alloc_flags)
1279 { 1279 {
1280 struct zonelist_cache *zlc; /* cached zonelist speedup info */ 1280 struct zonelist_cache *zlc; /* cached zonelist speedup info */
1281 nodemask_t *allowednodes; /* zonelist_cache approximation */ 1281 nodemask_t *allowednodes; /* zonelist_cache approximation */
1282 1282
1283 zlc = zonelist->zlcache_ptr; 1283 zlc = zonelist->zlcache_ptr;
1284 if (!zlc) 1284 if (!zlc)
1285 return NULL; 1285 return NULL;
1286 1286
1287 if (time_after(jiffies, zlc->last_full_zap + HZ)) { 1287 if (time_after(jiffies, zlc->last_full_zap + HZ)) {
1288 bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST); 1288 bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST);
1289 zlc->last_full_zap = jiffies; 1289 zlc->last_full_zap = jiffies;
1290 } 1290 }
1291 1291
1292 allowednodes = !in_interrupt() && (alloc_flags & ALLOC_CPUSET) ? 1292 allowednodes = !in_interrupt() && (alloc_flags & ALLOC_CPUSET) ?
1293 &cpuset_current_mems_allowed : 1293 &cpuset_current_mems_allowed :
1294 &node_states[N_HIGH_MEMORY]; 1294 &node_states[N_HIGH_MEMORY];
1295 return allowednodes; 1295 return allowednodes;
1296 } 1296 }
1297 1297
1298 /* 1298 /*
1299 * Given 'z' scanning a zonelist, run a couple of quick checks to see 1299 * Given 'z' scanning a zonelist, run a couple of quick checks to see
1300 * if it is worth looking at further for free memory: 1300 * if it is worth looking at further for free memory:
1301 * 1) Check that the zone isn't thought to be full (doesn't have its 1301 * 1) Check that the zone isn't thought to be full (doesn't have its
1302 * bit set in the zonelist_cache fullzones BITMAP). 1302 * bit set in the zonelist_cache fullzones BITMAP).
1303 * 2) Check that the zone's node (obtained from the zonelist_cache 1303 * 2) Check that the zone's node (obtained from the zonelist_cache
1304 * z_to_n[] mapping) is allowed in the passed in allowednodes mask. 1304 * z_to_n[] mapping) is allowed in the passed in allowednodes mask.
1305 * Return true (non-zero) if zone is worth looking at further, or 1305 * Return true (non-zero) if zone is worth looking at further, or
1306 * else return false (zero) if it is not. 1306 * else return false (zero) if it is not.
1307 * 1307 *
1308 * This check -ignores- the distinction between various watermarks, 1308 * This check -ignores- the distinction between various watermarks,
1309 * such as GFP_HIGH, GFP_ATOMIC, PF_MEMALLOC, ... If a zone is 1309 * such as GFP_HIGH, GFP_ATOMIC, PF_MEMALLOC, ... If a zone is
1310 * found to be full for any variation of these watermarks, it will 1310 * found to be full for any variation of these watermarks, it will
1311 * be considered full for up to one second by all requests, unless 1311 * be considered full for up to one second by all requests, unless
1312 * we are so low on memory on all allowed nodes that we are forced 1312 * we are so low on memory on all allowed nodes that we are forced
1313 * into the second scan of the zonelist. 1313 * into the second scan of the zonelist.
1314 * 1314 *
1315 * In the second scan we ignore this zonelist cache and exactly 1315 * In the second scan we ignore this zonelist cache and exactly
1316 * apply the watermarks to all zones, even if it is slower to do so. 1316 * apply the watermarks to all zones, even if it is slower to do so.
1317 * We are low on memory in the second scan, and should leave no stone 1317 * We are low on memory in the second scan, and should leave no stone
1318 * unturned looking for a free page. 1318 * unturned looking for a free page.
1319 */ 1319 */
1320 static int zlc_zone_worth_trying(struct zonelist *zonelist, struct zoneref *z, 1320 static int zlc_zone_worth_trying(struct zonelist *zonelist, struct zoneref *z,
1321 nodemask_t *allowednodes) 1321 nodemask_t *allowednodes)
1322 { 1322 {
1323 struct zonelist_cache *zlc; /* cached zonelist speedup info */ 1323 struct zonelist_cache *zlc; /* cached zonelist speedup info */
1324 int i; /* index of *z in zonelist zones */ 1324 int i; /* index of *z in zonelist zones */
1325 int n; /* node that zone *z is on */ 1325 int n; /* node that zone *z is on */
1326 1326
1327 zlc = zonelist->zlcache_ptr; 1327 zlc = zonelist->zlcache_ptr;
1328 if (!zlc) 1328 if (!zlc)
1329 return 1; 1329 return 1;
1330 1330
1331 i = z - zonelist->_zonerefs; 1331 i = z - zonelist->_zonerefs;
1332 n = zlc->z_to_n[i]; 1332 n = zlc->z_to_n[i];
1333 1333
1334 /* This zone is worth trying if it is allowed but not full */ 1334 /* This zone is worth trying if it is allowed but not full */
1335 return node_isset(n, *allowednodes) && !test_bit(i, zlc->fullzones); 1335 return node_isset(n, *allowednodes) && !test_bit(i, zlc->fullzones);
1336 } 1336 }
1337 1337
1338 /* 1338 /*
1339 * Given 'z' scanning a zonelist, set the corresponding bit in 1339 * Given 'z' scanning a zonelist, set the corresponding bit in
1340 * zlc->fullzones, so that subsequent attempts to allocate a page 1340 * zlc->fullzones, so that subsequent attempts to allocate a page
1341 * from that zone don't waste time re-examining it. 1341 * from that zone don't waste time re-examining it.
1342 */ 1342 */
1343 static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z) 1343 static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z)
1344 { 1344 {
1345 struct zonelist_cache *zlc; /* cached zonelist speedup info */ 1345 struct zonelist_cache *zlc; /* cached zonelist speedup info */
1346 int i; /* index of *z in zonelist zones */ 1346 int i; /* index of *z in zonelist zones */
1347 1347
1348 zlc = zonelist->zlcache_ptr; 1348 zlc = zonelist->zlcache_ptr;
1349 if (!zlc) 1349 if (!zlc)
1350 return; 1350 return;
1351 1351
1352 i = z - zonelist->_zonerefs; 1352 i = z - zonelist->_zonerefs;
1353 1353
1354 set_bit(i, zlc->fullzones); 1354 set_bit(i, zlc->fullzones);
1355 } 1355 }
1356 1356
1357 #else /* CONFIG_NUMA */ 1357 #else /* CONFIG_NUMA */
1358 1358
1359 static nodemask_t *zlc_setup(struct zonelist *zonelist, int alloc_flags) 1359 static nodemask_t *zlc_setup(struct zonelist *zonelist, int alloc_flags)
1360 { 1360 {
1361 return NULL; 1361 return NULL;
1362 } 1362 }
1363 1363
1364 static int zlc_zone_worth_trying(struct zonelist *zonelist, struct zoneref *z, 1364 static int zlc_zone_worth_trying(struct zonelist *zonelist, struct zoneref *z,
1365 nodemask_t *allowednodes) 1365 nodemask_t *allowednodes)
1366 { 1366 {
1367 return 1; 1367 return 1;
1368 } 1368 }
1369 1369
1370 static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z) 1370 static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z)
1371 { 1371 {
1372 } 1372 }
1373 #endif /* CONFIG_NUMA */ 1373 #endif /* CONFIG_NUMA */
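
As a rough sketch of the zonelist-cache idea described in the comments above (a per-zone "full" bit that expires after roughly a second), the following standalone toy uses hypothetical names such as toy_zlc and wall-clock time in place of jiffies; it is an illustration of the caching scheme, not the kernel code.

#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define MAX_ZONES 16

/* Stand-in for struct zonelist_cache: one "full" bit per zone plus a
 * timestamp of the last time the bits were cleared. */
struct toy_zlc {
        bool fullzones[MAX_ZONES];
        time_t last_full_zap;
};

/* Clear stale "full" bits once more than a second has passed, mirroring
 * the jiffies-based zap in zlc_setup(). */
static void toy_zlc_setup(struct toy_zlc *zlc)
{
        time_t now = time(NULL);

        if (now > zlc->last_full_zap + 1) {
                memset(zlc->fullzones, 0, sizeof(zlc->fullzones));
                zlc->last_full_zap = now;
        }
}

/* A zone is worth scanning only if it has not recently been seen full. */
static bool toy_zone_worth_trying(const struct toy_zlc *zlc, int zone_idx)
{
        return !zlc->fullzones[zone_idx];
}

/* Remember that a zone just failed its watermark check. */
static void toy_mark_zone_full(struct toy_zlc *zlc, int zone_idx)
{
        zlc->fullzones[zone_idx] = true;
}

int main(void)
{
        struct toy_zlc zlc = { .last_full_zap = 0 };

        toy_zlc_setup(&zlc);           /* stale bits (none yet) would be cleared here */
        toy_mark_zone_full(&zlc, 0);   /* zone 0 failed its watermark check */
        printf("zone 0 worth trying: %d\n", toy_zone_worth_trying(&zlc, 0));
        printf("zone 1 worth trying: %d\n", toy_zone_worth_trying(&zlc, 1));
        return 0;
}
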
1374 1374
1375 /* 1375 /*
1376 * get_page_from_freelist goes through the zonelist trying to allocate 1376 * get_page_from_freelist goes through the zonelist trying to allocate
1377 * a page. 1377 * a page.
1378 */ 1378 */
1379 static struct page * 1379 static struct page *
1380 get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order, 1380 get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
1381 struct zonelist *zonelist, int high_zoneidx, int alloc_flags) 1381 struct zonelist *zonelist, int high_zoneidx, int alloc_flags)
1382 { 1382 {
1383 struct zoneref *z; 1383 struct zoneref *z;
1384 struct page *page = NULL; 1384 struct page *page = NULL;
1385 int classzone_idx; 1385 int classzone_idx;
1386 struct zone *zone, *preferred_zone; 1386 struct zone *zone, *preferred_zone;
1387 nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */ 1387 nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
1388 int zlc_active = 0; /* set if using zonelist_cache */ 1388 int zlc_active = 0; /* set if using zonelist_cache */
1389 int did_zlc_setup = 0; /* just call zlc_setup() one time */ 1389 int did_zlc_setup = 0; /* just call zlc_setup() one time */
1390 1390
1391 (void)first_zones_zonelist(zonelist, high_zoneidx, nodemask, 1391 (void)first_zones_zonelist(zonelist, high_zoneidx, nodemask,
1392 &preferred_zone); 1392 &preferred_zone);
1393 classzone_idx = zone_idx(preferred_zone); 1393 classzone_idx = zone_idx(preferred_zone);
1394 1394
1395 zonelist_scan: 1395 zonelist_scan:
1396 /* 1396 /*
1397 * Scan zonelist, looking for a zone with enough free. 1397 * Scan zonelist, looking for a zone with enough free.
1398 * See also cpuset_zone_allowed() comment in kernel/cpuset.c. 1398 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
1399 */ 1399 */
1400 for_each_zone_zonelist_nodemask(zone, z, zonelist, 1400 for_each_zone_zonelist_nodemask(zone, z, zonelist,
1401 high_zoneidx, nodemask) { 1401 high_zoneidx, nodemask) {
1402 if (NUMA_BUILD && zlc_active && 1402 if (NUMA_BUILD && zlc_active &&
1403 !zlc_zone_worth_trying(zonelist, z, allowednodes)) 1403 !zlc_zone_worth_trying(zonelist, z, allowednodes))
1404 continue; 1404 continue;
1405 if ((alloc_flags & ALLOC_CPUSET) && 1405 if ((alloc_flags & ALLOC_CPUSET) &&
1406 !cpuset_zone_allowed_softwall(zone, gfp_mask)) 1406 !cpuset_zone_allowed_softwall(zone, gfp_mask))
1407 goto try_next_zone; 1407 goto try_next_zone;
1408 1408
1409 if (!(alloc_flags & ALLOC_NO_WATERMARKS)) { 1409 if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
1410 unsigned long mark; 1410 unsigned long mark;
1411 if (alloc_flags & ALLOC_WMARK_MIN) 1411 if (alloc_flags & ALLOC_WMARK_MIN)
1412 mark = zone->pages_min; 1412 mark = zone->pages_min;
1413 else if (alloc_flags & ALLOC_WMARK_LOW) 1413 else if (alloc_flags & ALLOC_WMARK_LOW)
1414 mark = zone->pages_low; 1414 mark = zone->pages_low;
1415 else 1415 else
1416 mark = zone->pages_high; 1416 mark = zone->pages_high;
1417 if (!zone_watermark_ok(zone, order, mark, 1417 if (!zone_watermark_ok(zone, order, mark,
1418 classzone_idx, alloc_flags)) { 1418 classzone_idx, alloc_flags)) {
1419 if (!zone_reclaim_mode || 1419 if (!zone_reclaim_mode ||
1420 !zone_reclaim(zone, gfp_mask, order)) 1420 !zone_reclaim(zone, gfp_mask, order))
1421 goto this_zone_full; 1421 goto this_zone_full;
1422 } 1422 }
1423 } 1423 }
1424 1424
1425 page = buffered_rmqueue(preferred_zone, zone, order, gfp_mask); 1425 page = buffered_rmqueue(preferred_zone, zone, order, gfp_mask);
1426 if (page) 1426 if (page)
1427 break; 1427 break;
1428 this_zone_full: 1428 this_zone_full:
1429 if (NUMA_BUILD) 1429 if (NUMA_BUILD)
1430 zlc_mark_zone_full(zonelist, z); 1430 zlc_mark_zone_full(zonelist, z);
1431 try_next_zone: 1431 try_next_zone:
1432 if (NUMA_BUILD && !did_zlc_setup) { 1432 if (NUMA_BUILD && !did_zlc_setup) {
1433 /* we do zlc_setup after the first zone is tried */ 1433 /* we do zlc_setup after the first zone is tried */
1434 allowednodes = zlc_setup(zonelist, alloc_flags); 1434 allowednodes = zlc_setup(zonelist, alloc_flags);
1435 zlc_active = 1; 1435 zlc_active = 1;
1436 did_zlc_setup = 1; 1436 did_zlc_setup = 1;
1437 } 1437 }
1438 } 1438 }
1439 1439
1440 if (unlikely(NUMA_BUILD && page == NULL && zlc_active)) { 1440 if (unlikely(NUMA_BUILD && page == NULL && zlc_active)) {
1441 /* Disable zlc cache for second zonelist scan */ 1441 /* Disable zlc cache for second zonelist scan */
1442 zlc_active = 0; 1442 zlc_active = 0;
1443 goto zonelist_scan; 1443 goto zonelist_scan;
1444 } 1444 }
1445 return page; 1445 return page;
1446 } 1446 }
1447 1447
1448 /* 1448 /*
1449 * This is the 'heart' of the zoned buddy allocator. 1449 * This is the 'heart' of the zoned buddy allocator.
1450 */ 1450 */
1451 static struct page * 1451 static struct page *
1452 __alloc_pages_internal(gfp_t gfp_mask, unsigned int order, 1452 __alloc_pages_internal(gfp_t gfp_mask, unsigned int order,
1453 struct zonelist *zonelist, nodemask_t *nodemask) 1453 struct zonelist *zonelist, nodemask_t *nodemask)
1454 { 1454 {
1455 const gfp_t wait = gfp_mask & __GFP_WAIT; 1455 const gfp_t wait = gfp_mask & __GFP_WAIT;
1456 enum zone_type high_zoneidx = gfp_zone(gfp_mask); 1456 enum zone_type high_zoneidx = gfp_zone(gfp_mask);
1457 struct zoneref *z; 1457 struct zoneref *z;
1458 struct zone *zone; 1458 struct zone *zone;
1459 struct page *page; 1459 struct page *page;
1460 struct reclaim_state reclaim_state; 1460 struct reclaim_state reclaim_state;
1461 struct task_struct *p = current; 1461 struct task_struct *p = current;
1462 int do_retry; 1462 int do_retry;
1463 int alloc_flags; 1463 int alloc_flags;
1464 int did_some_progress; 1464 unsigned long did_some_progress;
1465 unsigned long pages_reclaimed = 0;
1465 1466
1466 might_sleep_if(wait); 1467 might_sleep_if(wait);
1467 1468
1468 if (should_fail_alloc_page(gfp_mask, order)) 1469 if (should_fail_alloc_page(gfp_mask, order))
1469 return NULL; 1470 return NULL;
1470 1471
1471 restart: 1472 restart:
1472 z = zonelist->_zonerefs; /* the list of zones suitable for gfp_mask */ 1473 z = zonelist->_zonerefs; /* the list of zones suitable for gfp_mask */
1473 1474
1474 if (unlikely(!z->zone)) { 1475 if (unlikely(!z->zone)) {
1475 /* 1476 /*
1476 * Happens if we have an empty zonelist as a result of 1477 * Happens if we have an empty zonelist as a result of
1477 * GFP_THISNODE being used on a memoryless node 1478 * GFP_THISNODE being used on a memoryless node
1478 */ 1479 */
1479 return NULL; 1480 return NULL;
1480 } 1481 }
1481 1482
1482 page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order, 1483 page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
1483 zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET); 1484 zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
1484 if (page) 1485 if (page)
1485 goto got_pg; 1486 goto got_pg;
1486 1487
1487 /* 1488 /*
1488 * GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and 1489 * GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and
1489 * __GFP_NOWARN set) should not cause reclaim since the subsystem 1490 * __GFP_NOWARN set) should not cause reclaim since the subsystem
1490 * (f.e. slab) using GFP_THISNODE may choose to trigger reclaim 1491 * (f.e. slab) using GFP_THISNODE may choose to trigger reclaim
1491 * using a larger set of nodes after it has established that the 1492 * using a larger set of nodes after it has established that the
1492 * allowed per node queues are empty and that nodes are 1493 * allowed per node queues are empty and that nodes are
1493 * over allocated. 1494 * over allocated.
1494 */ 1495 */
1495 if (NUMA_BUILD && (gfp_mask & GFP_THISNODE) == GFP_THISNODE) 1496 if (NUMA_BUILD && (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
1496 goto nopage; 1497 goto nopage;
1497 1498
1498 for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) 1499 for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
1499 wakeup_kswapd(zone, order); 1500 wakeup_kswapd(zone, order);
1500 1501
1501 /* 1502 /*
1502 * OK, we're below the kswapd watermark and have kicked background 1503 * OK, we're below the kswapd watermark and have kicked background
1503 * reclaim. Now things get more complex, so set up alloc_flags according 1504 * reclaim. Now things get more complex, so set up alloc_flags according
1504 * to how we want to proceed. 1505 * to how we want to proceed.
1505 * 1506 *
1506 * The caller may dip into page reserves a bit more if the caller 1507 * The caller may dip into page reserves a bit more if the caller
1507 * cannot run direct reclaim, or if the caller has realtime scheduling 1508 * cannot run direct reclaim, or if the caller has realtime scheduling
1508 * policy or is asking for __GFP_HIGH memory. GFP_ATOMIC requests will 1509 * policy or is asking for __GFP_HIGH memory. GFP_ATOMIC requests will
1509 * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH). 1510 * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
1510 */ 1511 */
1511 alloc_flags = ALLOC_WMARK_MIN; 1512 alloc_flags = ALLOC_WMARK_MIN;
1512 if ((unlikely(rt_task(p)) && !in_interrupt()) || !wait) 1513 if ((unlikely(rt_task(p)) && !in_interrupt()) || !wait)
1513 alloc_flags |= ALLOC_HARDER; 1514 alloc_flags |= ALLOC_HARDER;
1514 if (gfp_mask & __GFP_HIGH) 1515 if (gfp_mask & __GFP_HIGH)
1515 alloc_flags |= ALLOC_HIGH; 1516 alloc_flags |= ALLOC_HIGH;
1516 if (wait) 1517 if (wait)
1517 alloc_flags |= ALLOC_CPUSET; 1518 alloc_flags |= ALLOC_CPUSET;
1518 1519
1519 /* 1520 /*
1520 * Go through the zonelist again. Let __GFP_HIGH and allocations 1521 * Go through the zonelist again. Let __GFP_HIGH and allocations
1521 * coming from realtime tasks go deeper into reserves. 1522 * coming from realtime tasks go deeper into reserves.
1522 * 1523 *
1523 * This is the last chance, in general, before the goto nopage. 1524 * This is the last chance, in general, before the goto nopage.
1524 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc. 1525 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
1525 * See also cpuset_zone_allowed() comment in kernel/cpuset.c. 1526 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
1526 */ 1527 */
1527 page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist, 1528 page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
1528 high_zoneidx, alloc_flags); 1529 high_zoneidx, alloc_flags);
1529 if (page) 1530 if (page)
1530 goto got_pg; 1531 goto got_pg;
1531 1532
1532 /* This allocation should allow future memory freeing. */ 1533 /* This allocation should allow future memory freeing. */
1533 1534
1534 rebalance: 1535 rebalance:
1535 if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE))) 1536 if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
1536 && !in_interrupt()) { 1537 && !in_interrupt()) {
1537 if (!(gfp_mask & __GFP_NOMEMALLOC)) { 1538 if (!(gfp_mask & __GFP_NOMEMALLOC)) {
1538 nofail_alloc: 1539 nofail_alloc:
1539 /* go through the zonelist yet again, ignoring mins */ 1540 /* go through the zonelist yet again, ignoring mins */
1540 page = get_page_from_freelist(gfp_mask, nodemask, order, 1541 page = get_page_from_freelist(gfp_mask, nodemask, order,
1541 zonelist, high_zoneidx, ALLOC_NO_WATERMARKS); 1542 zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);
1542 if (page) 1543 if (page)
1543 goto got_pg; 1544 goto got_pg;
1544 if (gfp_mask & __GFP_NOFAIL) { 1545 if (gfp_mask & __GFP_NOFAIL) {
1545 congestion_wait(WRITE, HZ/50); 1546 congestion_wait(WRITE, HZ/50);
1546 goto nofail_alloc; 1547 goto nofail_alloc;
1547 } 1548 }
1548 } 1549 }
1549 goto nopage; 1550 goto nopage;
1550 } 1551 }
1551 1552
1552 /* Atomic allocations - we can't balance anything */ 1553 /* Atomic allocations - we can't balance anything */
1553 if (!wait) 1554 if (!wait)
1554 goto nopage; 1555 goto nopage;
1555 1556
1556 cond_resched(); 1557 cond_resched();
1557 1558
1558 /* We now go into synchronous reclaim */ 1559 /* We now go into synchronous reclaim */
1559 cpuset_memory_pressure_bump(); 1560 cpuset_memory_pressure_bump();
1560 p->flags |= PF_MEMALLOC; 1561 p->flags |= PF_MEMALLOC;
1561 reclaim_state.reclaimed_slab = 0; 1562 reclaim_state.reclaimed_slab = 0;
1562 p->reclaim_state = &reclaim_state; 1563 p->reclaim_state = &reclaim_state;
1563 1564
1564 did_some_progress = try_to_free_pages(zonelist, order, gfp_mask); 1565 did_some_progress = try_to_free_pages(zonelist, order, gfp_mask);
1565 1566
1566 p->reclaim_state = NULL; 1567 p->reclaim_state = NULL;
1567 p->flags &= ~PF_MEMALLOC; 1568 p->flags &= ~PF_MEMALLOC;
1568 1569
1569 cond_resched(); 1570 cond_resched();
1570 1571
1571 if (order != 0) 1572 if (order != 0)
1572 drain_all_pages(); 1573 drain_all_pages();
1573 1574
1574 if (likely(did_some_progress)) { 1575 if (likely(did_some_progress)) {
1575 page = get_page_from_freelist(gfp_mask, nodemask, order, 1576 page = get_page_from_freelist(gfp_mask, nodemask, order,
1576 zonelist, high_zoneidx, alloc_flags); 1577 zonelist, high_zoneidx, alloc_flags);
1577 if (page) 1578 if (page)
1578 goto got_pg; 1579 goto got_pg;
1579 } else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) { 1580 } else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
1580 if (!try_set_zone_oom(zonelist, gfp_mask)) { 1581 if (!try_set_zone_oom(zonelist, gfp_mask)) {
1581 schedule_timeout_uninterruptible(1); 1582 schedule_timeout_uninterruptible(1);
1582 goto restart; 1583 goto restart;
1583 } 1584 }
1584 1585
1585 /* 1586 /*
1586 * Go through the zonelist yet one more time, keep 1587 * Go through the zonelist yet one more time, keep
1587 * very high watermark here, this is only to catch 1588 * very high watermark here, this is only to catch
1588 * a parallel oom killing, we must fail if we're still 1589 * a parallel oom killing, we must fail if we're still
1589 * under heavy pressure. 1590 * under heavy pressure.
1590 */ 1591 */
1591 page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, 1592 page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
1592 order, zonelist, high_zoneidx, 1593 order, zonelist, high_zoneidx,
1593 ALLOC_WMARK_HIGH|ALLOC_CPUSET); 1594 ALLOC_WMARK_HIGH|ALLOC_CPUSET);
1594 if (page) { 1595 if (page) {
1595 clear_zonelist_oom(zonelist, gfp_mask); 1596 clear_zonelist_oom(zonelist, gfp_mask);
1596 goto got_pg; 1597 goto got_pg;
1597 } 1598 }
1598 1599
1599 /* The OOM killer will not help higher order allocs so fail */ 1600 /* The OOM killer will not help higher order allocs so fail */
1600 if (order > PAGE_ALLOC_COSTLY_ORDER) { 1601 if (order > PAGE_ALLOC_COSTLY_ORDER) {
1601 clear_zonelist_oom(zonelist, gfp_mask); 1602 clear_zonelist_oom(zonelist, gfp_mask);
1602 goto nopage; 1603 goto nopage;
1603 } 1604 }
1604 1605
1605 out_of_memory(zonelist, gfp_mask, order); 1606 out_of_memory(zonelist, gfp_mask, order);
1606 clear_zonelist_oom(zonelist, gfp_mask); 1607 clear_zonelist_oom(zonelist, gfp_mask);
1607 goto restart; 1608 goto restart;
1608 } 1609 }
1609 1610
1610 /* 1611 /*
1611 * Don't let big-order allocations loop unless the caller explicitly 1612 * Don't let big-order allocations loop unless the caller explicitly
1612 * requests that. Wait for some write requests to complete then retry. 1613 * requests that. Wait for some write requests to complete then retry.
1613 * 1614 *
1614 * In this implementation, either order <= PAGE_ALLOC_COSTLY_ORDER or 1615 * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
1615 * __GFP_REPEAT mean __GFP_NOFAIL, but that may not be true in other 1616 * means __GFP_NOFAIL, but that may not be true in other
1616 * implementations. 1617 * implementations.
1618 *
1619 * For order > PAGE_ALLOC_COSTLY_ORDER, if __GFP_REPEAT is
1620 * specified, then we retry until we no longer reclaim any pages
1621 * (above), or we've reclaimed an order of pages at least as
1622 * large as the allocation's order. In both cases, if the
1623 * allocation still fails, we stop retrying.
1617 */ 1624 */
1625 pages_reclaimed += did_some_progress;
1618 do_retry = 0; 1626 do_retry = 0;
1619 if (!(gfp_mask & __GFP_NORETRY)) { 1627 if (!(gfp_mask & __GFP_NORETRY)) {
1620 if ((order <= PAGE_ALLOC_COSTLY_ORDER) || 1628 if (order <= PAGE_ALLOC_COSTLY_ORDER) {
1621 (gfp_mask & __GFP_REPEAT))
1622 do_retry = 1; 1629 do_retry = 1;
1630 } else {
1631 if (gfp_mask & __GFP_REPEAT &&
1632 pages_reclaimed < (1 << order))
1633 do_retry = 1;
1634 }
1623 if (gfp_mask & __GFP_NOFAIL) 1635 if (gfp_mask & __GFP_NOFAIL)
1624 do_retry = 1; 1636 do_retry = 1;
1625 } 1637 }
1626 if (do_retry) { 1638 if (do_retry) {
1627 congestion_wait(WRITE, HZ/50); 1639 congestion_wait(WRITE, HZ/50);
1628 goto rebalance; 1640 goto rebalance;
1629 } 1641 }
1630 1642
1631 nopage: 1643 nopage:
1632 if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit()) { 1644 if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit()) {
1633 printk(KERN_WARNING "%s: page allocation failure." 1645 printk(KERN_WARNING "%s: page allocation failure."
1634 " order:%d, mode:0x%x\n", 1646 " order:%d, mode:0x%x\n",
1635 p->comm, order, gfp_mask); 1647 p->comm, order, gfp_mask);
1636 dump_stack(); 1648 dump_stack();
1637 show_mem(); 1649 show_mem();
1638 } 1650 }
1639 got_pg: 1651 got_pg:
1640 return page; 1652 return page;
1641 } 1653 }
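
As a rough, standalone sketch of the retry decision this patch introduces for costly orders, the following toy uses hypothetical names (should_retry_alloc, the TOY_GFP_* bits) and a simulated reclaim rate rather than the kernel's API; it shows how a __GFP_REPEAT request above PAGE_ALLOC_COSTLY_ORDER now stops retrying once roughly 1 << order pages have been reclaimed in total.

#include <stdbool.h>
#include <stdio.h>

#define PAGE_ALLOC_COSTLY_ORDER 3   /* matches the kernel's value at the time */

/* Hypothetical flag bits standing in for the real __GFP_* masks */
#define TOY_GFP_NORETRY 0x1
#define TOY_GFP_REPEAT  0x2
#define TOY_GFP_NOFAIL  0x4

/*
 * After one pass of direct reclaim:
 *  - __GFP_NORETRY never retries;
 *  - __GFP_NOFAIL always retries;
 *  - order <= PAGE_ALLOC_COSTLY_ORDER keeps the old "retry forever" behaviour;
 *  - costly orders with __GFP_REPEAT retry only while the cumulative number
 *    of reclaimed pages is still below 1 << order.
 */
static bool should_retry_alloc(unsigned int gfp_mask, unsigned int order,
                               unsigned long pages_reclaimed)
{
        if (gfp_mask & TOY_GFP_NORETRY)
                return false;
        if (gfp_mask & TOY_GFP_NOFAIL)
                return true;
        if (order <= PAGE_ALLOC_COSTLY_ORDER)
                return true;
        return (gfp_mask & TOY_GFP_REPEAT) && pages_reclaimed < (1UL << order);
}

int main(void)
{
        /* An order-9 (hugepage-sized) __GFP_REPEAT request: it keeps retrying
         * until reclaim has cumulatively freed at least 512 pages, then fails. */
        unsigned long reclaimed = 0;
        int attempts = 0;

        while (should_retry_alloc(TOY_GFP_REPEAT, 9, reclaimed)) {
                reclaimed += 100;   /* pretend each pass reclaims 100 pages */
                attempts++;
        }
        printf("gave up after %d retries, %lu pages reclaimed\n",
               attempts, reclaimed);
        return 0;
}
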
1642 1654
1643 struct page * 1655 struct page *
1644 __alloc_pages(gfp_t gfp_mask, unsigned int order, 1656 __alloc_pages(gfp_t gfp_mask, unsigned int order,
1645 struct zonelist *zonelist) 1657 struct zonelist *zonelist)
1646 { 1658 {
1647 return __alloc_pages_internal(gfp_mask, order, zonelist, NULL); 1659 return __alloc_pages_internal(gfp_mask, order, zonelist, NULL);
1648 } 1660 }
1649 1661
1650 struct page * 1662 struct page *
1651 __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, 1663 __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
1652 struct zonelist *zonelist, nodemask_t *nodemask) 1664 struct zonelist *zonelist, nodemask_t *nodemask)
1653 { 1665 {
1654 return __alloc_pages_internal(gfp_mask, order, zonelist, nodemask); 1666 return __alloc_pages_internal(gfp_mask, order, zonelist, nodemask);
1655 } 1667 }
1656 1668
1657 EXPORT_SYMBOL(__alloc_pages); 1669 EXPORT_SYMBOL(__alloc_pages);
1658 1670
1659 /* 1671 /*
1660 * Common helper functions. 1672 * Common helper functions.
1661 */ 1673 */
1662 unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order) 1674 unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order)
1663 { 1675 {
1664 struct page * page; 1676 struct page * page;
1665 page = alloc_pages(gfp_mask, order); 1677 page = alloc_pages(gfp_mask, order);
1666 if (!page) 1678 if (!page)
1667 return 0; 1679 return 0;
1668 return (unsigned long) page_address(page); 1680 return (unsigned long) page_address(page);
1669 } 1681 }
1670 1682
1671 EXPORT_SYMBOL(__get_free_pages); 1683 EXPORT_SYMBOL(__get_free_pages);
1672 1684
1673 unsigned long get_zeroed_page(gfp_t gfp_mask) 1685 unsigned long get_zeroed_page(gfp_t gfp_mask)
1674 { 1686 {
1675 struct page * page; 1687 struct page * page;
1676 1688
1677 /* 1689 /*
1678 * get_zeroed_page() returns a 32-bit address, which cannot represent 1690 * get_zeroed_page() returns a 32-bit address, which cannot represent
1679 * a highmem page 1691 * a highmem page
1680 */ 1692 */
1681 VM_BUG_ON((gfp_mask & __GFP_HIGHMEM) != 0); 1693 VM_BUG_ON((gfp_mask & __GFP_HIGHMEM) != 0);
1682 1694
1683 page = alloc_pages(gfp_mask | __GFP_ZERO, 0); 1695 page = alloc_pages(gfp_mask | __GFP_ZERO, 0);
1684 if (page) 1696 if (page)
1685 return (unsigned long) page_address(page); 1697 return (unsigned long) page_address(page);
1686 return 0; 1698 return 0;
1687 } 1699 }
1688 1700
1689 EXPORT_SYMBOL(get_zeroed_page); 1701 EXPORT_SYMBOL(get_zeroed_page);
1690 1702
1691 void __pagevec_free(struct pagevec *pvec) 1703 void __pagevec_free(struct pagevec *pvec)
1692 { 1704 {
1693 int i = pagevec_count(pvec); 1705 int i = pagevec_count(pvec);
1694 1706
1695 while (--i >= 0) 1707 while (--i >= 0)
1696 free_hot_cold_page(pvec->pages[i], pvec->cold); 1708 free_hot_cold_page(pvec->pages[i], pvec->cold);
1697 } 1709 }
1698 1710
1699 void __free_pages(struct page *page, unsigned int order) 1711 void __free_pages(struct page *page, unsigned int order)
1700 { 1712 {
1701 if (put_page_testzero(page)) { 1713 if (put_page_testzero(page)) {
1702 if (order == 0) 1714 if (order == 0)
1703 free_hot_page(page); 1715 free_hot_page(page);
1704 else 1716 else
1705 __free_pages_ok(page, order); 1717 __free_pages_ok(page, order);
1706 } 1718 }
1707 } 1719 }
1708 1720
1709 EXPORT_SYMBOL(__free_pages); 1721 EXPORT_SYMBOL(__free_pages);
1710 1722
1711 void free_pages(unsigned long addr, unsigned int order) 1723 void free_pages(unsigned long addr, unsigned int order)
1712 { 1724 {
1713 if (addr != 0) { 1725 if (addr != 0) {
1714 VM_BUG_ON(!virt_addr_valid((void *)addr)); 1726 VM_BUG_ON(!virt_addr_valid((void *)addr));
1715 __free_pages(virt_to_page((void *)addr), order); 1727 __free_pages(virt_to_page((void *)addr), order);
1716 } 1728 }
1717 } 1729 }
1718 1730
1719 EXPORT_SYMBOL(free_pages); 1731 EXPORT_SYMBOL(free_pages);
1720 1732
1721 static unsigned int nr_free_zone_pages(int offset) 1733 static unsigned int nr_free_zone_pages(int offset)
1722 { 1734 {
1723 struct zoneref *z; 1735 struct zoneref *z;
1724 struct zone *zone; 1736 struct zone *zone;
1725 1737
1726 /* Just pick one node, since fallback list is circular */ 1738 /* Just pick one node, since fallback list is circular */
1727 unsigned int sum = 0; 1739 unsigned int sum = 0;
1728 1740
1729 struct zonelist *zonelist = node_zonelist(numa_node_id(), GFP_KERNEL); 1741 struct zonelist *zonelist = node_zonelist(numa_node_id(), GFP_KERNEL);
1730 1742
1731 for_each_zone_zonelist(zone, z, zonelist, offset) { 1743 for_each_zone_zonelist(zone, z, zonelist, offset) {
1732 unsigned long size = zone->present_pages; 1744 unsigned long size = zone->present_pages;
1733 unsigned long high = zone->pages_high; 1745 unsigned long high = zone->pages_high;
1734 if (size > high) 1746 if (size > high)
1735 sum += size - high; 1747 sum += size - high;
1736 } 1748 }
1737 1749
1738 return sum; 1750 return sum;
1739 } 1751 }
1740 1752
1741 /* 1753 /*
1742 * Amount of free RAM allocatable within ZONE_DMA and ZONE_NORMAL 1754 * Amount of free RAM allocatable within ZONE_DMA and ZONE_NORMAL
1743 */ 1755 */
1744 unsigned int nr_free_buffer_pages(void) 1756 unsigned int nr_free_buffer_pages(void)
1745 { 1757 {
1746 return nr_free_zone_pages(gfp_zone(GFP_USER)); 1758 return nr_free_zone_pages(gfp_zone(GFP_USER));
1747 } 1759 }
1748 EXPORT_SYMBOL_GPL(nr_free_buffer_pages); 1760 EXPORT_SYMBOL_GPL(nr_free_buffer_pages);
1749 1761
1750 /* 1762 /*
1751 * Amount of free RAM allocatable within all zones 1763 * Amount of free RAM allocatable within all zones
1752 */ 1764 */
1753 unsigned int nr_free_pagecache_pages(void) 1765 unsigned int nr_free_pagecache_pages(void)
1754 { 1766 {
1755 return nr_free_zone_pages(gfp_zone(GFP_HIGHUSER_MOVABLE)); 1767 return nr_free_zone_pages(gfp_zone(GFP_HIGHUSER_MOVABLE));
1756 } 1768 }
1757 1769
1758 static inline void show_node(struct zone *zone) 1770 static inline void show_node(struct zone *zone)
1759 { 1771 {
1760 if (NUMA_BUILD) 1772 if (NUMA_BUILD)
1761 printk("Node %d ", zone_to_nid(zone)); 1773 printk("Node %d ", zone_to_nid(zone));
1762 } 1774 }
1763 1775
1764 void si_meminfo(struct sysinfo *val) 1776 void si_meminfo(struct sysinfo *val)
1765 { 1777 {
1766 val->totalram = totalram_pages; 1778 val->totalram = totalram_pages;
1767 val->sharedram = 0; 1779 val->sharedram = 0;
1768 val->freeram = global_page_state(NR_FREE_PAGES); 1780 val->freeram = global_page_state(NR_FREE_PAGES);
1769 val->bufferram = nr_blockdev_pages(); 1781 val->bufferram = nr_blockdev_pages();
1770 val->totalhigh = totalhigh_pages; 1782 val->totalhigh = totalhigh_pages;
1771 val->freehigh = nr_free_highpages(); 1783 val->freehigh = nr_free_highpages();
1772 val->mem_unit = PAGE_SIZE; 1784 val->mem_unit = PAGE_SIZE;
1773 } 1785 }
1774 1786
1775 EXPORT_SYMBOL(si_meminfo); 1787 EXPORT_SYMBOL(si_meminfo);
1776 1788
1777 #ifdef CONFIG_NUMA 1789 #ifdef CONFIG_NUMA
1778 void si_meminfo_node(struct sysinfo *val, int nid) 1790 void si_meminfo_node(struct sysinfo *val, int nid)
1779 { 1791 {
1780 pg_data_t *pgdat = NODE_DATA(nid); 1792 pg_data_t *pgdat = NODE_DATA(nid);
1781 1793
1782 val->totalram = pgdat->node_present_pages; 1794 val->totalram = pgdat->node_present_pages;
1783 val->freeram = node_page_state(nid, NR_FREE_PAGES); 1795 val->freeram = node_page_state(nid, NR_FREE_PAGES);
1784 #ifdef CONFIG_HIGHMEM 1796 #ifdef CONFIG_HIGHMEM
1785 val->totalhigh = pgdat->node_zones[ZONE_HIGHMEM].present_pages; 1797 val->totalhigh = pgdat->node_zones[ZONE_HIGHMEM].present_pages;
1786 val->freehigh = zone_page_state(&pgdat->node_zones[ZONE_HIGHMEM], 1798 val->freehigh = zone_page_state(&pgdat->node_zones[ZONE_HIGHMEM],
1787 NR_FREE_PAGES); 1799 NR_FREE_PAGES);
1788 #else 1800 #else
1789 val->totalhigh = 0; 1801 val->totalhigh = 0;
1790 val->freehigh = 0; 1802 val->freehigh = 0;
1791 #endif 1803 #endif
1792 val->mem_unit = PAGE_SIZE; 1804 val->mem_unit = PAGE_SIZE;
1793 } 1805 }
1794 #endif 1806 #endif
1795 1807
1796 #define K(x) ((x) << (PAGE_SHIFT-10)) 1808 #define K(x) ((x) << (PAGE_SHIFT-10))
1797 1809
1798 /* 1810 /*
1799 * Show free area list (used inside shift_scroll-lock stuff) 1811 * Show free area list (used inside shift_scroll-lock stuff)
1800 * We also calculate the percentage fragmentation. We do this by counting the 1812 * We also calculate the percentage fragmentation. We do this by counting the
1801 * memory on each free list with the exception of the first item on the list. 1813 * memory on each free list with the exception of the first item on the list.
1802 */ 1814 */
1803 void show_free_areas(void) 1815 void show_free_areas(void)
1804 { 1816 {
1805 int cpu; 1817 int cpu;
1806 struct zone *zone; 1818 struct zone *zone;
1807 1819
1808 for_each_zone(zone) { 1820 for_each_zone(zone) {
1809 if (!populated_zone(zone)) 1821 if (!populated_zone(zone))
1810 continue; 1822 continue;
1811 1823
1812 show_node(zone); 1824 show_node(zone);
1813 printk("%s per-cpu:\n", zone->name); 1825 printk("%s per-cpu:\n", zone->name);
1814 1826
1815 for_each_online_cpu(cpu) { 1827 for_each_online_cpu(cpu) {
1816 struct per_cpu_pageset *pageset; 1828 struct per_cpu_pageset *pageset;
1817 1829
1818 pageset = zone_pcp(zone, cpu); 1830 pageset = zone_pcp(zone, cpu);
1819 1831
1820 printk("CPU %4d: hi:%5d, btch:%4d usd:%4d\n", 1832 printk("CPU %4d: hi:%5d, btch:%4d usd:%4d\n",
1821 cpu, pageset->pcp.high, 1833 cpu, pageset->pcp.high,
1822 pageset->pcp.batch, pageset->pcp.count); 1834 pageset->pcp.batch, pageset->pcp.count);
1823 } 1835 }
1824 } 1836 }
1825 1837
1826 printk("Active:%lu inactive:%lu dirty:%lu writeback:%lu unstable:%lu\n" 1838 printk("Active:%lu inactive:%lu dirty:%lu writeback:%lu unstable:%lu\n"
1827 " free:%lu slab:%lu mapped:%lu pagetables:%lu bounce:%lu\n", 1839 " free:%lu slab:%lu mapped:%lu pagetables:%lu bounce:%lu\n",
1828 global_page_state(NR_ACTIVE), 1840 global_page_state(NR_ACTIVE),
1829 global_page_state(NR_INACTIVE), 1841 global_page_state(NR_INACTIVE),
1830 global_page_state(NR_FILE_DIRTY), 1842 global_page_state(NR_FILE_DIRTY),
1831 global_page_state(NR_WRITEBACK), 1843 global_page_state(NR_WRITEBACK),
1832 global_page_state(NR_UNSTABLE_NFS), 1844 global_page_state(NR_UNSTABLE_NFS),
1833 global_page_state(NR_FREE_PAGES), 1845 global_page_state(NR_FREE_PAGES),
1834 global_page_state(NR_SLAB_RECLAIMABLE) + 1846 global_page_state(NR_SLAB_RECLAIMABLE) +
1835 global_page_state(NR_SLAB_UNRECLAIMABLE), 1847 global_page_state(NR_SLAB_UNRECLAIMABLE),
1836 global_page_state(NR_FILE_MAPPED), 1848 global_page_state(NR_FILE_MAPPED),
1837 global_page_state(NR_PAGETABLE), 1849 global_page_state(NR_PAGETABLE),
1838 global_page_state(NR_BOUNCE)); 1850 global_page_state(NR_BOUNCE));
1839 1851
1840 for_each_zone(zone) { 1852 for_each_zone(zone) {
1841 int i; 1853 int i;
1842 1854
1843 if (!populated_zone(zone)) 1855 if (!populated_zone(zone))
1844 continue; 1856 continue;
1845 1857
1846 show_node(zone); 1858 show_node(zone);
1847 printk("%s" 1859 printk("%s"
1848 " free:%lukB" 1860 " free:%lukB"
1849 " min:%lukB" 1861 " min:%lukB"
1850 " low:%lukB" 1862 " low:%lukB"
1851 " high:%lukB" 1863 " high:%lukB"
1852 " active:%lukB" 1864 " active:%lukB"
1853 " inactive:%lukB" 1865 " inactive:%lukB"
1854 " present:%lukB" 1866 " present:%lukB"
1855 " pages_scanned:%lu" 1867 " pages_scanned:%lu"
1856 " all_unreclaimable? %s" 1868 " all_unreclaimable? %s"
1857 "\n", 1869 "\n",
1858 zone->name, 1870 zone->name,
1859 K(zone_page_state(zone, NR_FREE_PAGES)), 1871 K(zone_page_state(zone, NR_FREE_PAGES)),
1860 K(zone->pages_min), 1872 K(zone->pages_min),
1861 K(zone->pages_low), 1873 K(zone->pages_low),
1862 K(zone->pages_high), 1874 K(zone->pages_high),
1863 K(zone_page_state(zone, NR_ACTIVE)), 1875 K(zone_page_state(zone, NR_ACTIVE)),
1864 K(zone_page_state(zone, NR_INACTIVE)), 1876 K(zone_page_state(zone, NR_INACTIVE)),
1865 K(zone->present_pages), 1877 K(zone->present_pages),
1866 zone->pages_scanned, 1878 zone->pages_scanned,
1867 (zone_is_all_unreclaimable(zone) ? "yes" : "no") 1879 (zone_is_all_unreclaimable(zone) ? "yes" : "no")
1868 ); 1880 );
1869 printk("lowmem_reserve[]:"); 1881 printk("lowmem_reserve[]:");
1870 for (i = 0; i < MAX_NR_ZONES; i++) 1882 for (i = 0; i < MAX_NR_ZONES; i++)
1871 printk(" %lu", zone->lowmem_reserve[i]); 1883 printk(" %lu", zone->lowmem_reserve[i]);
1872 printk("\n"); 1884 printk("\n");
1873 } 1885 }
1874 1886
1875 for_each_zone(zone) { 1887 for_each_zone(zone) {
1876 unsigned long nr[MAX_ORDER], flags, order, total = 0; 1888 unsigned long nr[MAX_ORDER], flags, order, total = 0;
1877 1889
1878 if (!populated_zone(zone)) 1890 if (!populated_zone(zone))
1879 continue; 1891 continue;
1880 1892
1881 show_node(zone); 1893 show_node(zone);
1882 printk("%s: ", zone->name); 1894 printk("%s: ", zone->name);
1883 1895
1884 spin_lock_irqsave(&zone->lock, flags); 1896 spin_lock_irqsave(&zone->lock, flags);
1885 for (order = 0; order < MAX_ORDER; order++) { 1897 for (order = 0; order < MAX_ORDER; order++) {
1886 nr[order] = zone->free_area[order].nr_free; 1898 nr[order] = zone->free_area[order].nr_free;
1887 total += nr[order] << order; 1899 total += nr[order] << order;
1888 } 1900 }
1889 spin_unlock_irqrestore(&zone->lock, flags); 1901 spin_unlock_irqrestore(&zone->lock, flags);
1890 for (order = 0; order < MAX_ORDER; order++) 1902 for (order = 0; order < MAX_ORDER; order++)
1891 printk("%lu*%lukB ", nr[order], K(1UL) << order); 1903 printk("%lu*%lukB ", nr[order], K(1UL) << order);
1892 printk("= %lukB\n", K(total)); 1904 printk("= %lukB\n", K(total));
1893 } 1905 }
1894 1906
1895 printk("%ld total pagecache pages\n", global_page_state(NR_FILE_PAGES)); 1907 printk("%ld total pagecache pages\n", global_page_state(NR_FILE_PAGES));
1896 1908
1897 show_swap_cache_info(); 1909 show_swap_cache_info();
1898 } 1910 }
1899 1911
1900 static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref) 1912 static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
1901 { 1913 {
1902 zoneref->zone = zone; 1914 zoneref->zone = zone;
1903 zoneref->zone_idx = zone_idx(zone); 1915 zoneref->zone_idx = zone_idx(zone);
1904 } 1916 }
1905 1917
1906 /* 1918 /*
1907 * Builds allocation fallback zone lists. 1919 * Builds allocation fallback zone lists.
1908 * 1920 *
1909 * Add all populated zones of a node to the zonelist. 1921 * Add all populated zones of a node to the zonelist.
1910 */ 1922 */
1911 static int build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist, 1923 static int build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist,
1912 int nr_zones, enum zone_type zone_type) 1924 int nr_zones, enum zone_type zone_type)
1913 { 1925 {
1914 struct zone *zone; 1926 struct zone *zone;
1915 1927
1916 BUG_ON(zone_type >= MAX_NR_ZONES); 1928 BUG_ON(zone_type >= MAX_NR_ZONES);
1917 zone_type++; 1929 zone_type++;
1918 1930
1919 do { 1931 do {
1920 zone_type--; 1932 zone_type--;
1921 zone = pgdat->node_zones + zone_type; 1933 zone = pgdat->node_zones + zone_type;
1922 if (populated_zone(zone)) { 1934 if (populated_zone(zone)) {
1923 zoneref_set_zone(zone, 1935 zoneref_set_zone(zone,
1924 &zonelist->_zonerefs[nr_zones++]); 1936 &zonelist->_zonerefs[nr_zones++]);
1925 check_highest_zone(zone_type); 1937 check_highest_zone(zone_type);
1926 } 1938 }
1927 1939
1928 } while (zone_type); 1940 } while (zone_type);
1929 return nr_zones; 1941 return nr_zones;
1930 } 1942 }
1931 1943
1932 1944
1933 /* 1945 /*
1934 * zonelist_order: 1946 * zonelist_order:
1935 * 0 = automatic detection of better ordering. 1947 * 0 = automatic detection of better ordering.
1936 * 1 = order by ([node] distance, -zonetype) 1948 * 1 = order by ([node] distance, -zonetype)
1937 * 2 = order by (-zonetype, [node] distance) 1949 * 2 = order by (-zonetype, [node] distance)
1938 * 1950 *
1939 * If not NUMA, ZONELIST_ORDER_ZONE and ZONELIST_ORDER_NODE will create 1951 * If not NUMA, ZONELIST_ORDER_ZONE and ZONELIST_ORDER_NODE will create
1940 * the same zonelist. So only NUMA can configure this param. 1952 * the same zonelist. So only NUMA can configure this param.
1941 */ 1953 */
1942 #define ZONELIST_ORDER_DEFAULT 0 1954 #define ZONELIST_ORDER_DEFAULT 0
1943 #define ZONELIST_ORDER_NODE 1 1955 #define ZONELIST_ORDER_NODE 1
1944 #define ZONELIST_ORDER_ZONE 2 1956 #define ZONELIST_ORDER_ZONE 2
1945 1957
1946 /* zonelist order in the kernel. 1958 /* zonelist order in the kernel.
1947 * set_zonelist_order() will set this to NODE or ZONE. 1959 * set_zonelist_order() will set this to NODE or ZONE.
1948 */ 1960 */
1949 static int current_zonelist_order = ZONELIST_ORDER_DEFAULT; 1961 static int current_zonelist_order = ZONELIST_ORDER_DEFAULT;
1950 static char zonelist_order_name[3][8] = {"Default", "Node", "Zone"}; 1962 static char zonelist_order_name[3][8] = {"Default", "Node", "Zone"};
1951 1963
1952 1964
1953 #ifdef CONFIG_NUMA 1965 #ifdef CONFIG_NUMA
1954 /* The value the user specified, possibly changed by config */ 1966 /* The value the user specified, possibly changed by config */
1955 static int user_zonelist_order = ZONELIST_ORDER_DEFAULT; 1967 static int user_zonelist_order = ZONELIST_ORDER_DEFAULT;
1956 /* string for sysctl */ 1968 /* string for sysctl */
1957 #define NUMA_ZONELIST_ORDER_LEN 16 1969 #define NUMA_ZONELIST_ORDER_LEN 16
1958 char numa_zonelist_order[16] = "default"; 1970 char numa_zonelist_order[16] = "default";
1959 1971
1960 /* 1972 /*
1961 * interface for configuring zonelist ordering. 1973 * interface for configuring zonelist ordering.
1962 * command line option "numa_zonelist_order" 1974 * command line option "numa_zonelist_order"
1963 * = "[dD]efault" - default, automatic configuration. 1975 * = "[dD]efault" - default, automatic configuration.
1964 * = "[nN]ode" - order by node locality, then by zone within node 1976 * = "[nN]ode" - order by node locality, then by zone within node
1965 * = "[zZ]one" - order by zone, then by locality within zone 1977 * = "[zZ]one" - order by zone, then by locality within zone
1966 */ 1978 */
1967 1979
1968 static int __parse_numa_zonelist_order(char *s) 1980 static int __parse_numa_zonelist_order(char *s)
1969 { 1981 {
1970 if (*s == 'd' || *s == 'D') { 1982 if (*s == 'd' || *s == 'D') {
1971 user_zonelist_order = ZONELIST_ORDER_DEFAULT; 1983 user_zonelist_order = ZONELIST_ORDER_DEFAULT;
1972 } else if (*s == 'n' || *s == 'N') { 1984 } else if (*s == 'n' || *s == 'N') {
1973 user_zonelist_order = ZONELIST_ORDER_NODE; 1985 user_zonelist_order = ZONELIST_ORDER_NODE;
1974 } else if (*s == 'z' || *s == 'Z') { 1986 } else if (*s == 'z' || *s == 'Z') {
1975 user_zonelist_order = ZONELIST_ORDER_ZONE; 1987 user_zonelist_order = ZONELIST_ORDER_ZONE;
1976 } else { 1988 } else {
1977 printk(KERN_WARNING 1989 printk(KERN_WARNING
1978 "Ignoring invalid numa_zonelist_order value: " 1990 "Ignoring invalid numa_zonelist_order value: "
1979 "%s\n", s); 1991 "%s\n", s);
1980 return -EINVAL; 1992 return -EINVAL;
1981 } 1993 }
1982 return 0; 1994 return 0;
1983 } 1995 }
1984 1996
1985 static __init int setup_numa_zonelist_order(char *s) 1997 static __init int setup_numa_zonelist_order(char *s)
1986 { 1998 {
1987 if (s) 1999 if (s)
1988 return __parse_numa_zonelist_order(s); 2000 return __parse_numa_zonelist_order(s);
1989 return 0; 2001 return 0;
1990 } 2002 }
1991 early_param("numa_zonelist_order", setup_numa_zonelist_order); 2003 early_param("numa_zonelist_order", setup_numa_zonelist_order);
1992 2004
1993 /* 2005 /*
1994 * sysctl handler for numa_zonelist_order 2006 * sysctl handler for numa_zonelist_order
1995 */ 2007 */
1996 int numa_zonelist_order_handler(ctl_table *table, int write, 2008 int numa_zonelist_order_handler(ctl_table *table, int write,
1997 struct file *file, void __user *buffer, size_t *length, 2009 struct file *file, void __user *buffer, size_t *length,
1998 loff_t *ppos) 2010 loff_t *ppos)
1999 { 2011 {
2000 char saved_string[NUMA_ZONELIST_ORDER_LEN]; 2012 char saved_string[NUMA_ZONELIST_ORDER_LEN];
2001 int ret; 2013 int ret;
2002 2014
2003 if (write) 2015 if (write)
2004 strncpy(saved_string, (char*)table->data, 2016 strncpy(saved_string, (char*)table->data,
2005 NUMA_ZONELIST_ORDER_LEN); 2017 NUMA_ZONELIST_ORDER_LEN);
2006 ret = proc_dostring(table, write, file, buffer, length, ppos); 2018 ret = proc_dostring(table, write, file, buffer, length, ppos);
2007 if (ret) 2019 if (ret)
2008 return ret; 2020 return ret;
2009 if (write) { 2021 if (write) {
2010 int oldval = user_zonelist_order; 2022 int oldval = user_zonelist_order;
2011 if (__parse_numa_zonelist_order((char*)table->data)) { 2023 if (__parse_numa_zonelist_order((char*)table->data)) {
2012 /* 2024 /*
2013 * bogus value. restore saved string 2025 * bogus value. restore saved string
2014 */ 2026 */
2015 strncpy((char*)table->data, saved_string, 2027 strncpy((char*)table->data, saved_string,
2016 NUMA_ZONELIST_ORDER_LEN); 2028 NUMA_ZONELIST_ORDER_LEN);
2017 user_zonelist_order = oldval; 2029 user_zonelist_order = oldval;
2018 } else if (oldval != user_zonelist_order) 2030 } else if (oldval != user_zonelist_order)
2019 build_all_zonelists(); 2031 build_all_zonelists();
2020 } 2032 }
2021 return 0; 2033 return 0;
2022 } 2034 }
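The parsing above keys only on the first character of the supplied string. As a quick illustration, here is a hedged, userspace-only sketch of that matching (it returns the chosen order directly instead of setting user_zonelist_order and returning 0/-EINVAL, and the numeric values simply mirror ZONELIST_ORDER_DEFAULT/NODE/ZONE):

#include <stdio.h>

/* Simplified stand-in for __parse_numa_zonelist_order(); not kernel code. */
static int parse_order(const char *s)
{
	switch (*s) {
	case 'd': case 'D': return 0;	/* default: auto-detect at boot */
	case 'n': case 'N': return 1;	/* order by node distance */
	case 'z': case 'Z': return 2;	/* order by zone type */
	default:            return -1;	/* rejected; old value kept */
	}
}

int main(void)
{
	const char *inputs[] = { "default", "Node", "zone", "bogus" };

	for (int i = 0; i < 4; i++)
		printf("%-8s -> %d\n", inputs[i], parse_order(inputs[i]));
	return 0;
}

In practice the string reaches the real parser either from the numa_zonelist_order= boot parameter registered by the early_param() above, or through the sysctl handler; the proc path is conventionally /proc/sys/vm/numa_zonelist_order, although that mapping lives in the sysctl table rather than in this file, so treat the exact path as an assumption here.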
2023 2035
2024 2036
2025 #define MAX_NODE_LOAD (num_online_nodes()) 2037 #define MAX_NODE_LOAD (num_online_nodes())
2026 static int node_load[MAX_NUMNODES]; 2038 static int node_load[MAX_NUMNODES];
2027 2039
2028 /** 2040 /**
2029 * find_next_best_node - find the next node that should appear in a given node's fallback list 2041 * find_next_best_node - find the next node that should appear in a given node's fallback list
2030 * @node: node whose fallback list we're appending 2042 * @node: node whose fallback list we're appending
2031 * @used_node_mask: nodemask_t of already used nodes 2043 * @used_node_mask: nodemask_t of already used nodes
2032 * 2044 *
2033 * We use a number of factors to determine which is the next node that should 2045 * We use a number of factors to determine which is the next node that should
2034 * appear on a given node's fallback list. The node should not have appeared 2046 * appear on a given node's fallback list. The node should not have appeared
2035 * already in @node's fallback list, and it should be the next closest node 2047 * already in @node's fallback list, and it should be the next closest node
2036 * according to the distance array (which contains arbitrary distance values 2048 * according to the distance array (which contains arbitrary distance values
2037 * from each node to each node in the system), and we should also prefer nodes 2049 * from each node to each node in the system), and we should also prefer nodes
2038 * with no CPUs, since presumably they'll have very little allocation pressure 2050 * with no CPUs, since presumably they'll have very little allocation pressure
2039 * on them otherwise. 2051 * on them otherwise.
2040 * It returns -1 if no node is found. 2052 * It returns -1 if no node is found.
2041 */ 2053 */
2042 static int find_next_best_node(int node, nodemask_t *used_node_mask) 2054 static int find_next_best_node(int node, nodemask_t *used_node_mask)
2043 { 2055 {
2044 int n, val; 2056 int n, val;
2045 int min_val = INT_MAX; 2057 int min_val = INT_MAX;
2046 int best_node = -1; 2058 int best_node = -1;
2047 node_to_cpumask_ptr(tmp, 0); 2059 node_to_cpumask_ptr(tmp, 0);
2048 2060
2049 /* Use the local node if we haven't already */ 2061 /* Use the local node if we haven't already */
2050 if (!node_isset(node, *used_node_mask)) { 2062 if (!node_isset(node, *used_node_mask)) {
2051 node_set(node, *used_node_mask); 2063 node_set(node, *used_node_mask);
2052 return node; 2064 return node;
2053 } 2065 }
2054 2066
2055 for_each_node_state(n, N_HIGH_MEMORY) { 2067 for_each_node_state(n, N_HIGH_MEMORY) {
2056 2068
2057 /* Don't want a node to appear more than once */ 2069 /* Don't want a node to appear more than once */
2058 if (node_isset(n, *used_node_mask)) 2070 if (node_isset(n, *used_node_mask))
2059 continue; 2071 continue;
2060 2072
2061 /* Use the distance array to find the distance */ 2073 /* Use the distance array to find the distance */
2062 val = node_distance(node, n); 2074 val = node_distance(node, n);
2063 2075
2064 /* Penalize nodes under us ("prefer the next node") */ 2076 /* Penalize nodes under us ("prefer the next node") */
2065 val += (n < node); 2077 val += (n < node);
2066 2078
2067 /* Give preference to headless and unused nodes */ 2079 /* Give preference to headless and unused nodes */
2068 node_to_cpumask_ptr_next(tmp, n); 2080 node_to_cpumask_ptr_next(tmp, n);
2069 if (!cpus_empty(*tmp)) 2081 if (!cpus_empty(*tmp))
2070 val += PENALTY_FOR_NODE_WITH_CPUS; 2082 val += PENALTY_FOR_NODE_WITH_CPUS;
2071 2083
2072 /* Slight preference for less loaded node */ 2084 /* Slight preference for less loaded node */
2073 val *= (MAX_NODE_LOAD*MAX_NUMNODES); 2085 val *= (MAX_NODE_LOAD*MAX_NUMNODES);
2074 val += node_load[n]; 2086 val += node_load[n];
2075 2087
2076 if (val < min_val) { 2088 if (val < min_val) {
2077 min_val = val; 2089 min_val = val;
2078 best_node = n; 2090 best_node = n;
2079 } 2091 }
2080 } 2092 }
2081 2093
2082 if (best_node >= 0) 2094 if (best_node >= 0)
2083 node_set(best_node, *used_node_mask); 2095 node_set(best_node, *used_node_mask);
2084 2096
2085 return best_node; 2097 return best_node;
2086 } 2098 }
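To make the scoring above concrete, the following is a hedged, userspace-only sketch with an invented four-node distance table, CPU placement, and loads (the real routine first hands back the local node if it is unused, walks only N_HIGH_MEMORY nodes, and uses the kernel's PENALTY_FOR_NODE_WITH_CPUS and MAX_NODE_LOAD; the constants here are stand-ins). It keeps the same order of operations: raw distance, the "prefer the next node" nudge, the CPU penalty, then scaling so node_load only breaks ties.

#include <stdio.h>
#include <limits.h>

#define NODES		4
#define CPU_PENALTY	1	/* stand-in for PENALTY_FOR_NODE_WITH_CPUS */
#define MAX_LOAD	NODES	/* stand-in for MAX_NODE_LOAD */

static const int distance[NODES][NODES] = {	/* hypothetical SLIT-style table */
	{ 10, 20, 20, 40 },
	{ 20, 10, 40, 20 },
	{ 20, 40, 10, 20 },
	{ 40, 20, 20, 10 },
};
static const int has_cpus[NODES] = { 1, 1, 0, 0 };	/* nodes 2 and 3 are headless */
static const int load[NODES]     = { 0, 0, 0, 0 };

/* Score each unused candidate the way the kernel routine does; pick the minimum. */
static int next_best(int node, const int *used)
{
	int best = -1, min_val = INT_MAX;

	for (int n = 0; n < NODES; n++) {
		if (used[n])
			continue;
		int val = distance[node][n];
		val += (n < node);			/* prefer the next node */
		if (has_cpus[n])
			val += CPU_PENALTY;		/* prefer headless nodes */
		val *= MAX_LOAD * NODES;		/* make load a tie-break only */
		val += load[n];
		if (val < min_val) {
			min_val = val;
			best = n;
		}
	}
	return best;
}

int main(void)
{
	int used[NODES] = { 0 };

	used[1] = 1;	/* pretend node 1 already appears in its own list */
	printf("fallback after node 1: node %d\n", next_best(1, used));
	return 0;
}

With these numbers, node 3 (distance 20, headless) wins over node 0 (distance 20, but with CPUs), which is exactly the "prefer headless and unused nodes" behaviour the comment describes.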
2087 2099
2088 2100
2089 /* 2101 /*
2090 * Build zonelists ordered by node and zones within node. 2102 * Build zonelists ordered by node and zones within node.
2091 * This results in maximum locality--normal zone overflows into local 2103 * This results in maximum locality--normal zone overflows into local
2092 * DMA zone, if any--but risks exhausting DMA zone. 2104 * DMA zone, if any--but risks exhausting DMA zone.
2093 */ 2105 */
2094 static void build_zonelists_in_node_order(pg_data_t *pgdat, int node) 2106 static void build_zonelists_in_node_order(pg_data_t *pgdat, int node)
2095 { 2107 {
2096 int j; 2108 int j;
2097 struct zonelist *zonelist; 2109 struct zonelist *zonelist;
2098 2110
2099 zonelist = &pgdat->node_zonelists[0]; 2111 zonelist = &pgdat->node_zonelists[0];
2100 for (j = 0; zonelist->_zonerefs[j].zone != NULL; j++) 2112 for (j = 0; zonelist->_zonerefs[j].zone != NULL; j++)
2101 ; 2113 ;
2102 j = build_zonelists_node(NODE_DATA(node), zonelist, j, 2114 j = build_zonelists_node(NODE_DATA(node), zonelist, j,
2103 MAX_NR_ZONES - 1); 2115 MAX_NR_ZONES - 1);
2104 zonelist->_zonerefs[j].zone = NULL; 2116 zonelist->_zonerefs[j].zone = NULL;
2105 zonelist->_zonerefs[j].zone_idx = 0; 2117 zonelist->_zonerefs[j].zone_idx = 0;
2106 } 2118 }
2107 2119
2108 /* 2120 /*
2109 * Build gfp_thisnode zonelists 2121 * Build gfp_thisnode zonelists
2110 */ 2122 */
2111 static void build_thisnode_zonelists(pg_data_t *pgdat) 2123 static void build_thisnode_zonelists(pg_data_t *pgdat)
2112 { 2124 {
2113 int j; 2125 int j;
2114 struct zonelist *zonelist; 2126 struct zonelist *zonelist;
2115 2127
2116 zonelist = &pgdat->node_zonelists[1]; 2128 zonelist = &pgdat->node_zonelists[1];
2117 j = build_zonelists_node(pgdat, zonelist, 0, MAX_NR_ZONES - 1); 2129 j = build_zonelists_node(pgdat, zonelist, 0, MAX_NR_ZONES - 1);
2118 zonelist->_zonerefs[j].zone = NULL; 2130 zonelist->_zonerefs[j].zone = NULL;
2119 zonelist->_zonerefs[j].zone_idx = 0; 2131 zonelist->_zonerefs[j].zone_idx = 0;
2120 } 2132 }
2121 2133
2122 /* 2134 /*
2123 * Build zonelists ordered by zone and nodes within zones. 2135 * Build zonelists ordered by zone and nodes within zones.
2124 * This results in conserving DMA zone[s] until all Normal memory is 2136 * This results in conserving DMA zone[s] until all Normal memory is
2125 * exhausted, but may overflow to a remote node while memory 2137 * exhausted, but may overflow to a remote node while memory
2126 * may still exist in local DMA zone. 2138 * may still exist in local DMA zone.
2127 */ 2139 */
2128 static int node_order[MAX_NUMNODES]; 2140 static int node_order[MAX_NUMNODES];
2129 2141
2130 static void build_zonelists_in_zone_order(pg_data_t *pgdat, int nr_nodes) 2142 static void build_zonelists_in_zone_order(pg_data_t *pgdat, int nr_nodes)
2131 { 2143 {
2132 int pos, j, node; 2144 int pos, j, node;
2133 int zone_type; /* needs to be signed */ 2145 int zone_type; /* needs to be signed */
2134 struct zone *z; 2146 struct zone *z;
2135 struct zonelist *zonelist; 2147 struct zonelist *zonelist;
2136 2148
2137 zonelist = &pgdat->node_zonelists[0]; 2149 zonelist = &pgdat->node_zonelists[0];
2138 pos = 0; 2150 pos = 0;
2139 for (zone_type = MAX_NR_ZONES - 1; zone_type >= 0; zone_type--) { 2151 for (zone_type = MAX_NR_ZONES - 1; zone_type >= 0; zone_type--) {
2140 for (j = 0; j < nr_nodes; j++) { 2152 for (j = 0; j < nr_nodes; j++) {
2141 node = node_order[j]; 2153 node = node_order[j];
2142 z = &NODE_DATA(node)->node_zones[zone_type]; 2154 z = &NODE_DATA(node)->node_zones[zone_type];
2143 if (populated_zone(z)) { 2155 if (populated_zone(z)) {
2144 zoneref_set_zone(z, 2156 zoneref_set_zone(z,
2145 &zonelist->_zonerefs[pos++]); 2157 &zonelist->_zonerefs[pos++]);
2146 check_highest_zone(zone_type); 2158 check_highest_zone(zone_type);
2147 } 2159 }
2148 } 2160 }
2149 } 2161 }
2150 zonelist->_zonerefs[pos].zone = NULL; 2162 zonelist->_zonerefs[pos].zone = NULL;
2151 zonelist->_zonerefs[pos].zone_idx = 0; 2163 zonelist->_zonerefs[pos].zone_idx = 0;
2152 } 2164 }
2153 2165
2154 static int default_zonelist_order(void) 2166 static int default_zonelist_order(void)
2155 { 2167 {
2156 int nid, zone_type; 2168 int nid, zone_type;
2157 unsigned long low_kmem_size, total_size; 2169 unsigned long low_kmem_size, total_size;
2158 struct zone *z; 2170 struct zone *z;
2159 int average_size; 2171 int average_size;
2160 /* 2172 /*
2161 * ZONE_DMA and ZONE_DMA32 can be a very small area in the system. 2173 * ZONE_DMA and ZONE_DMA32 can be a very small area in the system.
2162 * If they are really small and used heavily, the system can fall 2174 * If they are really small and used heavily, the system can fall
2163 * into OOM very easily. 2175 * into OOM very easily.
2164 * This function detects ZONE_DMA/DMA32 size and configures zone order. 2176 * This function detects ZONE_DMA/DMA32 size and configures zone order.
2165 */ 2177 */
2166 /* Is there ZONE_NORMAL? (e.g. ppc has only the DMA zone.) */ 2178 /* Is there ZONE_NORMAL? (e.g. ppc has only the DMA zone.) */
2167 low_kmem_size = 0; 2179 low_kmem_size = 0;
2168 total_size = 0; 2180 total_size = 0;
2169 for_each_online_node(nid) { 2181 for_each_online_node(nid) {
2170 for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++) { 2182 for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++) {
2171 z = &NODE_DATA(nid)->node_zones[zone_type]; 2183 z = &NODE_DATA(nid)->node_zones[zone_type];
2172 if (populated_zone(z)) { 2184 if (populated_zone(z)) {
2173 if (zone_type < ZONE_NORMAL) 2185 if (zone_type < ZONE_NORMAL)
2174 low_kmem_size += z->present_pages; 2186 low_kmem_size += z->present_pages;
2175 total_size += z->present_pages; 2187 total_size += z->present_pages;
2176 } 2188 }
2177 } 2189 }
2178 } 2190 }
2179 if (!low_kmem_size || /* there is no DMA area. */ 2191 if (!low_kmem_size || /* there is no DMA area. */
2180 low_kmem_size > total_size/2) /* DMA/DMA32 is big. */ 2192 low_kmem_size > total_size/2) /* DMA/DMA32 is big. */
2181 return ZONELIST_ORDER_NODE; 2193 return ZONELIST_ORDER_NODE;
2182 /* 2194 /*
2183 * look into each node's config. 2195 * look into each node's config.
2184 * If there is a node whose DMA/DMA32 memory makes up a very large part 2196 * If there is a node whose DMA/DMA32 memory makes up a very large part
2185 * of its local memory, NODE_ORDER may be suitable. 2197 * of its local memory, NODE_ORDER may be suitable.
2186 */ 2198 */
2187 average_size = total_size / 2199 average_size = total_size /
2188 (nodes_weight(node_states[N_HIGH_MEMORY]) + 1); 2200 (nodes_weight(node_states[N_HIGH_MEMORY]) + 1);
2189 for_each_online_node(nid) { 2201 for_each_online_node(nid) {
2190 low_kmem_size = 0; 2202 low_kmem_size = 0;
2191 total_size = 0; 2203 total_size = 0;
2192 for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++) { 2204 for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++) {
2193 z = &NODE_DATA(nid)->node_zones[zone_type]; 2205 z = &NODE_DATA(nid)->node_zones[zone_type];
2194 if (populated_zone(z)) { 2206 if (populated_zone(z)) {
2195 if (zone_type < ZONE_NORMAL) 2207 if (zone_type < ZONE_NORMAL)
2196 low_kmem_size += z->present_pages; 2208 low_kmem_size += z->present_pages;
2197 total_size += z->present_pages; 2209 total_size += z->present_pages;
2198 } 2210 }
2199 } 2211 }
2200 if (low_kmem_size && 2212 if (low_kmem_size &&
2201 total_size > average_size && /* ignore small node */ 2213 total_size > average_size && /* ignore small node */
2202 low_kmem_size > total_size * 70/100) 2214 low_kmem_size > total_size * 70/100)
2203 return ZONELIST_ORDER_NODE; 2215 return ZONELIST_ORDER_NODE;
2204 } 2216 }
2205 return ZONELIST_ORDER_ZONE; 2217 return ZONELIST_ORDER_ZONE;
2206 } 2218 }
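A hedged numeric walk-through of the two checks above, with invented sizes: on a two-node machine where node 0 has 256 MB of DMA32 plus 3.8 GB of Normal memory and node 1 has 4 GB of Normal memory only, the global pass finds low_kmem_size nonzero but far below half of total_size, so node order is not forced there. The per-node pass then sees that node 0's DMA/DMA32 share is roughly 6%, well under the 70% threshold, so the function falls through and returns ZONELIST_ORDER_ZONE. Node order would be picked instead if the machine had no DMA/DMA32 memory at all, if DMA/DMA32 dominated globally, or if some larger-than-average node were made up of more than 70% DMA/DMA32 memory.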
2207 2219
2208 static void set_zonelist_order(void) 2220 static void set_zonelist_order(void)
2209 { 2221 {
2210 if (user_zonelist_order == ZONELIST_ORDER_DEFAULT) 2222 if (user_zonelist_order == ZONELIST_ORDER_DEFAULT)
2211 current_zonelist_order = default_zonelist_order(); 2223 current_zonelist_order = default_zonelist_order();
2212 else 2224 else
2213 current_zonelist_order = user_zonelist_order; 2225 current_zonelist_order = user_zonelist_order;
2214 } 2226 }
2215 2227
2216 static void build_zonelists(pg_data_t *pgdat) 2228 static void build_zonelists(pg_data_t *pgdat)
2217 { 2229 {
2218 int j, node, load; 2230 int j, node, load;
2219 enum zone_type i; 2231 enum zone_type i;
2220 nodemask_t used_mask; 2232 nodemask_t used_mask;
2221 int local_node, prev_node; 2233 int local_node, prev_node;
2222 struct zonelist *zonelist; 2234 struct zonelist *zonelist;
2223 int order = current_zonelist_order; 2235 int order = current_zonelist_order;
2224 2236
2225 /* initialize zonelists */ 2237 /* initialize zonelists */
2226 for (i = 0; i < MAX_ZONELISTS; i++) { 2238 for (i = 0; i < MAX_ZONELISTS; i++) {
2227 zonelist = pgdat->node_zonelists + i; 2239 zonelist = pgdat->node_zonelists + i;
2228 zonelist->_zonerefs[0].zone = NULL; 2240 zonelist->_zonerefs[0].zone = NULL;
2229 zonelist->_zonerefs[0].zone_idx = 0; 2241 zonelist->_zonerefs[0].zone_idx = 0;
2230 } 2242 }
2231 2243
2232 /* NUMA-aware ordering of nodes */ 2244 /* NUMA-aware ordering of nodes */
2233 local_node = pgdat->node_id; 2245 local_node = pgdat->node_id;
2234 load = num_online_nodes(); 2246 load = num_online_nodes();
2235 prev_node = local_node; 2247 prev_node = local_node;
2236 nodes_clear(used_mask); 2248 nodes_clear(used_mask);
2237 2249
2238 memset(node_load, 0, sizeof(node_load)); 2250 memset(node_load, 0, sizeof(node_load));
2239 memset(node_order, 0, sizeof(node_order)); 2251 memset(node_order, 0, sizeof(node_order));
2240 j = 0; 2252 j = 0;
2241 2253
2242 while ((node = find_next_best_node(local_node, &used_mask)) >= 0) { 2254 while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
2243 int distance = node_distance(local_node, node); 2255 int distance = node_distance(local_node, node);
2244 2256
2245 /* 2257 /*
2246 * If another node is sufficiently far away then it is better 2258 * If another node is sufficiently far away then it is better
2247 * to reclaim pages in a zone before going off node. 2259 * to reclaim pages in a zone before going off node.
2248 */ 2260 */
2249 if (distance > RECLAIM_DISTANCE) 2261 if (distance > RECLAIM_DISTANCE)
2250 zone_reclaim_mode = 1; 2262 zone_reclaim_mode = 1;
2251 2263
2252 /* 2264 /*
2253 * We don't want to pressure a particular node. 2265 * We don't want to pressure a particular node.
2254 * So we add a penalty to the first node in the same 2266 * So we add a penalty to the first node in the same
2255 * distance group to make it round-robin. 2267 * distance group to make it round-robin.
2256 */ 2268 */
2257 if (distance != node_distance(local_node, prev_node)) 2269 if (distance != node_distance(local_node, prev_node))
2258 node_load[node] = load; 2270 node_load[node] = load;
2259 2271
2260 prev_node = node; 2272 prev_node = node;
2261 load--; 2273 load--;
2262 if (order == ZONELIST_ORDER_NODE) 2274 if (order == ZONELIST_ORDER_NODE)
2263 build_zonelists_in_node_order(pgdat, node); 2275 build_zonelists_in_node_order(pgdat, node);
2264 else 2276 else
2265 node_order[j++] = node; /* remember order */ 2277 node_order[j++] = node; /* remember order */
2266 } 2278 }
2267 2279
2268 if (order == ZONELIST_ORDER_ZONE) { 2280 if (order == ZONELIST_ORDER_ZONE) {
2269 /* calculate node order -- i.e., DMA last! */ 2281 /* calculate node order -- i.e., DMA last! */
2270 build_zonelists_in_zone_order(pgdat, j); 2282 build_zonelists_in_zone_order(pgdat, j);
2271 } 2283 }
2272 2284
2273 build_thisnode_zonelists(pgdat); 2285 build_thisnode_zonelists(pgdat);
2274 } 2286 }
2275 2287
2276 /* Construct the zonelist performance cache - see further mmzone.h */ 2288 /* Construct the zonelist performance cache - see further mmzone.h */
2277 static void build_zonelist_cache(pg_data_t *pgdat) 2289 static void build_zonelist_cache(pg_data_t *pgdat)
2278 { 2290 {
2279 struct zonelist *zonelist; 2291 struct zonelist *zonelist;
2280 struct zonelist_cache *zlc; 2292 struct zonelist_cache *zlc;
2281 struct zoneref *z; 2293 struct zoneref *z;
2282 2294
2283 zonelist = &pgdat->node_zonelists[0]; 2295 zonelist = &pgdat->node_zonelists[0];
2284 zonelist->zlcache_ptr = zlc = &zonelist->zlcache; 2296 zonelist->zlcache_ptr = zlc = &zonelist->zlcache;
2285 bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST); 2297 bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST);
2286 for (z = zonelist->_zonerefs; z->zone; z++) 2298 for (z = zonelist->_zonerefs; z->zone; z++)
2287 zlc->z_to_n[z - zonelist->_zonerefs] = zonelist_node_idx(z); 2299 zlc->z_to_n[z - zonelist->_zonerefs] = zonelist_node_idx(z);
2288 } 2300 }
2289 2301
2290 2302
2291 #else /* CONFIG_NUMA */ 2303 #else /* CONFIG_NUMA */
2292 2304
2293 static void set_zonelist_order(void) 2305 static void set_zonelist_order(void)
2294 { 2306 {
2295 current_zonelist_order = ZONELIST_ORDER_ZONE; 2307 current_zonelist_order = ZONELIST_ORDER_ZONE;
2296 } 2308 }
2297 2309
2298 static void build_zonelists(pg_data_t *pgdat) 2310 static void build_zonelists(pg_data_t *pgdat)
2299 { 2311 {
2300 int node, local_node; 2312 int node, local_node;
2301 enum zone_type j; 2313 enum zone_type j;
2302 struct zonelist *zonelist; 2314 struct zonelist *zonelist;
2303 2315
2304 local_node = pgdat->node_id; 2316 local_node = pgdat->node_id;
2305 2317
2306 zonelist = &pgdat->node_zonelists[0]; 2318 zonelist = &pgdat->node_zonelists[0];
2307 j = build_zonelists_node(pgdat, zonelist, 0, MAX_NR_ZONES - 1); 2319 j = build_zonelists_node(pgdat, zonelist, 0, MAX_NR_ZONES - 1);
2308 2320
2309 /* 2321 /*
2310 * Now we build the zonelist so that it contains the zones 2322 * Now we build the zonelist so that it contains the zones
2311 * of all the other nodes. 2323 * of all the other nodes.
2312 * We don't want to pressure a particular node, so when 2324 * We don't want to pressure a particular node, so when
2313 * building the zones for node N, we make sure that the 2325 * building the zones for node N, we make sure that the
2314 * zones coming right after the local ones are those from 2326 * zones coming right after the local ones are those from
2315 * node N+1 (modulo N) 2327 * node N+1 (modulo N)
2316 */ 2328 */
2317 for (node = local_node + 1; node < MAX_NUMNODES; node++) { 2329 for (node = local_node + 1; node < MAX_NUMNODES; node++) {
2318 if (!node_online(node)) 2330 if (!node_online(node))
2319 continue; 2331 continue;
2320 j = build_zonelists_node(NODE_DATA(node), zonelist, j, 2332 j = build_zonelists_node(NODE_DATA(node), zonelist, j,
2321 MAX_NR_ZONES - 1); 2333 MAX_NR_ZONES - 1);
2322 } 2334 }
2323 for (node = 0; node < local_node; node++) { 2335 for (node = 0; node < local_node; node++) {
2324 if (!node_online(node)) 2336 if (!node_online(node))
2325 continue; 2337 continue;
2326 j = build_zonelists_node(NODE_DATA(node), zonelist, j, 2338 j = build_zonelists_node(NODE_DATA(node), zonelist, j,
2327 MAX_NR_ZONES - 1); 2339 MAX_NR_ZONES - 1);
2328 } 2340 }
2329 2341
2330 zonelist->_zonerefs[j].zone = NULL; 2342 zonelist->_zonerefs[j].zone = NULL;
2331 zonelist->_zonerefs[j].zone_idx = 0; 2343 zonelist->_zonerefs[j].zone_idx = 0;
2332 } 2344 }
2333 2345
2334 /* non-NUMA variant of zonelist performance cache - just NULL zlcache_ptr */ 2346 /* non-NUMA variant of zonelist performance cache - just NULL zlcache_ptr */
2335 static void build_zonelist_cache(pg_data_t *pgdat) 2347 static void build_zonelist_cache(pg_data_t *pgdat)
2336 { 2348 {
2337 pgdat->node_zonelists[0].zlcache_ptr = NULL; 2349 pgdat->node_zonelists[0].zlcache_ptr = NULL;
2338 pgdat->node_zonelists[1].zlcache_ptr = NULL; 2350 pgdat->node_zonelists[1].zlcache_ptr = NULL;
2339 } 2351 }
2340 2352
2341 #endif /* CONFIG_NUMA */ 2353 #endif /* CONFIG_NUMA */
2342 2354
2343 /* return value is int just for stop_machine_run() */ 2355 /* return value is int just for stop_machine_run() */
2344 static int __build_all_zonelists(void *dummy) 2356 static int __build_all_zonelists(void *dummy)
2345 { 2357 {
2346 int nid; 2358 int nid;
2347 2359
2348 for_each_online_node(nid) { 2360 for_each_online_node(nid) {
2349 pg_data_t *pgdat = NODE_DATA(nid); 2361 pg_data_t *pgdat = NODE_DATA(nid);
2350 2362
2351 build_zonelists(pgdat); 2363 build_zonelists(pgdat);
2352 build_zonelist_cache(pgdat); 2364 build_zonelist_cache(pgdat);
2353 } 2365 }
2354 return 0; 2366 return 0;
2355 } 2367 }
2356 2368
2357 void build_all_zonelists(void) 2369 void build_all_zonelists(void)
2358 { 2370 {
2359 set_zonelist_order(); 2371 set_zonelist_order();
2360 2372
2361 if (system_state == SYSTEM_BOOTING) { 2373 if (system_state == SYSTEM_BOOTING) {
2362 __build_all_zonelists(NULL); 2374 __build_all_zonelists(NULL);
2363 cpuset_init_current_mems_allowed(); 2375 cpuset_init_current_mems_allowed();
2364 } else { 2376 } else {
2365 /* we have to stop all cpus to guarantee there is no user 2377 /* we have to stop all cpus to guarantee there is no user
2366 of zonelist */ 2378 of zonelist */
2367 stop_machine_run(__build_all_zonelists, NULL, NR_CPUS); 2379 stop_machine_run(__build_all_zonelists, NULL, NR_CPUS);
2368 /* cpuset refresh routine should be here */ 2380 /* cpuset refresh routine should be here */
2369 } 2381 }
2370 vm_total_pages = nr_free_pagecache_pages(); 2382 vm_total_pages = nr_free_pagecache_pages();
2371 /* 2383 /*
2372 * Disable grouping by mobility if the number of pages in the 2384 * Disable grouping by mobility if the number of pages in the
2373 * system is too low to allow the mechanism to work. It would be 2385 * system is too low to allow the mechanism to work. It would be
2374 * more accurate, but expensive to check per-zone. This check is 2386 * more accurate, but expensive to check per-zone. This check is
2375 * made on memory-hotadd so a system can start with mobility 2387 * made on memory-hotadd so a system can start with mobility
2376 * disabled and enable it later 2388 * disabled and enable it later
2377 */ 2389 */
2378 if (vm_total_pages < (pageblock_nr_pages * MIGRATE_TYPES)) 2390 if (vm_total_pages < (pageblock_nr_pages * MIGRATE_TYPES))
2379 page_group_by_mobility_disabled = 1; 2391 page_group_by_mobility_disabled = 1;
2380 else 2392 else
2381 page_group_by_mobility_disabled = 0; 2393 page_group_by_mobility_disabled = 0;
2382 2394
2383 printk("Built %i zonelists in %s order, mobility grouping %s. " 2395 printk("Built %i zonelists in %s order, mobility grouping %s. "
2384 "Total pages: %ld\n", 2396 "Total pages: %ld\n",
2385 num_online_nodes(), 2397 num_online_nodes(),
2386 zonelist_order_name[current_zonelist_order], 2398 zonelist_order_name[current_zonelist_order],
2387 page_group_by_mobility_disabled ? "off" : "on", 2399 page_group_by_mobility_disabled ? "off" : "on",
2388 vm_total_pages); 2400 vm_total_pages);
2389 #ifdef CONFIG_NUMA 2401 #ifdef CONFIG_NUMA
2390 printk("Policy zone: %s\n", zone_names[policy_zone]); 2402 printk("Policy zone: %s\n", zone_names[policy_zone]);
2391 #endif 2403 #endif
2392 } 2404 }
2393 2405
2394 /* 2406 /*
2395 * Helper functions to size the waitqueue hash table. 2407 * Helper functions to size the waitqueue hash table.
2396 * Essentially these want to choose hash table sizes sufficiently 2408 * Essentially these want to choose hash table sizes sufficiently
2397 * large so that collisions trying to wait on pages are rare. 2409 * large so that collisions trying to wait on pages are rare.
2398 * But in fact, the number of active page waitqueues on typical 2410 * But in fact, the number of active page waitqueues on typical
2399 * systems is ridiculously low, less than 200. So this is even 2411 * systems is ridiculously low, less than 200. So this is even
2400 * conservative, even though it seems large. 2412 * conservative, even though it seems large.
2401 * 2413 *
2402 * The constant PAGES_PER_WAITQUEUE specifies the ratio of pages to 2414 * The constant PAGES_PER_WAITQUEUE specifies the ratio of pages to
2403 * waitqueues, i.e. the size of the waitq table given the number of pages. 2415 * waitqueues, i.e. the size of the waitq table given the number of pages.
2404 */ 2416 */
2405 #define PAGES_PER_WAITQUEUE 256 2417 #define PAGES_PER_WAITQUEUE 256
2406 2418
2407 #ifndef CONFIG_MEMORY_HOTPLUG 2419 #ifndef CONFIG_MEMORY_HOTPLUG
2408 static inline unsigned long wait_table_hash_nr_entries(unsigned long pages) 2420 static inline unsigned long wait_table_hash_nr_entries(unsigned long pages)
2409 { 2421 {
2410 unsigned long size = 1; 2422 unsigned long size = 1;
2411 2423
2412 pages /= PAGES_PER_WAITQUEUE; 2424 pages /= PAGES_PER_WAITQUEUE;
2413 2425
2414 while (size < pages) 2426 while (size < pages)
2415 size <<= 1; 2427 size <<= 1;
2416 2428
2417 /* 2429 /*
2418 * Once we have dozens or even hundreds of threads sleeping 2430 * Once we have dozens or even hundreds of threads sleeping
2419 * on IO we've got bigger problems than wait queue collision. 2431 * on IO we've got bigger problems than wait queue collision.
2420 * Limit the size of the wait table to a reasonable size. 2432 * Limit the size of the wait table to a reasonable size.
2421 */ 2433 */
2422 size = min(size, 4096UL); 2434 size = min(size, 4096UL);
2423 2435
2424 return max(size, 4UL); 2436 return max(size, 4UL);
2425 } 2437 }
2426 #else 2438 #else
2427 /* 2439 /*
2428 * A zone's size might be changed by hot-add, so it is not possible to determine 2440 * A zone's size might be changed by hot-add, so it is not possible to determine
2429 * a suitable size for its wait_table. So we use the maximum size now. 2441 * a suitable size for its wait_table. So we use the maximum size now.
2430 * 2442 *
2431 * The max wait table size = 4096 x sizeof(wait_queue_head_t), i.e.: 2443 * The max wait table size = 4096 x sizeof(wait_queue_head_t), i.e.:
2432 * 2444 *
2433 * i386 (preemption config) : 4096 x 16 = 64Kbyte. 2445 * i386 (preemption config) : 4096 x 16 = 64Kbyte.
2434 * ia64, x86-64 (no preemption): 4096 x 20 = 80Kbyte. 2446 * ia64, x86-64 (no preemption): 4096 x 20 = 80Kbyte.
2435 * ia64, x86-64 (preemption) : 4096 x 24 = 96Kbyte. 2447 * ia64, x86-64 (preemption) : 4096 x 24 = 96Kbyte.
2436 * 2448 *
2437 * The maximum number of entries is used when a zone's memory is (512K + 256) 2449 * The maximum number of entries is used when a zone's memory is (512K + 256)
2438 * pages or more, computed the traditional way (see above). That equals: 2450 * pages or more, computed the traditional way (see above). That equals:
2439 * 2451 *
2440 * i386, x86-64, powerpc(4K page size) : = ( 2G + 1M)byte. 2452 * i386, x86-64, powerpc(4K page size) : = ( 2G + 1M)byte.
2441 * ia64(16K page size) : = ( 8G + 4M)byte. 2453 * ia64(16K page size) : = ( 8G + 4M)byte.
2442 * powerpc (64K page size) : = (32G +16M)byte. 2454 * powerpc (64K page size) : = (32G +16M)byte.
2443 */ 2455 */
2444 static inline unsigned long wait_table_hash_nr_entries(unsigned long pages) 2456 static inline unsigned long wait_table_hash_nr_entries(unsigned long pages)
2445 { 2457 {
2446 return 4096UL; 2458 return 4096UL;
2447 } 2459 }
2448 #endif 2460 #endif
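To put numbers on the sizing comments above, here is a small, hedged userspace sketch of the non-hotplug computation (the page counts below assume 4 KiB pages and are purely illustrative; PAGES_PER_WAITQUEUE and the 4/4096 clamps are copied from the code above):

#include <stdio.h>

#define PAGES_PER_WAITQUEUE 256

/* Mirror of the boot-time sizing: scale by PAGES_PER_WAITQUEUE,
 * round up to a power of two, clamp to the range [4, 4096]. */
static unsigned long wait_table_entries(unsigned long pages)
{
	unsigned long size = 1;

	pages /= PAGES_PER_WAITQUEUE;
	while (size < pages)
		size <<= 1;
	if (size > 4096UL)
		size = 4096UL;
	return size < 4UL ? 4UL : size;
}

int main(void)
{
	/* 1 GiB, 8 GiB and a tiny 1 MiB zone, assuming 4 KiB pages */
	unsigned long zones[] = { 262144UL, 2097152UL, 256UL };

	for (int i = 0; i < 3; i++)
		printf("%8lu pages -> %4lu waitqueue entries\n",
		       zones[i], wait_table_entries(zones[i]));
	return 0;
}

For a 1 GiB zone this gives 262144 / 256 = 1024 entries (already a power of two), an 8 GiB zone hits the 4096 cap, and a tiny zone is floored at 4; the cap is first reached at the (512K + 256)-page point mentioned in the hotplug comment, since 512K pages alone still rounds to only 2048.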
2449 2461
2450 /* 2462 /*
2451 * This is an integer logarithm so that shifts can be used later 2463 * This is an integer logarithm so that shifts can be used later
2452 * to extract the more random high bits from the multiplicative 2464 * to extract the more random high bits from the multiplicative
2453 * hash function before the remainder is taken. 2465 * hash function before the remainder is taken.
2454 */ 2466 */
2455 static inline unsigned long wait_table_bits(unsigned long size) 2467 static inline unsigned long wait_table_bits(unsigned long size)
2456 { 2468 {
2457 return ffz(~size); 2469 return ffz(~size);
2458 } 2470 }
2459 2471
2460 #define LONG_ALIGN(x) (((x)+(sizeof(long))-1)&~((sizeof(long))-1)) 2472 #define LONG_ALIGN(x) (((x)+(sizeof(long))-1)&~((sizeof(long))-1))
2461 2473
2462 /* 2474 /*
2463 * Mark a number of pageblocks as MIGRATE_RESERVE. The number 2475 * Mark a number of pageblocks as MIGRATE_RESERVE. The number
2464 * of blocks reserved is based on zone->pages_min. The memory within the 2476 * of blocks reserved is based on zone->pages_min. The memory within the
2465 * reserve will tend to store contiguous free pages. Setting min_free_kbytes 2477 * reserve will tend to store contiguous free pages. Setting min_free_kbytes
2466 * higher will lead to a bigger reserve which will get freed as contiguous 2478 * higher will lead to a bigger reserve which will get freed as contiguous
2467 * blocks as reclaim kicks in 2479 * blocks as reclaim kicks in
2468 */ 2480 */
2469 static void setup_zone_migrate_reserve(struct zone *zone) 2481 static void setup_zone_migrate_reserve(struct zone *zone)
2470 { 2482 {
2471 unsigned long start_pfn, pfn, end_pfn; 2483 unsigned long start_pfn, pfn, end_pfn;
2472 struct page *page; 2484 struct page *page;
2473 unsigned long reserve, block_migratetype; 2485 unsigned long reserve, block_migratetype;
2474 2486
2475 /* Get the start pfn, end pfn and the number of blocks to reserve */ 2487 /* Get the start pfn, end pfn and the number of blocks to reserve */
2476 start_pfn = zone->zone_start_pfn; 2488 start_pfn = zone->zone_start_pfn;
2477 end_pfn = start_pfn + zone->spanned_pages; 2489 end_pfn = start_pfn + zone->spanned_pages;
2478 reserve = roundup(zone->pages_min, pageblock_nr_pages) >> 2490 reserve = roundup(zone->pages_min, pageblock_nr_pages) >>
2479 pageblock_order; 2491 pageblock_order;
2480 2492
2481 for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) { 2493 for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
2482 if (!pfn_valid(pfn)) 2494 if (!pfn_valid(pfn))
2483 continue; 2495 continue;
2484 page = pfn_to_page(pfn); 2496 page = pfn_to_page(pfn);
2485 2497
2486 /* Blocks with reserved pages will never become free, skip them. */ 2498 /* Blocks with reserved pages will never become free, skip them. */
2487 if (PageReserved(page)) 2499 if (PageReserved(page))
2488 continue; 2500 continue;
2489 2501
2490 block_migratetype = get_pageblock_migratetype(page); 2502 block_migratetype = get_pageblock_migratetype(page);
2491 2503
2492 /* If this block is reserved, account for it */ 2504 /* If this block is reserved, account for it */
2493 if (reserve > 0 && block_migratetype == MIGRATE_RESERVE) { 2505 if (reserve > 0 && block_migratetype == MIGRATE_RESERVE) {
2494 reserve--; 2506 reserve--;
2495 continue; 2507 continue;
2496 } 2508 }
2497 2509
2498 /* Suitable for reserving if this block is movable */ 2510 /* Suitable for reserving if this block is movable */
2499 if (reserve > 0 && block_migratetype == MIGRATE_MOVABLE) { 2511 if (reserve > 0 && block_migratetype == MIGRATE_MOVABLE) {
2500 set_pageblock_migratetype(page, MIGRATE_RESERVE); 2512 set_pageblock_migratetype(page, MIGRATE_RESERVE);
2501 move_freepages_block(zone, page, MIGRATE_RESERVE); 2513 move_freepages_block(zone, page, MIGRATE_RESERVE);
2502 reserve--; 2514 reserve--;
2503 continue; 2515 continue;
2504 } 2516 }
2505 2517
2506 /* 2518 /*
2507 * If the reserve is met and this is a previously reserved block, 2519 * If the reserve is met and this is a previously reserved block,
2508 * take it back 2520 * take it back
2509 */ 2521 */
2510 if (block_migratetype == MIGRATE_RESERVE) { 2522 if (block_migratetype == MIGRATE_RESERVE) {
2511 set_pageblock_migratetype(page, MIGRATE_MOVABLE); 2523 set_pageblock_migratetype(page, MIGRATE_MOVABLE);
2512 move_freepages_block(zone, page, MIGRATE_MOVABLE); 2524 move_freepages_block(zone, page, MIGRATE_MOVABLE);
2513 } 2525 }
2514 } 2526 }
2515 } 2527 }
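A hedged example of the reserve sizing above (the pageblock_order of 10, i.e. 1024 pages per block, and the pages_min values are invented; the real values depend on the architecture and on min_free_kbytes):

#include <stdio.h>

#define PAGEBLOCK_ORDER		10			/* hypothetical */
#define PAGEBLOCK_NR_PAGES	(1UL << PAGEBLOCK_ORDER)
#define ROUNDUP(x, y)		((((x) + (y) - 1) / (y)) * (y))

int main(void)
{
	unsigned long pages_min[] = { 1000UL, 5000UL, 100UL };

	for (int i = 0; i < 3; i++) {
		unsigned long reserve =
			ROUNDUP(pages_min[i], PAGEBLOCK_NR_PAGES) >> PAGEBLOCK_ORDER;
		printf("pages_min=%5lu -> %lu reserved pageblock(s)\n",
		       pages_min[i], reserve);
	}
	return 0;
}

So a zone with pages_min around 1000 pages sets aside a single MIGRATE_RESERVE pageblock, and raising min_free_kbytes grows the reserve one whole block at a time.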
2516 2528
2517 /* 2529 /*
2518 * Initially all pages are reserved - free ones are freed 2530 * Initially all pages are reserved - free ones are freed
2519 * up by free_all_bootmem() once the early boot process is 2531 * up by free_all_bootmem() once the early boot process is
2520 * done. Non-atomic initialization, single-pass. 2532 * done. Non-atomic initialization, single-pass.
2521 */ 2533 */
2522 void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, 2534 void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
2523 unsigned long start_pfn, enum memmap_context context) 2535 unsigned long start_pfn, enum memmap_context context)
2524 { 2536 {
2525 struct page *page; 2537 struct page *page;
2526 unsigned long end_pfn = start_pfn + size; 2538 unsigned long end_pfn = start_pfn + size;
2527 unsigned long pfn; 2539 unsigned long pfn;
2528 struct zone *z; 2540 struct zone *z;
2529 2541
2530 z = &NODE_DATA(nid)->node_zones[zone]; 2542 z = &NODE_DATA(nid)->node_zones[zone];
2531 for (pfn = start_pfn; pfn < end_pfn; pfn++) { 2543 for (pfn = start_pfn; pfn < end_pfn; pfn++) {
2532 /* 2544 /*
2533 * There can be holes in boot-time mem_map[]s 2545 * There can be holes in boot-time mem_map[]s
2534 * handed to this function. They do not 2546 * handed to this function. They do not
2535 * exist on hotplugged memory. 2547 * exist on hotplugged memory.
2536 */ 2548 */
2537 if (context == MEMMAP_EARLY) { 2549 if (context == MEMMAP_EARLY) {
2538 if (!early_pfn_valid(pfn)) 2550 if (!early_pfn_valid(pfn))
2539 continue; 2551 continue;
2540 if (!early_pfn_in_nid(pfn, nid)) 2552 if (!early_pfn_in_nid(pfn, nid))
2541 continue; 2553 continue;
2542 } 2554 }
2543 page = pfn_to_page(pfn); 2555 page = pfn_to_page(pfn);
2544 set_page_links(page, zone, nid, pfn); 2556 set_page_links(page, zone, nid, pfn);
2545 init_page_count(page); 2557 init_page_count(page);
2546 reset_page_mapcount(page); 2558 reset_page_mapcount(page);
2547 SetPageReserved(page); 2559 SetPageReserved(page);
2548 /* 2560 /*
2549 * Mark the block movable so that blocks are reserved for 2561 * Mark the block movable so that blocks are reserved for
2550 * movable at startup. This will force kernel allocations 2562 * movable at startup. This will force kernel allocations
2551 * to reserve their blocks rather than leaking throughout 2563 * to reserve their blocks rather than leaking throughout
2552 * the address space during boot when many long-lived 2564 * the address space during boot when many long-lived
2553 * kernel allocations are made. Later some blocks near 2565 * kernel allocations are made. Later some blocks near
2554 * the start are marked MIGRATE_RESERVE by 2566 * the start are marked MIGRATE_RESERVE by
2555 * setup_zone_migrate_reserve() 2567 * setup_zone_migrate_reserve()
2556 * 2568 *
2557 * The bitmap is created for the zone's valid pfn range, but the memmap 2569 * The bitmap is created for the zone's valid pfn range, but the memmap
2558 * can be created for invalid pages (for alignment). 2570 * can be created for invalid pages (for alignment).
2559 * Check here so we do not call set_pageblock_migratetype() against 2571 * Check here so we do not call set_pageblock_migratetype() against
2560 * a pfn outside the zone. 2572 * a pfn outside the zone.
2561 */ 2573 */
2562 if ((z->zone_start_pfn <= pfn) 2574 if ((z->zone_start_pfn <= pfn)
2563 && (pfn < z->zone_start_pfn + z->spanned_pages) 2575 && (pfn < z->zone_start_pfn + z->spanned_pages)
2564 && !(pfn & (pageblock_nr_pages - 1))) 2576 && !(pfn & (pageblock_nr_pages - 1)))
2565 set_pageblock_migratetype(page, MIGRATE_MOVABLE); 2577 set_pageblock_migratetype(page, MIGRATE_MOVABLE);
2566 2578
2567 INIT_LIST_HEAD(&page->lru); 2579 INIT_LIST_HEAD(&page->lru);
2568 #ifdef WANT_PAGE_VIRTUAL 2580 #ifdef WANT_PAGE_VIRTUAL
2569 /* The shift won't overflow because ZONE_NORMAL is below 4G. */ 2581 /* The shift won't overflow because ZONE_NORMAL is below 4G. */
2570 if (!is_highmem_idx(zone)) 2582 if (!is_highmem_idx(zone))
2571 set_page_address(page, __va(pfn << PAGE_SHIFT)); 2583 set_page_address(page, __va(pfn << PAGE_SHIFT));
2572 #endif 2584 #endif
2573 } 2585 }
2574 } 2586 }
2575 2587
2576 static void __meminit zone_init_free_lists(struct zone *zone) 2588 static void __meminit zone_init_free_lists(struct zone *zone)
2577 { 2589 {
2578 int order, t; 2590 int order, t;
2579 for_each_migratetype_order(order, t) { 2591 for_each_migratetype_order(order, t) {
2580 INIT_LIST_HEAD(&zone->free_area[order].free_list[t]); 2592 INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
2581 zone->free_area[order].nr_free = 0; 2593 zone->free_area[order].nr_free = 0;
2582 } 2594 }
2583 } 2595 }
2584 2596
2585 #ifndef __HAVE_ARCH_MEMMAP_INIT 2597 #ifndef __HAVE_ARCH_MEMMAP_INIT
2586 #define memmap_init(size, nid, zone, start_pfn) \ 2598 #define memmap_init(size, nid, zone, start_pfn) \
2587 memmap_init_zone((size), (nid), (zone), (start_pfn), MEMMAP_EARLY) 2599 memmap_init_zone((size), (nid), (zone), (start_pfn), MEMMAP_EARLY)
2588 #endif 2600 #endif
2589 2601
2590 static int zone_batchsize(struct zone *zone) 2602 static int zone_batchsize(struct zone *zone)
2591 { 2603 {
2592 int batch; 2604 int batch;
2593 2605
2594 /* 2606 /*
2595 * The per-cpu-pages pools are set to around 1/1000th of the 2607 * The per-cpu-pages pools are set to around 1/1000th of the
2596 * size of the zone. But no more than 1/2 of a meg. 2608 * size of the zone. But no more than 1/2 of a meg.
2597 * 2609 *
2598 * OK, so we don't know how big the cache is. So guess. 2610 * OK, so we don't know how big the cache is. So guess.
2599 */ 2611 */
2600 batch = zone->present_pages / 1024; 2612 batch = zone->present_pages / 1024;
2601 if (batch * PAGE_SIZE > 512 * 1024) 2613 if (batch * PAGE_SIZE > 512 * 1024)
2602 batch = (512 * 1024) / PAGE_SIZE; 2614 batch = (512 * 1024) / PAGE_SIZE;
2603 batch /= 4; /* We effectively *= 4 below */ 2615 batch /= 4; /* We effectively *= 4 below */
2604 if (batch < 1) 2616 if (batch < 1)
2605 batch = 1; 2617 batch = 1;
2606 2618
2607 /* 2619 /*
2608 * Clamp the batch to a 2^n - 1 value. Having a power 2620 * Clamp the batch to a 2^n - 1 value. Having a power
2609 * of 2 value was found to be more likely to have 2621 * of 2 value was found to be more likely to have
2610 * suboptimal cache aliasing properties in some cases. 2622 * suboptimal cache aliasing properties in some cases.
2611 * 2623 *
2612 * For example if 2 tasks are alternately allocating 2624 * For example if 2 tasks are alternately allocating
2613 * batches of pages, one task can end up with a lot 2625 * batches of pages, one task can end up with a lot
2614 * of pages of one half of the possible page colors 2626 * of pages of one half of the possible page colors
2615 * and the other with pages of the other colors. 2627 * and the other with pages of the other colors.
2616 */ 2628 */
2617 batch = (1 << (fls(batch + batch/2)-1)) - 1; 2629 batch = (1 << (fls(batch + batch/2)-1)) - 1;
2618 2630
2619 return batch; 2631 return batch;
2620 } 2632 }
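The batch computation above is easiest to follow with numbers. Below is a hedged, self-contained rendering of it (PAGE_SIZE and the zone sizes are assumptions for the example, and fls() is open-coded since this runs outside the kernel): for a 1 GiB zone of 4 KiB pages, 262144 / 1024 = 256 is capped by the 512 KiB rule to 128, quartered to 32, then clamped to the 2^n - 1 value 31, so setup_pageset() above would set pcp->high to 6 * 31 = 186.

#include <stdio.h>

#define PAGE_SIZE 4096UL	/* assumption for the example */

static int fls_open_coded(unsigned long x)	/* highest set bit, 1-based */
{
	int bit = 0;

	while (x) {
		bit++;
		x >>= 1;
	}
	return bit;
}

/* Mirror of zone_batchsize(): ~1/1024 of the zone, at most 512 KiB worth,
 * quartered, then clamped to a 2^n - 1 value. */
static int batch_for(unsigned long present_pages)
{
	int batch = present_pages / 1024;

	if (batch * PAGE_SIZE > 512 * 1024)
		batch = (512 * 1024) / PAGE_SIZE;
	batch /= 4;
	if (batch < 1)
		batch = 1;
	return (1 << (fls_open_coded(batch + batch / 2) - 1)) - 1;
}

int main(void)
{
	unsigned long zones[] = { 262144UL, 32768UL, 16384UL };	/* 1 GiB, 128 MiB, 64 MiB */

	for (int i = 0; i < 3; i++) {
		int batch = batch_for(zones[i]);

		printf("%8lu pages -> batch %2d, pcp high %3d\n",
		       zones[i], batch, 6 * batch);
	}
	return 0;
}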
2621 2633
2622 inline void setup_pageset(struct per_cpu_pageset *p, unsigned long batch) 2634 inline void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
2623 { 2635 {
2624 struct per_cpu_pages *pcp; 2636 struct per_cpu_pages *pcp;
2625 2637
2626 memset(p, 0, sizeof(*p)); 2638 memset(p, 0, sizeof(*p));
2627 2639
2628 pcp = &p->pcp; 2640 pcp = &p->pcp;
2629 pcp->count = 0; 2641 pcp->count = 0;
2630 pcp->high = 6 * batch; 2642 pcp->high = 6 * batch;
2631 pcp->batch = max(1UL, 1 * batch); 2643 pcp->batch = max(1UL, 1 * batch);
2632 INIT_LIST_HEAD(&pcp->list); 2644 INIT_LIST_HEAD(&pcp->list);
2633 } 2645 }
2634 2646
2635 /* 2647 /*
2636 * setup_pagelist_highmark() sets the high water mark for hot per_cpu_pagelist 2648 * setup_pagelist_highmark() sets the high water mark for hot per_cpu_pagelist
2637 * to the value high for the pageset p. 2649 * to the value high for the pageset p.
2638 */ 2650 */
2639 2651
2640 static void setup_pagelist_highmark(struct per_cpu_pageset *p, 2652 static void setup_pagelist_highmark(struct per_cpu_pageset *p,
2641 unsigned long high) 2653 unsigned long high)
2642 { 2654 {
2643 struct per_cpu_pages *pcp; 2655 struct per_cpu_pages *pcp;
2644 2656
2645 pcp = &p->pcp; 2657 pcp = &p->pcp;
2646 pcp->high = high; 2658 pcp->high = high;
2647 pcp->batch = max(1UL, high/4); 2659 pcp->batch = max(1UL, high/4);
2648 if ((high/4) > (PAGE_SHIFT * 8)) 2660 if ((high/4) > (PAGE_SHIFT * 8))
2649 pcp->batch = PAGE_SHIFT * 8; 2661 pcp->batch = PAGE_SHIFT * 8;
2650 } 2662 }
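For illustration, assuming PAGE_SHIFT = 12, a 262144-page zone, and percpu_pagelist_fraction set to 8 (all invented numbers): process_zones() below would call this with high = 262144 / 8 = 32768, so pcp->high becomes 32768 while pcp->batch, initially 32768 / 4 = 8192, is capped to PAGE_SHIFT * 8 = 96.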
2651 2663
2652 2664
2653 #ifdef CONFIG_NUMA 2665 #ifdef CONFIG_NUMA
2654 /* 2666 /*
2655 * Boot pageset table. One per cpu which is going to be used for all 2667 * Boot pageset table. One per cpu which is going to be used for all
2656 * zones and all nodes. The parameters will be set in such a way 2668 * zones and all nodes. The parameters will be set in such a way
2657 * that an item put on a list will immediately be handed over to 2669 * that an item put on a list will immediately be handed over to
2658 * the buddy list. This is safe since pageset manipulation is done 2670 * the buddy list. This is safe since pageset manipulation is done
2659 * with interrupts disabled. 2671 * with interrupts disabled.
2660 * 2672 *
2661 * Some NUMA counter updates may also be caught by the boot pagesets. 2673 * Some NUMA counter updates may also be caught by the boot pagesets.
2662 * 2674 *
2663 * The boot_pagesets must be kept even after bootup is complete for 2675 * The boot_pagesets must be kept even after bootup is complete for
2664 * unused processors and/or zones. They do play a role for bootstrapping 2676 * unused processors and/or zones. They do play a role for bootstrapping
2665 * hotplugged processors. 2677 * hotplugged processors.
2666 * 2678 *
2667 * zoneinfo_show() and maybe other functions do 2679 * zoneinfo_show() and maybe other functions do
2668 * not check if the processor is online before following the pageset pointer. 2680 * not check if the processor is online before following the pageset pointer.
2669 * Other parts of the kernel may not check if the zone is available. 2681 * Other parts of the kernel may not check if the zone is available.
2670 */ 2682 */
2671 static struct per_cpu_pageset boot_pageset[NR_CPUS]; 2683 static struct per_cpu_pageset boot_pageset[NR_CPUS];
2672 2684
2673 /* 2685 /*
2674 * Dynamically allocate memory for the 2686 * Dynamically allocate memory for the
2675 * per cpu pageset array in struct zone. 2687 * per cpu pageset array in struct zone.
2676 */ 2688 */
2677 static int __cpuinit process_zones(int cpu) 2689 static int __cpuinit process_zones(int cpu)
2678 { 2690 {
2679 struct zone *zone, *dzone; 2691 struct zone *zone, *dzone;
2680 int node = cpu_to_node(cpu); 2692 int node = cpu_to_node(cpu);
2681 2693
2682 node_set_state(node, N_CPU); /* this node has a cpu */ 2694 node_set_state(node, N_CPU); /* this node has a cpu */
2683 2695
2684 for_each_zone(zone) { 2696 for_each_zone(zone) {
2685 2697
2686 if (!populated_zone(zone)) 2698 if (!populated_zone(zone))
2687 continue; 2699 continue;
2688 2700
2689 zone_pcp(zone, cpu) = kmalloc_node(sizeof(struct per_cpu_pageset), 2701 zone_pcp(zone, cpu) = kmalloc_node(sizeof(struct per_cpu_pageset),
2690 GFP_KERNEL, node); 2702 GFP_KERNEL, node);
2691 if (!zone_pcp(zone, cpu)) 2703 if (!zone_pcp(zone, cpu))
2692 goto bad; 2704 goto bad;
2693 2705
2694 setup_pageset(zone_pcp(zone, cpu), zone_batchsize(zone)); 2706 setup_pageset(zone_pcp(zone, cpu), zone_batchsize(zone));
2695 2707
2696 if (percpu_pagelist_fraction) 2708 if (percpu_pagelist_fraction)
2697 setup_pagelist_highmark(zone_pcp(zone, cpu), 2709 setup_pagelist_highmark(zone_pcp(zone, cpu),
2698 (zone->present_pages / percpu_pagelist_fraction)); 2710 (zone->present_pages / percpu_pagelist_fraction));
2699 } 2711 }
2700 2712
2701 return 0; 2713 return 0;
2702 bad: 2714 bad:
2703 for_each_zone(dzone) { 2715 for_each_zone(dzone) {
2704 if (!populated_zone(dzone)) 2716 if (!populated_zone(dzone))
2705 continue; 2717 continue;
2706 if (dzone == zone) 2718 if (dzone == zone)
2707 break; 2719 break;
2708 kfree(zone_pcp(dzone, cpu)); 2720 kfree(zone_pcp(dzone, cpu));
2709 zone_pcp(dzone, cpu) = NULL; 2721 zone_pcp(dzone, cpu) = NULL;
2710 } 2722 }
2711 return -ENOMEM; 2723 return -ENOMEM;
2712 } 2724 }
2713 2725
2714 static inline void free_zone_pagesets(int cpu) 2726 static inline void free_zone_pagesets(int cpu)
2715 { 2727 {
2716 struct zone *zone; 2728 struct zone *zone;
2717 2729
2718 for_each_zone(zone) { 2730 for_each_zone(zone) {
2719 struct per_cpu_pageset *pset = zone_pcp(zone, cpu); 2731 struct per_cpu_pageset *pset = zone_pcp(zone, cpu);
2720 2732
2721 /* Free per_cpu_pageset if it is slab allocated */ 2733 /* Free per_cpu_pageset if it is slab allocated */
2722 if (pset != &boot_pageset[cpu]) 2734 if (pset != &boot_pageset[cpu])
2723 kfree(pset); 2735 kfree(pset);
2724 zone_pcp(zone, cpu) = NULL; 2736 zone_pcp(zone, cpu) = NULL;
2725 } 2737 }
2726 } 2738 }
2727 2739
2728 static int __cpuinit pageset_cpuup_callback(struct notifier_block *nfb, 2740 static int __cpuinit pageset_cpuup_callback(struct notifier_block *nfb,
2729 unsigned long action, 2741 unsigned long action,
2730 void *hcpu) 2742 void *hcpu)
2731 { 2743 {
2732 int cpu = (long)hcpu; 2744 int cpu = (long)hcpu;
2733 int ret = NOTIFY_OK; 2745 int ret = NOTIFY_OK;
2734 2746
2735 switch (action) { 2747 switch (action) {
2736 case CPU_UP_PREPARE: 2748 case CPU_UP_PREPARE:
2737 case CPU_UP_PREPARE_FROZEN: 2749 case CPU_UP_PREPARE_FROZEN:
2738 if (process_zones(cpu)) 2750 if (process_zones(cpu))
2739 ret = NOTIFY_BAD; 2751 ret = NOTIFY_BAD;
2740 break; 2752 break;
2741 case CPU_UP_CANCELED: 2753 case CPU_UP_CANCELED:
2742 case CPU_UP_CANCELED_FROZEN: 2754 case CPU_UP_CANCELED_FROZEN:
2743 case CPU_DEAD: 2755 case CPU_DEAD:
2744 case CPU_DEAD_FROZEN: 2756 case CPU_DEAD_FROZEN:
2745 free_zone_pagesets(cpu); 2757 free_zone_pagesets(cpu);
2746 break; 2758 break;
2747 default: 2759 default:
2748 break; 2760 break;
2749 } 2761 }
2750 return ret; 2762 return ret;
2751 } 2763 }
2752 2764
2753 static struct notifier_block __cpuinitdata pageset_notifier = 2765 static struct notifier_block __cpuinitdata pageset_notifier =
2754 { &pageset_cpuup_callback, NULL, 0 }; 2766 { &pageset_cpuup_callback, NULL, 0 };
2755 2767
2756 void __init setup_per_cpu_pageset(void) 2768 void __init setup_per_cpu_pageset(void)
2757 { 2769 {
2758 int err; 2770 int err;
2759 2771
2760 /* Initialize per_cpu_pageset for cpu 0. 2772 /* Initialize per_cpu_pageset for cpu 0.
2761 * A cpuup callback will do this for every cpu 2773 * A cpuup callback will do this for every cpu
2762 * as it comes online 2774 * as it comes online
2763 */ 2775 */
2764 err = process_zones(smp_processor_id()); 2776 err = process_zones(smp_processor_id());
2765 BUG_ON(err); 2777 BUG_ON(err);
2766 register_cpu_notifier(&pageset_notifier); 2778 register_cpu_notifier(&pageset_notifier);
2767 } 2779 }
2768 2780
2769 #endif 2781 #endif
2770 2782
2771 static noinline __init_refok 2783 static noinline __init_refok
2772 int zone_wait_table_init(struct zone *zone, unsigned long zone_size_pages) 2784 int zone_wait_table_init(struct zone *zone, unsigned long zone_size_pages)
2773 { 2785 {
2774 int i; 2786 int i;
2775 struct pglist_data *pgdat = zone->zone_pgdat; 2787 struct pglist_data *pgdat = zone->zone_pgdat;
2776 size_t alloc_size; 2788 size_t alloc_size;
2777 2789
2778 /* 2790 /*
2779 * The per-page waitqueue mechanism uses hashed waitqueues 2791 * The per-page waitqueue mechanism uses hashed waitqueues
2780 * per zone. 2792 * per zone.
2781 */ 2793 */
2782 zone->wait_table_hash_nr_entries = 2794 zone->wait_table_hash_nr_entries =
2783 wait_table_hash_nr_entries(zone_size_pages); 2795 wait_table_hash_nr_entries(zone_size_pages);
2784 zone->wait_table_bits = 2796 zone->wait_table_bits =
2785 wait_table_bits(zone->wait_table_hash_nr_entries); 2797 wait_table_bits(zone->wait_table_hash_nr_entries);
2786 alloc_size = zone->wait_table_hash_nr_entries 2798 alloc_size = zone->wait_table_hash_nr_entries
2787 * sizeof(wait_queue_head_t); 2799 * sizeof(wait_queue_head_t);
2788 2800
2789 if (system_state == SYSTEM_BOOTING) { 2801 if (system_state == SYSTEM_BOOTING) {
2790 zone->wait_table = (wait_queue_head_t *) 2802 zone->wait_table = (wait_queue_head_t *)
2791 alloc_bootmem_node(pgdat, alloc_size); 2803 alloc_bootmem_node(pgdat, alloc_size);
2792 } else { 2804 } else {
2793 /* 2805 /*
2794 * This case means that a zone whose size was 0 gets new memory 2806 * This case means that a zone whose size was 0 gets new memory
2795 * via memory hot-add. 2807 * via memory hot-add.
2796 * But it may be the case that a new node was hot-added. In 2808 * But it may be the case that a new node was hot-added. In
2797 * this case vmalloc() will not be able to use this new node's 2809 * this case vmalloc() will not be able to use this new node's
2798 * memory - this wait_table must be initialized to use this new 2810 * memory - this wait_table must be initialized to use this new
2799 * node itself as well. 2811 * node itself as well.
2800 * To use this new node's memory, further consideration will be 2812 * To use this new node's memory, further consideration will be
2801 * necessary. 2813 * necessary.
2802 */ 2814 */
2803 zone->wait_table = vmalloc(alloc_size); 2815 zone->wait_table = vmalloc(alloc_size);
2804 } 2816 }
2805 if (!zone->wait_table) 2817 if (!zone->wait_table)
2806 return -ENOMEM; 2818 return -ENOMEM;
2807 2819
2808 for(i = 0; i < zone->wait_table_hash_nr_entries; ++i) 2820 for(i = 0; i < zone->wait_table_hash_nr_entries; ++i)
2809 init_waitqueue_head(zone->wait_table + i); 2821 init_waitqueue_head(zone->wait_table + i);
2810 2822
2811 return 0; 2823 return 0;
2812 } 2824 }
2813 2825
2814 static __meminit void zone_pcp_init(struct zone *zone) 2826 static __meminit void zone_pcp_init(struct zone *zone)
2815 { 2827 {
2816 int cpu; 2828 int cpu;
2817 unsigned long batch = zone_batchsize(zone); 2829 unsigned long batch = zone_batchsize(zone);
2818 2830
2819 for (cpu = 0; cpu < NR_CPUS; cpu++) { 2831 for (cpu = 0; cpu < NR_CPUS; cpu++) {
2820 #ifdef CONFIG_NUMA 2832 #ifdef CONFIG_NUMA
2821 /* Early boot. Slab allocator not functional yet */ 2833 /* Early boot. Slab allocator not functional yet */
2822 zone_pcp(zone, cpu) = &boot_pageset[cpu]; 2834 zone_pcp(zone, cpu) = &boot_pageset[cpu];
2823 setup_pageset(&boot_pageset[cpu],0); 2835 setup_pageset(&boot_pageset[cpu],0);
2824 #else 2836 #else
2825 setup_pageset(zone_pcp(zone,cpu), batch); 2837 setup_pageset(zone_pcp(zone,cpu), batch);
2826 #endif 2838 #endif
2827 } 2839 }
2828 if (zone->present_pages) 2840 if (zone->present_pages)
2829 printk(KERN_DEBUG " %s zone: %lu pages, LIFO batch:%lu\n", 2841 printk(KERN_DEBUG " %s zone: %lu pages, LIFO batch:%lu\n",
2830 zone->name, zone->present_pages, batch); 2842 zone->name, zone->present_pages, batch);
2831 } 2843 }
2832 2844
2833 __meminit int init_currently_empty_zone(struct zone *zone, 2845 __meminit int init_currently_empty_zone(struct zone *zone,
2834 unsigned long zone_start_pfn, 2846 unsigned long zone_start_pfn,
2835 unsigned long size, 2847 unsigned long size,
2836 enum memmap_context context) 2848 enum memmap_context context)
2837 { 2849 {
2838 struct pglist_data *pgdat = zone->zone_pgdat; 2850 struct pglist_data *pgdat = zone->zone_pgdat;
2839 int ret; 2851 int ret;
2840 ret = zone_wait_table_init(zone, size); 2852 ret = zone_wait_table_init(zone, size);
2841 if (ret) 2853 if (ret)
2842 return ret; 2854 return ret;
2843 pgdat->nr_zones = zone_idx(zone) + 1; 2855 pgdat->nr_zones = zone_idx(zone) + 1;
2844 2856
2845 zone->zone_start_pfn = zone_start_pfn; 2857 zone->zone_start_pfn = zone_start_pfn;
2846 2858
2847 memmap_init(size, pgdat->node_id, zone_idx(zone), zone_start_pfn); 2859 memmap_init(size, pgdat->node_id, zone_idx(zone), zone_start_pfn);
2848 2860
2849 zone_init_free_lists(zone); 2861 zone_init_free_lists(zone);
2850 2862
2851 return 0; 2863 return 0;
2852 } 2864 }
2853 2865
2854 #ifdef CONFIG_ARCH_POPULATES_NODE_MAP 2866 #ifdef CONFIG_ARCH_POPULATES_NODE_MAP
2855 /* 2867 /*
2856 * Basic iterator support. Return the first range of PFNs for a node 2868 * Basic iterator support. Return the first range of PFNs for a node
2857 * Note: nid == MAX_NUMNODES returns first region regardless of node 2869 * Note: nid == MAX_NUMNODES returns first region regardless of node
2858 */ 2870 */
2859 static int __meminit first_active_region_index_in_nid(int nid) 2871 static int __meminit first_active_region_index_in_nid(int nid)
2860 { 2872 {
2861 int i; 2873 int i;
2862 2874
2863 for (i = 0; i < nr_nodemap_entries; i++) 2875 for (i = 0; i < nr_nodemap_entries; i++)
2864 if (nid == MAX_NUMNODES || early_node_map[i].nid == nid) 2876 if (nid == MAX_NUMNODES || early_node_map[i].nid == nid)
2865 return i; 2877 return i;
2866 2878
2867 return -1; 2879 return -1;
2868 } 2880 }
2869 2881
2870 /* 2882 /*
2871 * Basic iterator support. Return the next active range of PFNs for a node 2883 * Basic iterator support. Return the next active range of PFNs for a node
2872 * Note: nid == MAX_NUMNODES returns next region regardless of node 2884 * Note: nid == MAX_NUMNODES returns next region regardless of node
2873 */ 2885 */
2874 static int __meminit next_active_region_index_in_nid(int index, int nid) 2886 static int __meminit next_active_region_index_in_nid(int index, int nid)
2875 { 2887 {
2876 for (index = index + 1; index < nr_nodemap_entries; index++) 2888 for (index = index + 1; index < nr_nodemap_entries; index++)
2877 if (nid == MAX_NUMNODES || early_node_map[index].nid == nid) 2889 if (nid == MAX_NUMNODES || early_node_map[index].nid == nid)
2878 return index; 2890 return index;
2879 2891
2880 return -1; 2892 return -1;
2881 } 2893 }
2882 2894
2883 #ifndef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID 2895 #ifndef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
2884 /* 2896 /*
2885 * Required by SPARSEMEM. Given a PFN, return what node the PFN is on. 2897 * Required by SPARSEMEM. Given a PFN, return what node the PFN is on.
2886 * Architectures may implement their own version but if add_active_range() 2898 * Architectures may implement their own version but if add_active_range()
2887 * was used and there are no special requirements, this is a convenient 2899 * was used and there are no special requirements, this is a convenient
2888 * alternative 2900 * alternative
2889 */ 2901 */
2890 int __meminit early_pfn_to_nid(unsigned long pfn) 2902 int __meminit early_pfn_to_nid(unsigned long pfn)
2891 { 2903 {
2892 int i; 2904 int i;
2893 2905
2894 for (i = 0; i < nr_nodemap_entries; i++) { 2906 for (i = 0; i < nr_nodemap_entries; i++) {
2895 unsigned long start_pfn = early_node_map[i].start_pfn; 2907 unsigned long start_pfn = early_node_map[i].start_pfn;
2896 unsigned long end_pfn = early_node_map[i].end_pfn; 2908 unsigned long end_pfn = early_node_map[i].end_pfn;
2897 2909
2898 if (start_pfn <= pfn && pfn < end_pfn) 2910 if (start_pfn <= pfn && pfn < end_pfn)
2899 return early_node_map[i].nid; 2911 return early_node_map[i].nid;
2900 } 2912 }
2901 2913
2902 return 0; 2914 return 0;
2903 } 2915 }
2904 #endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */ 2916 #endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */
2905 2917
2906 /* Basic iterator support to walk early_node_map[] */ 2918 /* Basic iterator support to walk early_node_map[] */
2907 #define for_each_active_range_index_in_nid(i, nid) \ 2919 #define for_each_active_range_index_in_nid(i, nid) \
2908 for (i = first_active_region_index_in_nid(nid); i != -1; \ 2920 for (i = first_active_region_index_in_nid(nid); i != -1; \
2909 i = next_active_region_index_in_nid(i, nid)) 2921 i = next_active_region_index_in_nid(i, nid))
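The iterator above is nothing more than a linear scan of early_node_map[] filtered by nid (or unfiltered when nid == MAX_NUMNODES). A minimal standalone C sketch of the same pattern, with an invented three-entry map, shows how the macro visits only one node's ranges; the helpers are reimplemented here for illustration, not taken from any kernel header:

#include <stdio.h>

#define MAX_NUMNODES 4

struct node_active_region {
        unsigned long start_pfn;
        unsigned long end_pfn;
        int nid;
};

/* invented early_node_map covering two nodes */
static struct node_active_region early_node_map[] = {
        { 0x000, 0x100, 0 },
        { 0x200, 0x300, 1 },
        { 0x400, 0x500, 0 },
};
static int nr_nodemap_entries = 3;

static int first_active_region_index_in_nid(int nid)
{
        int i;

        for (i = 0; i < nr_nodemap_entries; i++)
                if (nid == MAX_NUMNODES || early_node_map[i].nid == nid)
                        return i;
        return -1;
}

static int next_active_region_index_in_nid(int index, int nid)
{
        for (index = index + 1; index < nr_nodemap_entries; index++)
                if (nid == MAX_NUMNODES || early_node_map[index].nid == nid)
                        return index;
        return -1;
}

#define for_each_active_range_index_in_nid(i, nid) \
        for (i = first_active_region_index_in_nid(nid); i != -1; \
             i = next_active_region_index_in_nid(i, nid))

int main(void)
{
        int i;

        /* visits entries 0 and 2, the two ranges that belong to node 0 */
        for_each_active_range_index_in_nid(i, 0)
                printf("node 0 range %d: %lx-%lx\n", i,
                       early_node_map[i].start_pfn, early_node_map[i].end_pfn);
        return 0;
}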
2910 2922
2911 /** 2923 /**
2912 * free_bootmem_with_active_regions - Call free_bootmem_node for each active range 2924 * free_bootmem_with_active_regions - Call free_bootmem_node for each active range
2913 * @nid: The node to free memory on. If MAX_NUMNODES, all nodes are freed. 2925 * @nid: The node to free memory on. If MAX_NUMNODES, all nodes are freed.
2914 * @max_low_pfn: The highest PFN that will be passed to free_bootmem_node 2926 * @max_low_pfn: The highest PFN that will be passed to free_bootmem_node
2915 * 2927 *
2916 * If an architecture guarantees that all ranges registered with 2928 * If an architecture guarantees that all ranges registered with
2917 * add_active_ranges() contain no holes and may be freed, this 2929 * add_active_ranges() contain no holes and may be freed, this
2918 * function may be used instead of calling free_bootmem() manually. 2930 * function may be used instead of calling free_bootmem() manually.
2919 */ 2931 */
2920 void __init free_bootmem_with_active_regions(int nid, 2932 void __init free_bootmem_with_active_regions(int nid,
2921 unsigned long max_low_pfn) 2933 unsigned long max_low_pfn)
2922 { 2934 {
2923 int i; 2935 int i;
2924 2936
2925 for_each_active_range_index_in_nid(i, nid) { 2937 for_each_active_range_index_in_nid(i, nid) {
2926 unsigned long size_pages = 0; 2938 unsigned long size_pages = 0;
2927 unsigned long end_pfn = early_node_map[i].end_pfn; 2939 unsigned long end_pfn = early_node_map[i].end_pfn;
2928 2940
2929 if (early_node_map[i].start_pfn >= max_low_pfn) 2941 if (early_node_map[i].start_pfn >= max_low_pfn)
2930 continue; 2942 continue;
2931 2943
2932 if (end_pfn > max_low_pfn) 2944 if (end_pfn > max_low_pfn)
2933 end_pfn = max_low_pfn; 2945 end_pfn = max_low_pfn;
2934 2946
2935 size_pages = end_pfn - early_node_map[i].start_pfn; 2947 size_pages = end_pfn - early_node_map[i].start_pfn;
2936 free_bootmem_node(NODE_DATA(early_node_map[i].nid), 2948 free_bootmem_node(NODE_DATA(early_node_map[i].nid),
2937 PFN_PHYS(early_node_map[i].start_pfn), 2949 PFN_PHYS(early_node_map[i].start_pfn),
2938 size_pages << PAGE_SHIFT); 2950 size_pages << PAGE_SHIFT);
2939 } 2951 }
2940 } 2952 }
2941 2953
2942 /** 2954 /**
2943 * sparse_memory_present_with_active_regions - Call memory_present for each active range 2955 * sparse_memory_present_with_active_regions - Call memory_present for each active range
2944 * @nid: The node to call memory_present for. If MAX_NUMNODES, all nodes will be used. 2956 * @nid: The node to call memory_present for. If MAX_NUMNODES, all nodes will be used.
2945 * 2957 *
2946 * If an architecture guarantees that all ranges registered with 2958 * If an architecture guarantees that all ranges registered with
2947 * add_active_ranges() contain no holes and may be freed, this 2959 * add_active_ranges() contain no holes and may be freed, this
2948 * function may be used instead of calling memory_present() manually. 2960 * function may be used instead of calling memory_present() manually.
2949 */ 2961 */
2950 void __init sparse_memory_present_with_active_regions(int nid) 2962 void __init sparse_memory_present_with_active_regions(int nid)
2951 { 2963 {
2952 int i; 2964 int i;
2953 2965
2954 for_each_active_range_index_in_nid(i, nid) 2966 for_each_active_range_index_in_nid(i, nid)
2955 memory_present(early_node_map[i].nid, 2967 memory_present(early_node_map[i].nid,
2956 early_node_map[i].start_pfn, 2968 early_node_map[i].start_pfn,
2957 early_node_map[i].end_pfn); 2969 early_node_map[i].end_pfn);
2958 } 2970 }
2959 2971
2960 /** 2972 /**
2961 * push_node_boundaries - Push node boundaries to at least the requested boundary 2973 * push_node_boundaries - Push node boundaries to at least the requested boundary
2962 * @nid: The nid of the node to push the boundary for 2974 * @nid: The nid of the node to push the boundary for
2963 * @start_pfn: The start pfn of the node 2975 * @start_pfn: The start pfn of the node
2964 * @end_pfn: The end pfn of the node 2976 * @end_pfn: The end pfn of the node
2965 * 2977 *
2966 * In reserve-based hot-add, a mem_map is allocated that remains unused until hotadd 2978 * In reserve-based hot-add, a mem_map is allocated that remains unused until hotadd
2967 * time. Specifically, on x86_64, SRAT will report ranges that can potentially 2979 * time. Specifically, on x86_64, SRAT will report ranges that can potentially
2968 * be hotplugged even though no physical memory exists. This function allows 2980 * be hotplugged even though no physical memory exists. This function allows
2969 * an arch to push out the node boundaries so that a mem_map is allocated that can 2981 * an arch to push out the node boundaries so that a mem_map is allocated that can
2970 * be used later. 2982 * be used later.
2971 */ 2983 */
2972 #ifdef CONFIG_MEMORY_HOTPLUG_RESERVE 2984 #ifdef CONFIG_MEMORY_HOTPLUG_RESERVE
2973 void __init push_node_boundaries(unsigned int nid, 2985 void __init push_node_boundaries(unsigned int nid,
2974 unsigned long start_pfn, unsigned long end_pfn) 2986 unsigned long start_pfn, unsigned long end_pfn)
2975 { 2987 {
2976 printk(KERN_DEBUG "Entering push_node_boundaries(%u, %lu, %lu)\n", 2988 printk(KERN_DEBUG "Entering push_node_boundaries(%u, %lu, %lu)\n",
2977 nid, start_pfn, end_pfn); 2989 nid, start_pfn, end_pfn);
2978 2990
2979 /* Initialise the boundary for this node if necessary */ 2991 /* Initialise the boundary for this node if necessary */
2980 if (node_boundary_end_pfn[nid] == 0) 2992 if (node_boundary_end_pfn[nid] == 0)
2981 node_boundary_start_pfn[nid] = -1UL; 2993 node_boundary_start_pfn[nid] = -1UL;
2982 2994
2983 /* Update the boundaries */ 2995 /* Update the boundaries */
2984 if (node_boundary_start_pfn[nid] > start_pfn) 2996 if (node_boundary_start_pfn[nid] > start_pfn)
2985 node_boundary_start_pfn[nid] = start_pfn; 2997 node_boundary_start_pfn[nid] = start_pfn;
2986 if (node_boundary_end_pfn[nid] < end_pfn) 2998 if (node_boundary_end_pfn[nid] < end_pfn)
2987 node_boundary_end_pfn[nid] = end_pfn; 2999 node_boundary_end_pfn[nid] = end_pfn;
2988 } 3000 }
2989 3001
2990 /* If necessary, push the node boundary out for reserve hotadd */ 3002 /* If necessary, push the node boundary out for reserve hotadd */
2991 static void __meminit account_node_boundary(unsigned int nid, 3003 static void __meminit account_node_boundary(unsigned int nid,
2992 unsigned long *start_pfn, unsigned long *end_pfn) 3004 unsigned long *start_pfn, unsigned long *end_pfn)
2993 { 3005 {
2994 printk(KERN_DEBUG "Entering account_node_boundary(%u, %lu, %lu)\n", 3006 printk(KERN_DEBUG "Entering account_node_boundary(%u, %lu, %lu)\n",
2995 nid, *start_pfn, *end_pfn); 3007 nid, *start_pfn, *end_pfn);
2996 3008
2997 /* Return if boundary information has not been provided */ 3009 /* Return if boundary information has not been provided */
2998 if (node_boundary_end_pfn[nid] == 0) 3010 if (node_boundary_end_pfn[nid] == 0)
2999 return; 3011 return;
3000 3012
3001 /* Check the boundaries and update if necessary */ 3013 /* Check the boundaries and update if necessary */
3002 if (node_boundary_start_pfn[nid] < *start_pfn) 3014 if (node_boundary_start_pfn[nid] < *start_pfn)
3003 *start_pfn = node_boundary_start_pfn[nid]; 3015 *start_pfn = node_boundary_start_pfn[nid];
3004 if (node_boundary_end_pfn[nid] > *end_pfn) 3016 if (node_boundary_end_pfn[nid] > *end_pfn)
3005 *end_pfn = node_boundary_end_pfn[nid]; 3017 *end_pfn = node_boundary_end_pfn[nid];
3006 } 3018 }
3007 #else 3019 #else
3008 void __init push_node_boundaries(unsigned int nid, 3020 void __init push_node_boundaries(unsigned int nid,
3009 unsigned long start_pfn, unsigned long end_pfn) {} 3021 unsigned long start_pfn, unsigned long end_pfn) {}
3010 3022
3011 static void __meminit account_node_boundary(unsigned int nid, 3023 static void __meminit account_node_boundary(unsigned int nid,
3012 unsigned long *start_pfn, unsigned long *end_pfn) {} 3024 unsigned long *start_pfn, unsigned long *end_pfn) {}
3013 #endif 3025 #endif
3014 3026
3015 3027
3016 /** 3028 /**
3017 * get_pfn_range_for_nid - Return the start and end page frames for a node 3029 * get_pfn_range_for_nid - Return the start and end page frames for a node
3018 * @nid: The nid to return the range for. If MAX_NUMNODES, the min and max PFN are returned. 3030 * @nid: The nid to return the range for. If MAX_NUMNODES, the min and max PFN are returned.
3019 * @start_pfn: Passed by reference. On return, it will have the node start_pfn. 3031 * @start_pfn: Passed by reference. On return, it will have the node start_pfn.
3020 * @end_pfn: Passed by reference. On return, it will have the node end_pfn. 3032 * @end_pfn: Passed by reference. On return, it will have the node end_pfn.
3021 * 3033 *
3022 * It returns the start and end page frame of a node based on information 3034 * It returns the start and end page frame of a node based on information
3023 * provided by an arch calling add_active_range(). If called for a node 3035 * provided by an arch calling add_active_range(). If called for a node
3024 * with no available memory, a warning is printed and the start and end 3036 * with no available memory, a warning is printed and the start and end
3025 * PFNs will be 0. 3037 * PFNs will be 0.
3026 */ 3038 */
3027 void __meminit get_pfn_range_for_nid(unsigned int nid, 3039 void __meminit get_pfn_range_for_nid(unsigned int nid,
3028 unsigned long *start_pfn, unsigned long *end_pfn) 3040 unsigned long *start_pfn, unsigned long *end_pfn)
3029 { 3041 {
3030 int i; 3042 int i;
3031 *start_pfn = -1UL; 3043 *start_pfn = -1UL;
3032 *end_pfn = 0; 3044 *end_pfn = 0;
3033 3045
3034 for_each_active_range_index_in_nid(i, nid) { 3046 for_each_active_range_index_in_nid(i, nid) {
3035 *start_pfn = min(*start_pfn, early_node_map[i].start_pfn); 3047 *start_pfn = min(*start_pfn, early_node_map[i].start_pfn);
3036 *end_pfn = max(*end_pfn, early_node_map[i].end_pfn); 3048 *end_pfn = max(*end_pfn, early_node_map[i].end_pfn);
3037 } 3049 }
3038 3050
3039 if (*start_pfn == -1UL) 3051 if (*start_pfn == -1UL)
3040 *start_pfn = 0; 3052 *start_pfn = 0;
3041 3053
3042 /* Push the node boundaries out if requested */ 3054 /* Push the node boundaries out if requested */
3043 account_node_boundary(nid, start_pfn, end_pfn); 3055 account_node_boundary(nid, start_pfn, end_pfn);
3044 } 3056 }
3045 3057
3046 /* 3058 /*
3047 * This finds a zone that can be used for ZONE_MOVABLE pages. The 3059 * This finds a zone that can be used for ZONE_MOVABLE pages. The
3048 * assumption is made that zones within a node are ordered by monotonically 3060 * assumption is made that zones within a node are ordered by monotonically
3049 * increasing memory addresses so that the "highest" populated zone is used 3061 * increasing memory addresses so that the "highest" populated zone is used
3050 */ 3062 */
3051 void __init find_usable_zone_for_movable(void) 3063 void __init find_usable_zone_for_movable(void)
3052 { 3064 {
3053 int zone_index; 3065 int zone_index;
3054 for (zone_index = MAX_NR_ZONES - 1; zone_index >= 0; zone_index--) { 3066 for (zone_index = MAX_NR_ZONES - 1; zone_index >= 0; zone_index--) {
3055 if (zone_index == ZONE_MOVABLE) 3067 if (zone_index == ZONE_MOVABLE)
3056 continue; 3068 continue;
3057 3069
3058 if (arch_zone_highest_possible_pfn[zone_index] > 3070 if (arch_zone_highest_possible_pfn[zone_index] >
3059 arch_zone_lowest_possible_pfn[zone_index]) 3071 arch_zone_lowest_possible_pfn[zone_index])
3060 break; 3072 break;
3061 } 3073 }
3062 3074
3063 VM_BUG_ON(zone_index == -1); 3075 VM_BUG_ON(zone_index == -1);
3064 movable_zone = zone_index; 3076 movable_zone = zone_index;
3065 } 3077 }
3066 3078
3067 /* 3079 /*
3068 * The zone ranges provided by the architecture do not include ZONE_MOVABLE 3080 * The zone ranges provided by the architecture do not include ZONE_MOVABLE
3069 * because it is sized independently of the architecture. Unlike the other zones, 3081 * because it is sized independently of the architecture. Unlike the other zones,
3070 * the starting point for ZONE_MOVABLE is not fixed. It may be different 3082 * the starting point for ZONE_MOVABLE is not fixed. It may be different
3071 * in each node depending on the size of each node and how evenly kernelcore 3083 * in each node depending on the size of each node and how evenly kernelcore
3072 * is distributed. This helper function adjusts the zone ranges 3084 * is distributed. This helper function adjusts the zone ranges
3073 * provided by the architecture for a given node by using the end of the 3085 * provided by the architecture for a given node by using the end of the
3074 * highest usable zone for ZONE_MOVABLE. This preserves the assumption that 3086 * highest usable zone for ZONE_MOVABLE. This preserves the assumption that
3075 * zones within a node are in order of monotonically increasing memory addresses 3087 * zones within a node are in order of monotonically increasing memory addresses
3076 */ 3088 */
3077 void __meminit adjust_zone_range_for_zone_movable(int nid, 3089 void __meminit adjust_zone_range_for_zone_movable(int nid,
3078 unsigned long zone_type, 3090 unsigned long zone_type,
3079 unsigned long node_start_pfn, 3091 unsigned long node_start_pfn,
3080 unsigned long node_end_pfn, 3092 unsigned long node_end_pfn,
3081 unsigned long *zone_start_pfn, 3093 unsigned long *zone_start_pfn,
3082 unsigned long *zone_end_pfn) 3094 unsigned long *zone_end_pfn)
3083 { 3095 {
3084 /* Only adjust if ZONE_MOVABLE is on this node */ 3096 /* Only adjust if ZONE_MOVABLE is on this node */
3085 if (zone_movable_pfn[nid]) { 3097 if (zone_movable_pfn[nid]) {
3086 /* Size ZONE_MOVABLE */ 3098 /* Size ZONE_MOVABLE */
3087 if (zone_type == ZONE_MOVABLE) { 3099 if (zone_type == ZONE_MOVABLE) {
3088 *zone_start_pfn = zone_movable_pfn[nid]; 3100 *zone_start_pfn = zone_movable_pfn[nid];
3089 *zone_end_pfn = min(node_end_pfn, 3101 *zone_end_pfn = min(node_end_pfn,
3090 arch_zone_highest_possible_pfn[movable_zone]); 3102 arch_zone_highest_possible_pfn[movable_zone]);
3091 3103
3092 /* Adjust for ZONE_MOVABLE starting within this range */ 3104 /* Adjust for ZONE_MOVABLE starting within this range */
3093 } else if (*zone_start_pfn < zone_movable_pfn[nid] && 3105 } else if (*zone_start_pfn < zone_movable_pfn[nid] &&
3094 *zone_end_pfn > zone_movable_pfn[nid]) { 3106 *zone_end_pfn > zone_movable_pfn[nid]) {
3095 *zone_end_pfn = zone_movable_pfn[nid]; 3107 *zone_end_pfn = zone_movable_pfn[nid];
3096 3108
3097 /* Check if this whole range is within ZONE_MOVABLE */ 3109 /* Check if this whole range is within ZONE_MOVABLE */
3098 } else if (*zone_start_pfn >= zone_movable_pfn[nid]) 3110 } else if (*zone_start_pfn >= zone_movable_pfn[nid])
3099 *zone_start_pfn = *zone_end_pfn; 3111 *zone_start_pfn = *zone_end_pfn;
3100 } 3112 }
3101 } 3113 }
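A short standalone sketch of just these three branches may help. The single-node arrays, the sample PFNs and the movable_zone value below are invented for illustration; only the branch logic mirrors the function above:

#include <stdio.h>

#define MAX_NR_ZONES 4
#define ZONE_MOVABLE 3

static unsigned long zone_movable_pfn[1] = { 0x40000 };        /* invented split point */
static unsigned long arch_zone_highest_possible_pfn[MAX_NR_ZONES] = {
        0x1000, 0x10000, 0x80000, 0
};
static int movable_zone = 2;    /* pretend the highest usable zone is index 2 */

static void adjust(unsigned long zone_type, unsigned long node_end_pfn,
                   unsigned long *zone_start_pfn, unsigned long *zone_end_pfn)
{
        int nid = 0;

        if (!zone_movable_pfn[nid])
                return;
        if (zone_type == ZONE_MOVABLE) {
                /* case 1: size ZONE_MOVABLE itself, from the split point up */
                *zone_start_pfn = zone_movable_pfn[nid];
                *zone_end_pfn = node_end_pfn < arch_zone_highest_possible_pfn[movable_zone]
                                ? node_end_pfn
                                : arch_zone_highest_possible_pfn[movable_zone];
        } else if (*zone_start_pfn < zone_movable_pfn[nid] &&
                   *zone_end_pfn > zone_movable_pfn[nid]) {
                /* case 2: the split point lands inside this zone, so truncate it */
                *zone_end_pfn = zone_movable_pfn[nid];
        } else if (*zone_start_pfn >= zone_movable_pfn[nid]) {
                /* case 3: the whole zone lies above the split point, so empty it */
                *zone_start_pfn = *zone_end_pfn;
        }
}

int main(void)
{
        unsigned long start = 0x10000, end = 0x80000;

        adjust(2, 0x80000, &start, &end);               /* zone straddles the split */
        printf("donor zone:   %lx-%lx\n", start, end);  /* 10000-40000 */

        start = 0;
        end = 0;
        adjust(ZONE_MOVABLE, 0x80000, &start, &end);
        printf("ZONE_MOVABLE: %lx-%lx\n", start, end);  /* 40000-80000 */
        return 0;
}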
3102 3114
3103 /* 3115 /*
3104 * Return the number of pages a zone spans in a node, including holes 3116 * Return the number of pages a zone spans in a node, including holes
3105 * present_pages = zone_spanned_pages_in_node() - zone_absent_pages_in_node() 3117 * present_pages = zone_spanned_pages_in_node() - zone_absent_pages_in_node()
3106 */ 3118 */
3107 static unsigned long __meminit zone_spanned_pages_in_node(int nid, 3119 static unsigned long __meminit zone_spanned_pages_in_node(int nid,
3108 unsigned long zone_type, 3120 unsigned long zone_type,
3109 unsigned long *ignored) 3121 unsigned long *ignored)
3110 { 3122 {
3111 unsigned long node_start_pfn, node_end_pfn; 3123 unsigned long node_start_pfn, node_end_pfn;
3112 unsigned long zone_start_pfn, zone_end_pfn; 3124 unsigned long zone_start_pfn, zone_end_pfn;
3113 3125
3114 /* Get the start and end of the node and zone */ 3126 /* Get the start and end of the node and zone */
3115 get_pfn_range_for_nid(nid, &node_start_pfn, &node_end_pfn); 3127 get_pfn_range_for_nid(nid, &node_start_pfn, &node_end_pfn);
3116 zone_start_pfn = arch_zone_lowest_possible_pfn[zone_type]; 3128 zone_start_pfn = arch_zone_lowest_possible_pfn[zone_type];
3117 zone_end_pfn = arch_zone_highest_possible_pfn[zone_type]; 3129 zone_end_pfn = arch_zone_highest_possible_pfn[zone_type];
3118 adjust_zone_range_for_zone_movable(nid, zone_type, 3130 adjust_zone_range_for_zone_movable(nid, zone_type,
3119 node_start_pfn, node_end_pfn, 3131 node_start_pfn, node_end_pfn,
3120 &zone_start_pfn, &zone_end_pfn); 3132 &zone_start_pfn, &zone_end_pfn);
3121 3133
3122 /* Check that this node has pages within the zone's required range */ 3134 /* Check that this node has pages within the zone's required range */
3123 if (zone_end_pfn < node_start_pfn || zone_start_pfn > node_end_pfn) 3135 if (zone_end_pfn < node_start_pfn || zone_start_pfn > node_end_pfn)
3124 return 0; 3136 return 0;
3125 3137
3126 /* Move the zone boundaries inside the node if necessary */ 3138 /* Move the zone boundaries inside the node if necessary */
3127 zone_end_pfn = min(zone_end_pfn, node_end_pfn); 3139 zone_end_pfn = min(zone_end_pfn, node_end_pfn);
3128 zone_start_pfn = max(zone_start_pfn, node_start_pfn); 3140 zone_start_pfn = max(zone_start_pfn, node_start_pfn);
3129 3141
3130 /* Return the spanned pages */ 3142 /* Return the spanned pages */
3131 return zone_end_pfn - zone_start_pfn; 3143 return zone_end_pfn - zone_start_pfn;
3132 } 3144 }
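Ignoring the ZONE_MOVABLE adjustment, the calculation above reduces to clamping the architectural zone range into the node's PFN range and subtracting. A tiny standalone sketch with invented PFNs:

#include <stdio.h>

static unsigned long spanned(unsigned long zone_start, unsigned long zone_end,
                             unsigned long node_start, unsigned long node_end)
{
        /* no overlap at all: the zone has no pages on this node */
        if (zone_end < node_start || zone_start > node_end)
                return 0;

        /* move the zone boundaries inside the node if necessary */
        if (zone_end > node_end)
                zone_end = node_end;
        if (zone_start < node_start)
                zone_start = node_start;

        return zone_end - zone_start;
}

int main(void)
{
        /* a zone covering pfns 0-0x100000 on a node spanning 0x80000-0x180000 */
        printf("%lu spanned pages\n", spanned(0x0, 0x100000, 0x80000, 0x180000));
        return 0;
}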
3133 3145
3134 /* 3146 /*
3135 * Return the number of holes in a range on a node. If nid is MAX_NUMNODES, 3147 * Return the number of holes in a range on a node. If nid is MAX_NUMNODES,
3136 * then all holes in the requested range will be accounted for. 3148 * then all holes in the requested range will be accounted for.
3137 */ 3149 */
3138 unsigned long __meminit __absent_pages_in_range(int nid, 3150 unsigned long __meminit __absent_pages_in_range(int nid,
3139 unsigned long range_start_pfn, 3151 unsigned long range_start_pfn,
3140 unsigned long range_end_pfn) 3152 unsigned long range_end_pfn)
3141 { 3153 {
3142 int i = 0; 3154 int i = 0;
3143 unsigned long prev_end_pfn = 0, hole_pages = 0; 3155 unsigned long prev_end_pfn = 0, hole_pages = 0;
3144 unsigned long start_pfn; 3156 unsigned long start_pfn;
3145 3157
3146 /* Find the end_pfn of the first active range of pfns in the node */ 3158 /* Find the end_pfn of the first active range of pfns in the node */
3147 i = first_active_region_index_in_nid(nid); 3159 i = first_active_region_index_in_nid(nid);
3148 if (i == -1) 3160 if (i == -1)
3149 return 0; 3161 return 0;
3150 3162
3151 prev_end_pfn = min(early_node_map[i].start_pfn, range_end_pfn); 3163 prev_end_pfn = min(early_node_map[i].start_pfn, range_end_pfn);
3152 3164
3153 /* Account for ranges before physical memory on this node */ 3165 /* Account for ranges before physical memory on this node */
3154 if (early_node_map[i].start_pfn > range_start_pfn) 3166 if (early_node_map[i].start_pfn > range_start_pfn)
3155 hole_pages = prev_end_pfn - range_start_pfn; 3167 hole_pages = prev_end_pfn - range_start_pfn;
3156 3168
3157 /* Find all holes for the zone within the node */ 3169 /* Find all holes for the zone within the node */
3158 for (; i != -1; i = next_active_region_index_in_nid(i, nid)) { 3170 for (; i != -1; i = next_active_region_index_in_nid(i, nid)) {
3159 3171
3160 /* No need to continue if prev_end_pfn is outside the zone */ 3172 /* No need to continue if prev_end_pfn is outside the zone */
3161 if (prev_end_pfn >= range_end_pfn) 3173 if (prev_end_pfn >= range_end_pfn)
3162 break; 3174 break;
3163 3175
3164 /* Make sure the end of the zone is not within the hole */ 3176 /* Make sure the end of the zone is not within the hole */
3165 start_pfn = min(early_node_map[i].start_pfn, range_end_pfn); 3177 start_pfn = min(early_node_map[i].start_pfn, range_end_pfn);
3166 prev_end_pfn = max(prev_end_pfn, range_start_pfn); 3178 prev_end_pfn = max(prev_end_pfn, range_start_pfn);
3167 3179
3168 /* Update the hole size count and move on */ 3180 /* Update the hole size count and move on */
3169 if (start_pfn > range_start_pfn) { 3181 if (start_pfn > range_start_pfn) {
3170 BUG_ON(prev_end_pfn > start_pfn); 3182 BUG_ON(prev_end_pfn > start_pfn);
3171 hole_pages += start_pfn - prev_end_pfn; 3183 hole_pages += start_pfn - prev_end_pfn;
3172 } 3184 }
3173 prev_end_pfn = early_node_map[i].end_pfn; 3185 prev_end_pfn = early_node_map[i].end_pfn;
3174 } 3186 }
3175 3187
3176 /* Account for ranges past physical memory on this node */ 3188 /* Account for ranges past physical memory on this node */
3177 if (range_end_pfn > prev_end_pfn) 3189 if (range_end_pfn > prev_end_pfn)
3178 hole_pages += range_end_pfn - 3190 hole_pages += range_end_pfn -
3179 max(range_start_pfn, prev_end_pfn); 3191 max(range_start_pfn, prev_end_pfn);
3180 3192
3181 return hole_pages; 3193 return hole_pages;
3182 } 3194 }
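A simplified standalone model of the same hole accounting may make the bookkeeping clearer. The sample map is invented, and the sketch assumes a single node with sorted, non-overlapping ranges; the kernel version above additionally clips against range_start and handles the nid == MAX_NUMNODES case:

#include <stdio.h>

struct range { unsigned long start_pfn, end_pfn; };

/* invented active ranges: pfns 100-200 and 300-400 are backed by memory */
static struct range map[] = { { 100, 200 }, { 300, 400 } };

static unsigned long absent_pages(unsigned long range_start, unsigned long range_end)
{
        unsigned long prev_end = range_start;
        unsigned long holes = 0;
        unsigned int i;

        for (i = 0; i < sizeof(map) / sizeof(map[0]); i++) {
                unsigned long start = map[i].start_pfn;

                if (prev_end >= range_end)
                        break;
                if (start > range_end)
                        start = range_end;
                if (start > prev_end)
                        holes += start - prev_end;      /* gap before this range */
                prev_end = map[i].end_pfn;
        }
        if (range_end > prev_end)
                holes += range_end - prev_end;          /* gap after the last range */

        return holes;
}

int main(void)
{
        /* 100 pages before, 100 between and 100 after the two ranges */
        printf("%lu absent pages\n", absent_pages(0, 500));
        return 0;
}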
3183 3195
3184 /** 3196 /**
3185 * absent_pages_in_range - Return number of page frames in holes within a range 3197 * absent_pages_in_range - Return number of page frames in holes within a range
3186 * @start_pfn: The start PFN to start searching for holes 3198 * @start_pfn: The start PFN to start searching for holes
3187 * @end_pfn: The end PFN to stop searching for holes 3199 * @end_pfn: The end PFN to stop searching for holes
3188 * 3200 *
3189 * It returns the number of page frames in memory holes within a range. 3201 * It returns the number of page frames in memory holes within a range.
3190 */ 3202 */
3191 unsigned long __init absent_pages_in_range(unsigned long start_pfn, 3203 unsigned long __init absent_pages_in_range(unsigned long start_pfn,
3192 unsigned long end_pfn) 3204 unsigned long end_pfn)
3193 { 3205 {
3194 return __absent_pages_in_range(MAX_NUMNODES, start_pfn, end_pfn); 3206 return __absent_pages_in_range(MAX_NUMNODES, start_pfn, end_pfn);
3195 } 3207 }
3196 3208
3197 /* Return the number of page frames in holes in a zone on a node */ 3209 /* Return the number of page frames in holes in a zone on a node */
3198 static unsigned long __meminit zone_absent_pages_in_node(int nid, 3210 static unsigned long __meminit zone_absent_pages_in_node(int nid,
3199 unsigned long zone_type, 3211 unsigned long zone_type,
3200 unsigned long *ignored) 3212 unsigned long *ignored)
3201 { 3213 {
3202 unsigned long node_start_pfn, node_end_pfn; 3214 unsigned long node_start_pfn, node_end_pfn;
3203 unsigned long zone_start_pfn, zone_end_pfn; 3215 unsigned long zone_start_pfn, zone_end_pfn;
3204 3216
3205 get_pfn_range_for_nid(nid, &node_start_pfn, &node_end_pfn); 3217 get_pfn_range_for_nid(nid, &node_start_pfn, &node_end_pfn);
3206 zone_start_pfn = max(arch_zone_lowest_possible_pfn[zone_type], 3218 zone_start_pfn = max(arch_zone_lowest_possible_pfn[zone_type],
3207 node_start_pfn); 3219 node_start_pfn);
3208 zone_end_pfn = min(arch_zone_highest_possible_pfn[zone_type], 3220 zone_end_pfn = min(arch_zone_highest_possible_pfn[zone_type],
3209 node_end_pfn); 3221 node_end_pfn);
3210 3222
3211 adjust_zone_range_for_zone_movable(nid, zone_type, 3223 adjust_zone_range_for_zone_movable(nid, zone_type,
3212 node_start_pfn, node_end_pfn, 3224 node_start_pfn, node_end_pfn,
3213 &zone_start_pfn, &zone_end_pfn); 3225 &zone_start_pfn, &zone_end_pfn);
3214 return __absent_pages_in_range(nid, zone_start_pfn, zone_end_pfn); 3226 return __absent_pages_in_range(nid, zone_start_pfn, zone_end_pfn);
3215 } 3227 }
3216 3228
3217 #else 3229 #else
3218 static inline unsigned long __meminit zone_spanned_pages_in_node(int nid, 3230 static inline unsigned long __meminit zone_spanned_pages_in_node(int nid,
3219 unsigned long zone_type, 3231 unsigned long zone_type,
3220 unsigned long *zones_size) 3232 unsigned long *zones_size)
3221 { 3233 {
3222 return zones_size[zone_type]; 3234 return zones_size[zone_type];
3223 } 3235 }
3224 3236
3225 static inline unsigned long __meminit zone_absent_pages_in_node(int nid, 3237 static inline unsigned long __meminit zone_absent_pages_in_node(int nid,
3226 unsigned long zone_type, 3238 unsigned long zone_type,
3227 unsigned long *zholes_size) 3239 unsigned long *zholes_size)
3228 { 3240 {
3229 if (!zholes_size) 3241 if (!zholes_size)
3230 return 0; 3242 return 0;
3231 3243
3232 return zholes_size[zone_type]; 3244 return zholes_size[zone_type];
3233 } 3245 }
3234 3246
3235 #endif 3247 #endif
3236 3248
3237 static void __meminit calculate_node_totalpages(struct pglist_data *pgdat, 3249 static void __meminit calculate_node_totalpages(struct pglist_data *pgdat,
3238 unsigned long *zones_size, unsigned long *zholes_size) 3250 unsigned long *zones_size, unsigned long *zholes_size)
3239 { 3251 {
3240 unsigned long realtotalpages, totalpages = 0; 3252 unsigned long realtotalpages, totalpages = 0;
3241 enum zone_type i; 3253 enum zone_type i;
3242 3254
3243 for (i = 0; i < MAX_NR_ZONES; i++) 3255 for (i = 0; i < MAX_NR_ZONES; i++)
3244 totalpages += zone_spanned_pages_in_node(pgdat->node_id, i, 3256 totalpages += zone_spanned_pages_in_node(pgdat->node_id, i,
3245 zones_size); 3257 zones_size);
3246 pgdat->node_spanned_pages = totalpages; 3258 pgdat->node_spanned_pages = totalpages;
3247 3259
3248 realtotalpages = totalpages; 3260 realtotalpages = totalpages;
3249 for (i = 0; i < MAX_NR_ZONES; i++) 3261 for (i = 0; i < MAX_NR_ZONES; i++)
3250 realtotalpages -= 3262 realtotalpages -=
3251 zone_absent_pages_in_node(pgdat->node_id, i, 3263 zone_absent_pages_in_node(pgdat->node_id, i,
3252 zholes_size); 3264 zholes_size);
3253 pgdat->node_present_pages = realtotalpages; 3265 pgdat->node_present_pages = realtotalpages;
3254 printk(KERN_DEBUG "On node %d totalpages: %lu\n", pgdat->node_id, 3266 printk(KERN_DEBUG "On node %d totalpages: %lu\n", pgdat->node_id,
3255 realtotalpages); 3267 realtotalpages);
3256 } 3268 }
3257 3269
3258 #ifndef CONFIG_SPARSEMEM 3270 #ifndef CONFIG_SPARSEMEM
3259 /* 3271 /*
3260 * Calculate the size of the zone->blockflags rounded to an unsigned long 3272 * Calculate the size of the zone->blockflags rounded to an unsigned long
3261 * Start by making sure zonesize is a multiple of pageblock_nr_pages by rounding 3273 * Start by making sure zonesize is a multiple of pageblock_nr_pages by rounding
3262 * up. Then use NR_PAGEBLOCK_BITS worth of bits per pageblock, and finally 3274 * up. Then use NR_PAGEBLOCK_BITS worth of bits per pageblock, and finally
3263 * round what is now in bits up to the nearest long, then return the result in 3275 * round what is now in bits up to the nearest long, then return the result in
3264 * bytes. 3276 * bytes.
3265 */ 3277 */
3266 static unsigned long __init usemap_size(unsigned long zonesize) 3278 static unsigned long __init usemap_size(unsigned long zonesize)
3267 { 3279 {
3268 unsigned long usemapsize; 3280 unsigned long usemapsize;
3269 3281
3270 usemapsize = roundup(zonesize, pageblock_nr_pages); 3282 usemapsize = roundup(zonesize, pageblock_nr_pages);
3271 usemapsize = usemapsize >> pageblock_order; 3283 usemapsize = usemapsize >> pageblock_order;
3272 usemapsize *= NR_PAGEBLOCK_BITS; 3284 usemapsize *= NR_PAGEBLOCK_BITS;
3273 usemapsize = roundup(usemapsize, 8 * sizeof(unsigned long)); 3285 usemapsize = roundup(usemapsize, 8 * sizeof(unsigned long));
3274 3286
3275 return usemapsize / 8; 3287 return usemapsize / 8;
3276 } 3288 }
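To put numbers on this, here is a standalone sketch of the same arithmetic. The values pageblock_order = 9 (512-page pageblocks with 4 KB pages) and NR_PAGEBLOCK_BITS = 4 are typical but are assumptions of this sketch, not read from the configuration above:

#include <stdio.h>

#define PAGEBLOCK_ORDER         9                       /* assumed */
#define PAGEBLOCK_NR_PAGES      (1UL << PAGEBLOCK_ORDER)
#define NR_PAGEBLOCK_BITS       4                       /* assumed */
#define BITS_PER_LONG           (8 * sizeof(unsigned long))

static unsigned long roundup_ul(unsigned long x, unsigned long y)
{
        return ((x + y - 1) / y) * y;
}

static unsigned long usemap_size(unsigned long zonesize)
{
        unsigned long usemapsize;

        usemapsize = roundup_ul(zonesize, PAGEBLOCK_NR_PAGES); /* whole pageblocks */
        usemapsize >>= PAGEBLOCK_ORDER;                        /* number of pageblocks */
        usemapsize *= NR_PAGEBLOCK_BITS;                       /* bits of flags */
        usemapsize = roundup_ul(usemapsize, BITS_PER_LONG);    /* whole longs */

        return usemapsize / 8;                                 /* bytes */
}

int main(void)
{
        /* a 1048576-page (4 GB) zone: 2048 pageblocks, 4 bits each -> 1024 bytes */
        printf("%lu bytes of pageblock flags\n", usemap_size(1048576));
        return 0;
}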
3277 3289
3278 static void __init setup_usemap(struct pglist_data *pgdat, 3290 static void __init setup_usemap(struct pglist_data *pgdat,
3279 struct zone *zone, unsigned long zonesize) 3291 struct zone *zone, unsigned long zonesize)
3280 { 3292 {
3281 unsigned long usemapsize = usemap_size(zonesize); 3293 unsigned long usemapsize = usemap_size(zonesize);
3282 zone->pageblock_flags = NULL; 3294 zone->pageblock_flags = NULL;
3283 if (usemapsize) { 3295 if (usemapsize) {
3284 zone->pageblock_flags = alloc_bootmem_node(pgdat, usemapsize); 3296 zone->pageblock_flags = alloc_bootmem_node(pgdat, usemapsize);
3285 memset(zone->pageblock_flags, 0, usemapsize); 3297 memset(zone->pageblock_flags, 0, usemapsize);
3286 } 3298 }
3287 } 3299 }
3288 #else 3300 #else
3289 static void inline setup_usemap(struct pglist_data *pgdat, 3301 static void inline setup_usemap(struct pglist_data *pgdat,
3290 struct zone *zone, unsigned long zonesize) {} 3302 struct zone *zone, unsigned long zonesize) {}
3291 #endif /* CONFIG_SPARSEMEM */ 3303 #endif /* CONFIG_SPARSEMEM */
3292 3304
3293 #ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE 3305 #ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
3294 3306
3295 /* Return a sensible default order for the pageblock size. */ 3307 /* Return a sensible default order for the pageblock size. */
3296 static inline int pageblock_default_order(void) 3308 static inline int pageblock_default_order(void)
3297 { 3309 {
3298 if (HPAGE_SHIFT > PAGE_SHIFT) 3310 if (HPAGE_SHIFT > PAGE_SHIFT)
3299 return HUGETLB_PAGE_ORDER; 3311 return HUGETLB_PAGE_ORDER;
3300 3312
3301 return MAX_ORDER-1; 3313 return MAX_ORDER-1;
3302 } 3314 }
3303 3315
3304 /* Initialise the number of pages represented by NR_PAGEBLOCK_BITS */ 3316 /* Initialise the number of pages represented by NR_PAGEBLOCK_BITS */
3305 static inline void __init set_pageblock_order(unsigned int order) 3317 static inline void __init set_pageblock_order(unsigned int order)
3306 { 3318 {
3307 /* Check that pageblock_nr_pages has not already been setup */ 3319 /* Check that pageblock_nr_pages has not already been setup */
3308 if (pageblock_order) 3320 if (pageblock_order)
3309 return; 3321 return;
3310 3322
3311 /* 3323 /*
3312 * Assume the largest contiguous order of interest is a huge page. 3324 * Assume the largest contiguous order of interest is a huge page.
3313 * This value may be variable depending on boot parameters on IA64 3325 * This value may be variable depending on boot parameters on IA64
3314 */ 3326 */
3315 pageblock_order = order; 3327 pageblock_order = order;
3316 } 3328 }
3317 #else /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */ 3329 #else /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
3318 3330
3319 /* 3331 /*
3320 * When CONFIG_HUGETLB_PAGE_SIZE_VARIABLE is not set, set_pageblock_order() 3332 * When CONFIG_HUGETLB_PAGE_SIZE_VARIABLE is not set, set_pageblock_order()
3321 * and pageblock_default_order() are unused as pageblock_order is set 3333 * and pageblock_default_order() are unused as pageblock_order is set
3322 * at compile-time. See include/linux/pageblock-flags.h for the values of 3334 * at compile-time. See include/linux/pageblock-flags.h for the values of
3323 * pageblock_order based on the kernel config 3335 * pageblock_order based on the kernel config
3324 */ 3336 */
3325 static inline int pageblock_default_order(unsigned int order) 3337 static inline int pageblock_default_order(unsigned int order)
3326 { 3338 {
3327 return MAX_ORDER-1; 3339 return MAX_ORDER-1;
3328 } 3340 }
3329 #define set_pageblock_order(x) do {} while (0) 3341 #define set_pageblock_order(x) do {} while (0)
3330 3342
3331 #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */ 3343 #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
3332 3344
3333 /* 3345 /*
3334 * Set up the zone data structures: 3346 * Set up the zone data structures:
3335 * - mark all pages reserved 3347 * - mark all pages reserved
3336 * - mark all memory queues empty 3348 * - mark all memory queues empty
3337 * - clear the memory bitmaps 3349 * - clear the memory bitmaps
3338 */ 3350 */
3339 static void __paginginit free_area_init_core(struct pglist_data *pgdat, 3351 static void __paginginit free_area_init_core(struct pglist_data *pgdat,
3340 unsigned long *zones_size, unsigned long *zholes_size) 3352 unsigned long *zones_size, unsigned long *zholes_size)
3341 { 3353 {
3342 enum zone_type j; 3354 enum zone_type j;
3343 int nid = pgdat->node_id; 3355 int nid = pgdat->node_id;
3344 unsigned long zone_start_pfn = pgdat->node_start_pfn; 3356 unsigned long zone_start_pfn = pgdat->node_start_pfn;
3345 int ret; 3357 int ret;
3346 3358
3347 pgdat_resize_init(pgdat); 3359 pgdat_resize_init(pgdat);
3348 pgdat->nr_zones = 0; 3360 pgdat->nr_zones = 0;
3349 init_waitqueue_head(&pgdat->kswapd_wait); 3361 init_waitqueue_head(&pgdat->kswapd_wait);
3350 pgdat->kswapd_max_order = 0; 3362 pgdat->kswapd_max_order = 0;
3351 3363
3352 for (j = 0; j < MAX_NR_ZONES; j++) { 3364 for (j = 0; j < MAX_NR_ZONES; j++) {
3353 struct zone *zone = pgdat->node_zones + j; 3365 struct zone *zone = pgdat->node_zones + j;
3354 unsigned long size, realsize, memmap_pages; 3366 unsigned long size, realsize, memmap_pages;
3355 3367
3356 size = zone_spanned_pages_in_node(nid, j, zones_size); 3368 size = zone_spanned_pages_in_node(nid, j, zones_size);
3357 realsize = size - zone_absent_pages_in_node(nid, j, 3369 realsize = size - zone_absent_pages_in_node(nid, j,
3358 zholes_size); 3370 zholes_size);
3359 3371
3360 /* 3372 /*
3361 * Adjust realsize so that it accounts for how much memory 3373 * Adjust realsize so that it accounts for how much memory
3362 * is used by this zone for memmap. This affects the watermark 3374 * is used by this zone for memmap. This affects the watermark
3363 * and per-cpu initialisations 3375 * and per-cpu initialisations
3364 */ 3376 */
3365 memmap_pages = (size * sizeof(struct page)) >> PAGE_SHIFT; 3377 memmap_pages = (size * sizeof(struct page)) >> PAGE_SHIFT;
3366 if (realsize >= memmap_pages) { 3378 if (realsize >= memmap_pages) {
3367 realsize -= memmap_pages; 3379 realsize -= memmap_pages;
3368 printk(KERN_DEBUG 3380 printk(KERN_DEBUG
3369 " %s zone: %lu pages used for memmap\n", 3381 " %s zone: %lu pages used for memmap\n",
3370 zone_names[j], memmap_pages); 3382 zone_names[j], memmap_pages);
3371 } else 3383 } else
3372 printk(KERN_WARNING 3384 printk(KERN_WARNING
3373 " %s zone: %lu pages exceeds realsize %lu\n", 3385 " %s zone: %lu pages exceeds realsize %lu\n",
3374 zone_names[j], memmap_pages, realsize); 3386 zone_names[j], memmap_pages, realsize);
3375 3387
3376 /* Account for reserved pages */ 3388 /* Account for reserved pages */
3377 if (j == 0 && realsize > dma_reserve) { 3389 if (j == 0 && realsize > dma_reserve) {
3378 realsize -= dma_reserve; 3390 realsize -= dma_reserve;
3379 printk(KERN_DEBUG " %s zone: %lu pages reserved\n", 3391 printk(KERN_DEBUG " %s zone: %lu pages reserved\n",
3380 zone_names[0], dma_reserve); 3392 zone_names[0], dma_reserve);
3381 } 3393 }
3382 3394
3383 if (!is_highmem_idx(j)) 3395 if (!is_highmem_idx(j))
3384 nr_kernel_pages += realsize; 3396 nr_kernel_pages += realsize;
3385 nr_all_pages += realsize; 3397 nr_all_pages += realsize;
3386 3398
3387 zone->spanned_pages = size; 3399 zone->spanned_pages = size;
3388 zone->present_pages = realsize; 3400 zone->present_pages = realsize;
3389 #ifdef CONFIG_NUMA 3401 #ifdef CONFIG_NUMA
3390 zone->node = nid; 3402 zone->node = nid;
3391 zone->min_unmapped_pages = (realsize*sysctl_min_unmapped_ratio) 3403 zone->min_unmapped_pages = (realsize*sysctl_min_unmapped_ratio)
3392 / 100; 3404 / 100;
3393 zone->min_slab_pages = (realsize * sysctl_min_slab_ratio) / 100; 3405 zone->min_slab_pages = (realsize * sysctl_min_slab_ratio) / 100;
3394 #endif 3406 #endif
3395 zone->name = zone_names[j]; 3407 zone->name = zone_names[j];
3396 spin_lock_init(&zone->lock); 3408 spin_lock_init(&zone->lock);
3397 spin_lock_init(&zone->lru_lock); 3409 spin_lock_init(&zone->lru_lock);
3398 zone_seqlock_init(zone); 3410 zone_seqlock_init(zone);
3399 zone->zone_pgdat = pgdat; 3411 zone->zone_pgdat = pgdat;
3400 3412
3401 zone->prev_priority = DEF_PRIORITY; 3413 zone->prev_priority = DEF_PRIORITY;
3402 3414
3403 zone_pcp_init(zone); 3415 zone_pcp_init(zone);
3404 INIT_LIST_HEAD(&zone->active_list); 3416 INIT_LIST_HEAD(&zone->active_list);
3405 INIT_LIST_HEAD(&zone->inactive_list); 3417 INIT_LIST_HEAD(&zone->inactive_list);
3406 zone->nr_scan_active = 0; 3418 zone->nr_scan_active = 0;
3407 zone->nr_scan_inactive = 0; 3419 zone->nr_scan_inactive = 0;
3408 zap_zone_vm_stats(zone); 3420 zap_zone_vm_stats(zone);
3409 zone->flags = 0; 3421 zone->flags = 0;
3410 if (!size) 3422 if (!size)
3411 continue; 3423 continue;
3412 3424
3413 set_pageblock_order(pageblock_default_order()); 3425 set_pageblock_order(pageblock_default_order());
3414 setup_usemap(pgdat, zone, size); 3426 setup_usemap(pgdat, zone, size);
3415 ret = init_currently_empty_zone(zone, zone_start_pfn, 3427 ret = init_currently_empty_zone(zone, zone_start_pfn,
3416 size, MEMMAP_EARLY); 3428 size, MEMMAP_EARLY);
3417 BUG_ON(ret); 3429 BUG_ON(ret);
3418 zone_start_pfn += size; 3430 zone_start_pfn += size;
3419 } 3431 }
3420 } 3432 }
3421 3433
3422 static void __init_refok alloc_node_mem_map(struct pglist_data *pgdat) 3434 static void __init_refok alloc_node_mem_map(struct pglist_data *pgdat)
3423 { 3435 {
3424 /* Skip empty nodes */ 3436 /* Skip empty nodes */
3425 if (!pgdat->node_spanned_pages) 3437 if (!pgdat->node_spanned_pages)
3426 return; 3438 return;
3427 3439
3428 #ifdef CONFIG_FLAT_NODE_MEM_MAP 3440 #ifdef CONFIG_FLAT_NODE_MEM_MAP
3429 /* ia64 gets its own node_mem_map, before this, without bootmem */ 3441 /* ia64 gets its own node_mem_map, before this, without bootmem */
3430 if (!pgdat->node_mem_map) { 3442 if (!pgdat->node_mem_map) {
3431 unsigned long size, start, end; 3443 unsigned long size, start, end;
3432 struct page *map; 3444 struct page *map;
3433 3445
3434 /* 3446 /*
3435 * The zone's endpoints aren't required to be MAX_ORDER 3447 * The zone's endpoints aren't required to be MAX_ORDER
3436 * aligned, but the node_mem_map endpoints must be MAX_ORDER aligned 3448 * aligned, but the node_mem_map endpoints must be MAX_ORDER aligned
3437 * for the buddy allocator to function correctly. 3449 * for the buddy allocator to function correctly.
3438 */ 3450 */
3439 start = pgdat->node_start_pfn & ~(MAX_ORDER_NR_PAGES - 1); 3451 start = pgdat->node_start_pfn & ~(MAX_ORDER_NR_PAGES - 1);
3440 end = pgdat->node_start_pfn + pgdat->node_spanned_pages; 3452 end = pgdat->node_start_pfn + pgdat->node_spanned_pages;
3441 end = ALIGN(end, MAX_ORDER_NR_PAGES); 3453 end = ALIGN(end, MAX_ORDER_NR_PAGES);
3442 size = (end - start) * sizeof(struct page); 3454 size = (end - start) * sizeof(struct page);
3443 map = alloc_remap(pgdat->node_id, size); 3455 map = alloc_remap(pgdat->node_id, size);
3444 if (!map) 3456 if (!map)
3445 map = alloc_bootmem_node(pgdat, size); 3457 map = alloc_bootmem_node(pgdat, size);
3446 pgdat->node_mem_map = map + (pgdat->node_start_pfn - start); 3458 pgdat->node_mem_map = map + (pgdat->node_start_pfn - start);
3447 } 3459 }
3448 #ifndef CONFIG_NEED_MULTIPLE_NODES 3460 #ifndef CONFIG_NEED_MULTIPLE_NODES
3449 /* 3461 /*
3450 * With no DISCONTIG, the global mem_map is just set as node 0's 3462 * With no DISCONTIG, the global mem_map is just set as node 0's
3451 */ 3463 */
3452 if (pgdat == NODE_DATA(0)) { 3464 if (pgdat == NODE_DATA(0)) {
3453 mem_map = NODE_DATA(0)->node_mem_map; 3465 mem_map = NODE_DATA(0)->node_mem_map;
3454 #ifdef CONFIG_ARCH_POPULATES_NODE_MAP 3466 #ifdef CONFIG_ARCH_POPULATES_NODE_MAP
3455 if (page_to_pfn(mem_map) != pgdat->node_start_pfn) 3467 if (page_to_pfn(mem_map) != pgdat->node_start_pfn)
3456 mem_map -= (pgdat->node_start_pfn - ARCH_PFN_OFFSET); 3468 mem_map -= (pgdat->node_start_pfn - ARCH_PFN_OFFSET);
3457 #endif /* CONFIG_ARCH_POPULATES_NODE_MAP */ 3469 #endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
3458 } 3470 }
3459 #endif 3471 #endif
3460 #endif /* CONFIG_FLAT_NODE_MEM_MAP */ 3472 #endif /* CONFIG_FLAT_NODE_MEM_MAP */
3461 } 3473 }
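The alignment arithmetic at the top of the CONFIG_FLAT_NODE_MEM_MAP branch can be illustrated in isolation. MAX_ORDER = 11, giving MAX_ORDER_NR_PAGES = 1024, is the common default but an assumption of this sketch, as are the sample node PFNs:

#include <stdio.h>

#define MAX_ORDER               11                              /* assumed default */
#define MAX_ORDER_NR_PAGES      (1UL << (MAX_ORDER - 1))        /* 1024 pages */
#define ALIGN_UP(x, a)          (((x) + (a) - 1) & ~((a) - 1))

int main(void)
{
        unsigned long node_start_pfn = 0x10203;                 /* deliberately unaligned */
        unsigned long node_spanned_pages = 0x7f00;
        unsigned long start, end;

        start = node_start_pfn & ~(MAX_ORDER_NR_PAGES - 1);     /* round start down */
        end = ALIGN_UP(node_start_pfn + node_spanned_pages,     /* round end up */
                       MAX_ORDER_NR_PAGES);

        printf("node_mem_map covers pfns %lx-%lx (%lu struct pages)\n",
               start, end, end - start);
        return 0;
}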
3462 3474
3463 void __paginginit free_area_init_node(int nid, struct pglist_data *pgdat, 3475 void __paginginit free_area_init_node(int nid, struct pglist_data *pgdat,
3464 unsigned long *zones_size, unsigned long node_start_pfn, 3476 unsigned long *zones_size, unsigned long node_start_pfn,
3465 unsigned long *zholes_size) 3477 unsigned long *zholes_size)
3466 { 3478 {
3467 pgdat->node_id = nid; 3479 pgdat->node_id = nid;
3468 pgdat->node_start_pfn = node_start_pfn; 3480 pgdat->node_start_pfn = node_start_pfn;
3469 calculate_node_totalpages(pgdat, zones_size, zholes_size); 3481 calculate_node_totalpages(pgdat, zones_size, zholes_size);
3470 3482
3471 alloc_node_mem_map(pgdat); 3483 alloc_node_mem_map(pgdat);
3472 3484
3473 free_area_init_core(pgdat, zones_size, zholes_size); 3485 free_area_init_core(pgdat, zones_size, zholes_size);
3474 } 3486 }
3475 3487
3476 #ifdef CONFIG_ARCH_POPULATES_NODE_MAP 3488 #ifdef CONFIG_ARCH_POPULATES_NODE_MAP
3477 3489
3478 #if MAX_NUMNODES > 1 3490 #if MAX_NUMNODES > 1
3479 /* 3491 /*
3480 * Figure out the number of possible node ids. 3492 * Figure out the number of possible node ids.
3481 */ 3493 */
3482 static void __init setup_nr_node_ids(void) 3494 static void __init setup_nr_node_ids(void)
3483 { 3495 {
3484 unsigned int node; 3496 unsigned int node;
3485 unsigned int highest = 0; 3497 unsigned int highest = 0;
3486 3498
3487 for_each_node_mask(node, node_possible_map) 3499 for_each_node_mask(node, node_possible_map)
3488 highest = node; 3500 highest = node;
3489 nr_node_ids = highest + 1; 3501 nr_node_ids = highest + 1;
3490 } 3502 }
3491 #else 3503 #else
3492 static inline void setup_nr_node_ids(void) 3504 static inline void setup_nr_node_ids(void)
3493 { 3505 {
3494 } 3506 }
3495 #endif 3507 #endif
3496 3508
3497 /** 3509 /**
3498 * add_active_range - Register a range of PFNs backed by physical memory 3510 * add_active_range - Register a range of PFNs backed by physical memory
3499 * @nid: The node ID the range resides on 3511 * @nid: The node ID the range resides on
3500 * @start_pfn: The start PFN of the available physical memory 3512 * @start_pfn: The start PFN of the available physical memory
3501 * @end_pfn: The end PFN of the available physical memory 3513 * @end_pfn: The end PFN of the available physical memory
3502 * 3514 *
3503 * These ranges are stored in an early_node_map[] and later used by 3515 * These ranges are stored in an early_node_map[] and later used by
3504 * free_area_init_nodes() to calculate zone sizes and holes. If the 3516 * free_area_init_nodes() to calculate zone sizes and holes. If the
3505 * range spans a memory hole, it is up to the architecture to ensure 3517 * range spans a memory hole, it is up to the architecture to ensure
3506 * the memory is not freed by the bootmem allocator. If possible 3518 * the memory is not freed by the bootmem allocator. If possible
3507 * the range being registered will be merged with existing ranges. 3519 * the range being registered will be merged with existing ranges.
3508 */ 3520 */
3509 void __init add_active_range(unsigned int nid, unsigned long start_pfn, 3521 void __init add_active_range(unsigned int nid, unsigned long start_pfn,
3510 unsigned long end_pfn) 3522 unsigned long end_pfn)
3511 { 3523 {
3512 int i; 3524 int i;
3513 3525
3514 printk(KERN_DEBUG "Entering add_active_range(%d, %lu, %lu) " 3526 printk(KERN_DEBUG "Entering add_active_range(%d, %lu, %lu) "
3515 "%d entries of %d used\n", 3527 "%d entries of %d used\n",
3516 nid, start_pfn, end_pfn, 3528 nid, start_pfn, end_pfn,
3517 nr_nodemap_entries, MAX_ACTIVE_REGIONS); 3529 nr_nodemap_entries, MAX_ACTIVE_REGIONS);
3518 3530
3519 /* Merge with existing active regions if possible */ 3531 /* Merge with existing active regions if possible */
3520 for (i = 0; i < nr_nodemap_entries; i++) { 3532 for (i = 0; i < nr_nodemap_entries; i++) {
3521 if (early_node_map[i].nid != nid) 3533 if (early_node_map[i].nid != nid)
3522 continue; 3534 continue;
3523 3535
3524 /* Skip if an existing region covers this new one */ 3536 /* Skip if an existing region covers this new one */
3525 if (start_pfn >= early_node_map[i].start_pfn && 3537 if (start_pfn >= early_node_map[i].start_pfn &&
3526 end_pfn <= early_node_map[i].end_pfn) 3538 end_pfn <= early_node_map[i].end_pfn)
3527 return; 3539 return;
3528 3540
3529 /* Merge forward if suitable */ 3541 /* Merge forward if suitable */
3530 if (start_pfn <= early_node_map[i].end_pfn && 3542 if (start_pfn <= early_node_map[i].end_pfn &&
3531 end_pfn > early_node_map[i].end_pfn) { 3543 end_pfn > early_node_map[i].end_pfn) {
3532 early_node_map[i].end_pfn = end_pfn; 3544 early_node_map[i].end_pfn = end_pfn;
3533 return; 3545 return;
3534 } 3546 }
3535 3547
3536 /* Merge backward if suitable */ 3548 /* Merge backward if suitable */
3537 if (start_pfn < early_node_map[i].end_pfn && 3549 if (start_pfn < early_node_map[i].end_pfn &&
3538 end_pfn >= early_node_map[i].start_pfn) { 3550 end_pfn >= early_node_map[i].start_pfn) {
3539 early_node_map[i].start_pfn = start_pfn; 3551 early_node_map[i].start_pfn = start_pfn;
3540 return; 3552 return;
3541 } 3553 }
3542 } 3554 }
3543 3555
3544 /* Check that early_node_map is large enough */ 3556 /* Check that early_node_map is large enough */
3545 if (i >= MAX_ACTIVE_REGIONS) { 3557 if (i >= MAX_ACTIVE_REGIONS) {
3546 printk(KERN_CRIT "More than %d memory regions, truncating\n", 3558 printk(KERN_CRIT "More than %d memory regions, truncating\n",
3547 MAX_ACTIVE_REGIONS); 3559 MAX_ACTIVE_REGIONS);
3548 return; 3560 return;
3549 } 3561 }
3550 3562
3551 early_node_map[i].nid = nid; 3563 early_node_map[i].nid = nid;
3552 early_node_map[i].start_pfn = start_pfn; 3564 early_node_map[i].start_pfn = start_pfn;
3553 early_node_map[i].end_pfn = end_pfn; 3565 early_node_map[i].end_pfn = end_pfn;
3554 nr_nodemap_entries = i + 1; 3566 nr_nodemap_entries = i + 1;
3555 } 3567 }
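The merge rules are easiest to see on a toy map. The sketch below mirrors the checks above in a standalone program; the fixed-size map and the sample PFNs are invented for the example:

#include <stdio.h>

#define MAX_ACTIVE_REGIONS 8

struct region { unsigned long start_pfn, end_pfn; int nid; };

static struct region map[MAX_ACTIVE_REGIONS];
static int nr_entries;

static void add_active_range(int nid, unsigned long start_pfn, unsigned long end_pfn)
{
        int i;

        for (i = 0; i < nr_entries; i++) {
                if (map[i].nid != nid)
                        continue;
                /* skip if an existing region covers this new one */
                if (start_pfn >= map[i].start_pfn && end_pfn <= map[i].end_pfn)
                        return;
                /* merge forward */
                if (start_pfn <= map[i].end_pfn && end_pfn > map[i].end_pfn) {
                        map[i].end_pfn = end_pfn;
                        return;
                }
                /* merge backward */
                if (start_pfn < map[i].end_pfn && end_pfn >= map[i].start_pfn) {
                        map[i].start_pfn = start_pfn;
                        return;
                }
        }

        if (i >= MAX_ACTIVE_REGIONS)
                return;                 /* map full: new range is dropped */

        map[i].nid = nid;
        map[i].start_pfn = start_pfn;
        map[i].end_pfn = end_pfn;
        nr_entries = i + 1;
}

int main(void)
{
        int i;

        add_active_range(0, 0x000, 0x100);
        add_active_range(0, 0x100, 0x200);      /* merges forward into 0x000-0x200 */
        add_active_range(0, 0x300, 0x400);      /* disjoint, becomes a second entry */

        for (i = 0; i < nr_entries; i++)
                printf("nid %d: %lx-%lx\n", map[i].nid,
                       map[i].start_pfn, map[i].end_pfn);
        return 0;
}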
3556 3568
3557 /** 3569 /**
3558 * shrink_active_range - Shrink an existing registered range of PFNs 3570 * shrink_active_range - Shrink an existing registered range of PFNs
3559 * @nid: The node id the range is on that should be shrunk 3571 * @nid: The node id the range is on that should be shrunk
3560 * @old_end_pfn: The old end PFN of the range 3572 * @old_end_pfn: The old end PFN of the range
3561 * @new_end_pfn: The new end PFN of the range 3573 * @new_end_pfn: The new end PFN of the range
3562 * 3574 *
3563 * i386 with NUMA uses alloc_remap() to store a node_mem_map on a local node. 3575 * i386 with NUMA uses alloc_remap() to store a node_mem_map on a local node.
3564 * The map is kept at the end of the physical page range that has already been 3576 * The map is kept at the end of the physical page range that has already been
3565 * registered with add_active_range(). This function allows an arch to shrink 3577 * registered with add_active_range(). This function allows an arch to shrink
3566 * an existing registered range. 3578 * an existing registered range.
3567 */ 3579 */
3568 void __init shrink_active_range(unsigned int nid, unsigned long old_end_pfn, 3580 void __init shrink_active_range(unsigned int nid, unsigned long old_end_pfn,
3569 unsigned long new_end_pfn) 3581 unsigned long new_end_pfn)
3570 { 3582 {
3571 int i; 3583 int i;
3572 3584
3573 /* Find the old active region end and shrink */ 3585 /* Find the old active region end and shrink */
3574 for_each_active_range_index_in_nid(i, nid) 3586 for_each_active_range_index_in_nid(i, nid)
3575 if (early_node_map[i].end_pfn == old_end_pfn) { 3587 if (early_node_map[i].end_pfn == old_end_pfn) {
3576 early_node_map[i].end_pfn = new_end_pfn; 3588 early_node_map[i].end_pfn = new_end_pfn;
3577 break; 3589 break;
3578 } 3590 }
3579 } 3591 }
3580 3592
3581 /** 3593 /**
3582 * remove_all_active_ranges - Remove all currently registered regions 3594 * remove_all_active_ranges - Remove all currently registered regions
3583 * 3595 *
3584 * During discovery, it may be found that a table like SRAT is invalid 3596 * During discovery, it may be found that a table like SRAT is invalid
3585 * and an alternative discovery method must be used. This function removes 3597 * and an alternative discovery method must be used. This function removes
3586 * all currently registered regions. 3598 * all currently registered regions.
3587 */ 3599 */
3588 void __init remove_all_active_ranges(void) 3600 void __init remove_all_active_ranges(void)
3589 { 3601 {
3590 memset(early_node_map, 0, sizeof(early_node_map)); 3602 memset(early_node_map, 0, sizeof(early_node_map));
3591 nr_nodemap_entries = 0; 3603 nr_nodemap_entries = 0;
3592 #ifdef CONFIG_MEMORY_HOTPLUG_RESERVE 3604 #ifdef CONFIG_MEMORY_HOTPLUG_RESERVE
3593 memset(node_boundary_start_pfn, 0, sizeof(node_boundary_start_pfn)); 3605 memset(node_boundary_start_pfn, 0, sizeof(node_boundary_start_pfn));
3594 memset(node_boundary_end_pfn, 0, sizeof(node_boundary_end_pfn)); 3606 memset(node_boundary_end_pfn, 0, sizeof(node_boundary_end_pfn));
3595 #endif /* CONFIG_MEMORY_HOTPLUG_RESERVE */ 3607 #endif /* CONFIG_MEMORY_HOTPLUG_RESERVE */
3596 } 3608 }
3597 3609
3598 /* Compare two active node_active_regions */ 3610 /* Compare two active node_active_regions */
3599 static int __init cmp_node_active_region(const void *a, const void *b) 3611 static int __init cmp_node_active_region(const void *a, const void *b)
3600 { 3612 {
3601 struct node_active_region *arange = (struct node_active_region *)a; 3613 struct node_active_region *arange = (struct node_active_region *)a;
3602 struct node_active_region *brange = (struct node_active_region *)b; 3614 struct node_active_region *brange = (struct node_active_region *)b;
3603 3615
3604 /* Done this way to avoid overflows */ 3616 /* Done this way to avoid overflows */
3605 if (arange->start_pfn > brange->start_pfn) 3617 if (arange->start_pfn > brange->start_pfn)
3606 return 1; 3618 return 1;
3607 if (arange->start_pfn < brange->start_pfn) 3619 if (arange->start_pfn < brange->start_pfn)
3608 return -1; 3620 return -1;
3609 3621
3610 return 0; 3622 return 0;
3611 } 3623 }
3612 3624
3613 /* sort the node_map by start_pfn */ 3625 /* sort the node_map by start_pfn */
3614 static void __init sort_node_map(void) 3626 static void __init sort_node_map(void)
3615 { 3627 {
3616 sort(early_node_map, (size_t)nr_nodemap_entries, 3628 sort(early_node_map, (size_t)nr_nodemap_entries,
3617 sizeof(struct node_active_region), 3629 sizeof(struct node_active_region),
3618 cmp_node_active_region, NULL); 3630 cmp_node_active_region, NULL);
3619 } 3631 }
3620 3632
3621 /* Find the lowest pfn for a node */ 3633 /* Find the lowest pfn for a node */
3622 unsigned long __init find_min_pfn_for_node(unsigned long nid) 3634 unsigned long __init find_min_pfn_for_node(unsigned long nid)
3623 { 3635 {
3624 int i; 3636 int i;
3625 unsigned long min_pfn = ULONG_MAX; 3637 unsigned long min_pfn = ULONG_MAX;
3626 3638
3627 /* Assuming a sorted map, the first range found has the starting pfn */ 3639 /* Assuming a sorted map, the first range found has the starting pfn */
3628 for_each_active_range_index_in_nid(i, nid) 3640 for_each_active_range_index_in_nid(i, nid)
3629 min_pfn = min(min_pfn, early_node_map[i].start_pfn); 3641 min_pfn = min(min_pfn, early_node_map[i].start_pfn);
3630 3642
3631 if (min_pfn == ULONG_MAX) { 3643 if (min_pfn == ULONG_MAX) {
3632 printk(KERN_WARNING 3644 printk(KERN_WARNING
3633 "Could not find start_pfn for node %lu\n", nid); 3645 "Could not find start_pfn for node %lu\n", nid);
3634 return 0; 3646 return 0;
3635 } 3647 }
3636 3648
3637 return min_pfn; 3649 return min_pfn;
3638 } 3650 }
3639 3651
3640 /** 3652 /**
3641 * find_min_pfn_with_active_regions - Find the minimum PFN registered 3653 * find_min_pfn_with_active_regions - Find the minimum PFN registered
3642 * 3654 *
3643 * It returns the minimum PFN based on information provided via 3655 * It returns the minimum PFN based on information provided via
3644 * add_active_range(). 3656 * add_active_range().
3645 */ 3657 */
3646 unsigned long __init find_min_pfn_with_active_regions(void) 3658 unsigned long __init find_min_pfn_with_active_regions(void)
3647 { 3659 {
3648 return find_min_pfn_for_node(MAX_NUMNODES); 3660 return find_min_pfn_for_node(MAX_NUMNODES);
3649 } 3661 }
3650 3662
3651 /** 3663 /**
3652 * find_max_pfn_with_active_regions - Find the maximum PFN registered 3664 * find_max_pfn_with_active_regions - Find the maximum PFN registered
3653 * 3665 *
3654 * It returns the maximum PFN based on information provided via 3666 * It returns the maximum PFN based on information provided via
3655 * add_active_range(). 3667 * add_active_range().
3656 */ 3668 */
3657 unsigned long __init find_max_pfn_with_active_regions(void) 3669 unsigned long __init find_max_pfn_with_active_regions(void)
3658 { 3670 {
3659 int i; 3671 int i;
3660 unsigned long max_pfn = 0; 3672 unsigned long max_pfn = 0;
3661 3673
3662 for (i = 0; i < nr_nodemap_entries; i++) 3674 for (i = 0; i < nr_nodemap_entries; i++)
3663 max_pfn = max(max_pfn, early_node_map[i].end_pfn); 3675 max_pfn = max(max_pfn, early_node_map[i].end_pfn);
3664 3676
3665 return max_pfn; 3677 return max_pfn;
3666 } 3678 }
3667 3679
3668 /* 3680 /*
3669 * early_calculate_totalpages() 3681 * early_calculate_totalpages()
3670 * Sum pages in active regions for movable zone. 3682 * Sum pages in active regions for movable zone.
3671 * Populate N_HIGH_MEMORY for calculating usable_nodes. 3683 * Populate N_HIGH_MEMORY for calculating usable_nodes.
3672 */ 3684 */
3673 static unsigned long __init early_calculate_totalpages(void) 3685 static unsigned long __init early_calculate_totalpages(void)
3674 { 3686 {
3675 int i; 3687 int i;
3676 unsigned long totalpages = 0; 3688 unsigned long totalpages = 0;
3677 3689
3678 for (i = 0; i < nr_nodemap_entries; i++) { 3690 for (i = 0; i < nr_nodemap_entries; i++) {
3679 unsigned long pages = early_node_map[i].end_pfn - 3691 unsigned long pages = early_node_map[i].end_pfn -
3680 early_node_map[i].start_pfn; 3692 early_node_map[i].start_pfn;
3681 totalpages += pages; 3693 totalpages += pages;
3682 if (pages) 3694 if (pages)
3683 node_set_state(early_node_map[i].nid, N_HIGH_MEMORY); 3695 node_set_state(early_node_map[i].nid, N_HIGH_MEMORY);
3684 } 3696 }
3685 return totalpages; 3697 return totalpages;
3686 } 3698 }
3687 3699
3688 /* 3700 /*
3689 * Find the PFN the Movable zone begins in each node. Kernel memory 3701 * Find the PFN the Movable zone begins in each node. Kernel memory
3690 * is spread evenly between nodes as long as the nodes have enough 3702 * is spread evenly between nodes as long as the nodes have enough
3691 * memory. When they don't, some nodes will have more kernelcore than 3703 * memory. When they don't, some nodes will have more kernelcore than
3692 * others 3704 * others
3693 */ 3705 */
3694 void __init find_zone_movable_pfns_for_nodes(unsigned long *movable_pfn) 3706 void __init find_zone_movable_pfns_for_nodes(unsigned long *movable_pfn)
3695 { 3707 {
3696 int i, nid; 3708 int i, nid;
3697 unsigned long usable_startpfn; 3709 unsigned long usable_startpfn;
3698 unsigned long kernelcore_node, kernelcore_remaining; 3710 unsigned long kernelcore_node, kernelcore_remaining;
3699 unsigned long totalpages = early_calculate_totalpages(); 3711 unsigned long totalpages = early_calculate_totalpages();
3700 int usable_nodes = nodes_weight(node_states[N_HIGH_MEMORY]); 3712 int usable_nodes = nodes_weight(node_states[N_HIGH_MEMORY]);
3701 3713
3702 /* 3714 /*
3703 * If movablecore was specified, calculate the corresponding size of 3715 * If movablecore was specified, calculate the corresponding size of
3704 * kernelcore so that memory usable for 3716 * kernelcore so that memory usable for
3705 * any allocation type is evenly spread. If both kernelcore 3717 * any allocation type is evenly spread. If both kernelcore
3706 * and movablecore are specified, then the value of kernelcore 3718 * and movablecore are specified, then the value of kernelcore
3707 * will be used for required_kernelcore if it's greater than 3719 * will be used for required_kernelcore if it's greater than
3708 * what movablecore would have allowed. 3720 * what movablecore would have allowed.
3709 */ 3721 */
3710 if (required_movablecore) { 3722 if (required_movablecore) {
3711 unsigned long corepages; 3723 unsigned long corepages;
3712 3724
3713 /* 3725 /*
3714 * Round-up so that ZONE_MOVABLE is at least as large as what 3726 * Round-up so that ZONE_MOVABLE is at least as large as what
3715 * was requested by the user 3727 * was requested by the user
3716 */ 3728 */
3717 required_movablecore = 3729 required_movablecore =
3718 roundup(required_movablecore, MAX_ORDER_NR_PAGES); 3730 roundup(required_movablecore, MAX_ORDER_NR_PAGES);
3719 corepages = totalpages - required_movablecore; 3731 corepages = totalpages - required_movablecore;
3720 3732
3721 required_kernelcore = max(required_kernelcore, corepages); 3733 required_kernelcore = max(required_kernelcore, corepages);
3722 } 3734 }
3723 3735
3724 /* If kernelcore was not specified, there is no ZONE_MOVABLE */ 3736 /* If kernelcore was not specified, there is no ZONE_MOVABLE */
3725 if (!required_kernelcore) 3737 if (!required_kernelcore)
3726 return; 3738 return;
3727 3739
3728 /* usable_startpfn is the lowest possible pfn ZONE_MOVABLE can be at */ 3740 /* usable_startpfn is the lowest possible pfn ZONE_MOVABLE can be at */
3729 find_usable_zone_for_movable(); 3741 find_usable_zone_for_movable();
3730 usable_startpfn = arch_zone_lowest_possible_pfn[movable_zone]; 3742 usable_startpfn = arch_zone_lowest_possible_pfn[movable_zone];
3731 3743
3732 restart: 3744 restart:
3733 /* Spread kernelcore memory as evenly as possible throughout nodes */ 3745 /* Spread kernelcore memory as evenly as possible throughout nodes */
3734 kernelcore_node = required_kernelcore / usable_nodes; 3746 kernelcore_node = required_kernelcore / usable_nodes;
3735 for_each_node_state(nid, N_HIGH_MEMORY) { 3747 for_each_node_state(nid, N_HIGH_MEMORY) {
3736 /* 3748 /*
3737 * Recalculate kernelcore_node if the division per node 3749 * Recalculate kernelcore_node if the division per node
3738 * now exceeds what is necessary to satisfy the requested 3750 * now exceeds what is necessary to satisfy the requested
3739 * amount of memory for the kernel 3751 * amount of memory for the kernel
3740 */ 3752 */
3741 if (required_kernelcore < kernelcore_node) 3753 if (required_kernelcore < kernelcore_node)
3742 kernelcore_node = required_kernelcore / usable_nodes; 3754 kernelcore_node = required_kernelcore / usable_nodes;
3743 3755
3744 /* 3756 /*
3745 * As the map is walked, we track how much memory is usable 3757 * As the map is walked, we track how much memory is usable
3746 * by the kernel using kernelcore_remaining. When it is 3758 * by the kernel using kernelcore_remaining. When it is
3747 * 0, the rest of the node is usable by ZONE_MOVABLE 3759 * 0, the rest of the node is usable by ZONE_MOVABLE
3748 */ 3760 */
3749 kernelcore_remaining = kernelcore_node; 3761 kernelcore_remaining = kernelcore_node;
3750 3762
3751 /* Go through each range of PFNs within this node */ 3763 /* Go through each range of PFNs within this node */
3752 for_each_active_range_index_in_nid(i, nid) { 3764 for_each_active_range_index_in_nid(i, nid) {
3753 unsigned long start_pfn, end_pfn; 3765 unsigned long start_pfn, end_pfn;
3754 unsigned long size_pages; 3766 unsigned long size_pages;
3755 3767
3756 start_pfn = max(early_node_map[i].start_pfn, 3768 start_pfn = max(early_node_map[i].start_pfn,
3757 zone_movable_pfn[nid]); 3769 zone_movable_pfn[nid]);
3758 end_pfn = early_node_map[i].end_pfn; 3770 end_pfn = early_node_map[i].end_pfn;
3759 if (start_pfn >= end_pfn) 3771 if (start_pfn >= end_pfn)
3760 continue; 3772 continue;
3761 3773
3762 /* Account for what is only usable for kernelcore */ 3774 /* Account for what is only usable for kernelcore */
3763 if (start_pfn < usable_startpfn) { 3775 if (start_pfn < usable_startpfn) {
3764 unsigned long kernel_pages; 3776 unsigned long kernel_pages;
3765 kernel_pages = min(end_pfn, usable_startpfn) 3777 kernel_pages = min(end_pfn, usable_startpfn)
3766 - start_pfn; 3778 - start_pfn;
3767 3779
3768 kernelcore_remaining -= min(kernel_pages, 3780 kernelcore_remaining -= min(kernel_pages,
3769 kernelcore_remaining); 3781 kernelcore_remaining);
3770 required_kernelcore -= min(kernel_pages, 3782 required_kernelcore -= min(kernel_pages,
3771 required_kernelcore); 3783 required_kernelcore);
3772 3784
3773 /* Continue if range is now fully accounted */ 3785 /* Continue if range is now fully accounted */
3774 if (end_pfn <= usable_startpfn) { 3786 if (end_pfn <= usable_startpfn) {
3775 3787
3776 /* 3788 /*
3777 * Push zone_movable_pfn to the end so 3789 * Push zone_movable_pfn to the end so
3778 * that if we have to rebalance 3790 * that if we have to rebalance
3779 * kernelcore across nodes, we will 3791 * kernelcore across nodes, we will
3780 * not double account here 3792 * not double account here
3781 */ 3793 */
3782 zone_movable_pfn[nid] = end_pfn; 3794 zone_movable_pfn[nid] = end_pfn;
3783 continue; 3795 continue;
3784 } 3796 }
3785 start_pfn = usable_startpfn; 3797 start_pfn = usable_startpfn;
3786 } 3798 }
3787 3799
3788 /* 3800 /*
3789 * The usable PFN range for ZONE_MOVABLE is from 3801 * The usable PFN range for ZONE_MOVABLE is from
3790 * start_pfn->end_pfn. Calculate size_pages as the 3802 * start_pfn->end_pfn. Calculate size_pages as the
3791 * number of pages used as kernelcore 3803 * number of pages used as kernelcore
3792 */ 3804 */
3793 size_pages = end_pfn - start_pfn; 3805 size_pages = end_pfn - start_pfn;
3794 if (size_pages > kernelcore_remaining) 3806 if (size_pages > kernelcore_remaining)
3795 size_pages = kernelcore_remaining; 3807 size_pages = kernelcore_remaining;
3796 zone_movable_pfn[nid] = start_pfn + size_pages; 3808 zone_movable_pfn[nid] = start_pfn + size_pages;
3797 3809
3798 /* 3810 /*
3799 * Some kernelcore has been met, update counts and 3811 * Some kernelcore has been met, update counts and
3800 * break if the kernelcore for this node has been 3812 * break if the kernelcore for this node has been
3801 * satisfied 3813 * satisfied
3802 */ 3814 */
3803 required_kernelcore -= min(required_kernelcore, 3815 required_kernelcore -= min(required_kernelcore,
3804 size_pages); 3816 size_pages);
3805 kernelcore_remaining -= size_pages; 3817 kernelcore_remaining -= size_pages;
3806 if (!kernelcore_remaining) 3818 if (!kernelcore_remaining)
3807 break; 3819 break;
3808 } 3820 }
3809 } 3821 }
3810 3822
3811 /* 3823 /*
3812 * If there is still required_kernelcore, we do another pass with one 3824 * If there is still required_kernelcore, we do another pass with one
3813 * less node in the count. This will push zone_movable_pfn[nid] further 3825 * less node in the count. This will push zone_movable_pfn[nid] further
3814 * along on the nodes that still have memory until kernelcore is 3826 * along on the nodes that still have memory until kernelcore is
3815 * satisfied 3827 * satisfied
3816 */ 3828 */
3817 usable_nodes--; 3829 usable_nodes--;
3818 if (usable_nodes && required_kernelcore > usable_nodes) 3830 if (usable_nodes && required_kernelcore > usable_nodes)
3819 goto restart; 3831 goto restart;
3820 3832
3821 /* Align start of ZONE_MOVABLE on all nids to MAX_ORDER_NR_PAGES */ 3833 /* Align start of ZONE_MOVABLE on all nids to MAX_ORDER_NR_PAGES */
3822 for (nid = 0; nid < MAX_NUMNODES; nid++) 3834 for (nid = 0; nid < MAX_NUMNODES; nid++)
3823 zone_movable_pfn[nid] = 3835 zone_movable_pfn[nid] =
3824 roundup(zone_movable_pfn[nid], MAX_ORDER_NR_PAGES); 3836 roundup(zone_movable_pfn[nid], MAX_ORDER_NR_PAGES);
3825 } 3837 }
3826 3838
3827 /* Any regular memory on that node ? */ 3839 /* Any regular memory on that node ? */
3828 static void check_for_regular_memory(pg_data_t *pgdat) 3840 static void check_for_regular_memory(pg_data_t *pgdat)
3829 { 3841 {
3830 #ifdef CONFIG_HIGHMEM 3842 #ifdef CONFIG_HIGHMEM
3831 enum zone_type zone_type; 3843 enum zone_type zone_type;
3832 3844
3833 for (zone_type = 0; zone_type <= ZONE_NORMAL; zone_type++) { 3845 for (zone_type = 0; zone_type <= ZONE_NORMAL; zone_type++) {
3834 struct zone *zone = &pgdat->node_zones[zone_type]; 3846 struct zone *zone = &pgdat->node_zones[zone_type];
3835 if (zone->present_pages) 3847 if (zone->present_pages)
3836 node_set_state(zone_to_nid(zone), N_NORMAL_MEMORY); 3848 node_set_state(zone_to_nid(zone), N_NORMAL_MEMORY);
3837 } 3849 }
3838 #endif 3850 #endif
3839 } 3851 }
3840 3852
3841 /** 3853 /**
3842 * free_area_init_nodes - Initialise all pg_data_t and zone data 3854 * free_area_init_nodes - Initialise all pg_data_t and zone data
3843 * @max_zone_pfn: an array of max PFNs for each zone 3855 * @max_zone_pfn: an array of max PFNs for each zone
3844 * 3856 *
3845 * This will call free_area_init_node() for each active node in the system. 3857 * This will call free_area_init_node() for each active node in the system.
3846 * Using the page ranges provided by add_active_range(), the size of each 3858 * Using the page ranges provided by add_active_range(), the size of each
3847 * zone in each node and their holes is calculated. If the maximum PFNs 3859 * zone in each node and their holes is calculated. If the maximum PFNs
3848 * of two adjacent zones match, it is assumed that the higher zone is empty. 3860 * of two adjacent zones match, it is assumed that the higher zone is empty.
3849 * For example, if arch_max_dma_pfn == arch_max_dma32_pfn, it is assumed 3861 * For example, if arch_max_dma_pfn == arch_max_dma32_pfn, it is assumed
3850 * that arch_max_dma32_pfn has no pages. It is also assumed that a zone 3862 * that arch_max_dma32_pfn has no pages. It is also assumed that a zone
3851 * starts where the previous one ended. For example, ZONE_DMA32 starts 3863 * starts where the previous one ended. For example, ZONE_DMA32 starts
3852 * at arch_max_dma_pfn. 3864 * at arch_max_dma_pfn.
3853 */ 3865 */
3854 void __init free_area_init_nodes(unsigned long *max_zone_pfn) 3866 void __init free_area_init_nodes(unsigned long *max_zone_pfn)
3855 { 3867 {
3856 unsigned long nid; 3868 unsigned long nid;
3857 enum zone_type i; 3869 enum zone_type i;
3858 3870
3859 /* Sort early_node_map as initialisation assumes it is sorted */ 3871 /* Sort early_node_map as initialisation assumes it is sorted */
3860 sort_node_map(); 3872 sort_node_map();
3861 3873
3862 /* Record where the zone boundaries are */ 3874 /* Record where the zone boundaries are */
3863 memset(arch_zone_lowest_possible_pfn, 0, 3875 memset(arch_zone_lowest_possible_pfn, 0,
3864 sizeof(arch_zone_lowest_possible_pfn)); 3876 sizeof(arch_zone_lowest_possible_pfn));
3865 memset(arch_zone_highest_possible_pfn, 0, 3877 memset(arch_zone_highest_possible_pfn, 0,
3866 sizeof(arch_zone_highest_possible_pfn)); 3878 sizeof(arch_zone_highest_possible_pfn));
3867 arch_zone_lowest_possible_pfn[0] = find_min_pfn_with_active_regions(); 3879 arch_zone_lowest_possible_pfn[0] = find_min_pfn_with_active_regions();
3868 arch_zone_highest_possible_pfn[0] = max_zone_pfn[0]; 3880 arch_zone_highest_possible_pfn[0] = max_zone_pfn[0];
3869 for (i = 1; i < MAX_NR_ZONES; i++) { 3881 for (i = 1; i < MAX_NR_ZONES; i++) {
3870 if (i == ZONE_MOVABLE) 3882 if (i == ZONE_MOVABLE)
3871 continue; 3883 continue;
3872 arch_zone_lowest_possible_pfn[i] = 3884 arch_zone_lowest_possible_pfn[i] =
3873 arch_zone_highest_possible_pfn[i-1]; 3885 arch_zone_highest_possible_pfn[i-1];
3874 arch_zone_highest_possible_pfn[i] = 3886 arch_zone_highest_possible_pfn[i] =
3875 max(max_zone_pfn[i], arch_zone_lowest_possible_pfn[i]); 3887 max(max_zone_pfn[i], arch_zone_lowest_possible_pfn[i]);
3876 } 3888 }
3877 arch_zone_lowest_possible_pfn[ZONE_MOVABLE] = 0; 3889 arch_zone_lowest_possible_pfn[ZONE_MOVABLE] = 0;
3878 arch_zone_highest_possible_pfn[ZONE_MOVABLE] = 0; 3890 arch_zone_highest_possible_pfn[ZONE_MOVABLE] = 0;
3879 3891
3880 /* Find the PFNs that ZONE_MOVABLE begins at in each node */ 3892 /* Find the PFNs that ZONE_MOVABLE begins at in each node */
3881 memset(zone_movable_pfn, 0, sizeof(zone_movable_pfn)); 3893 memset(zone_movable_pfn, 0, sizeof(zone_movable_pfn));
3882 find_zone_movable_pfns_for_nodes(zone_movable_pfn); 3894 find_zone_movable_pfns_for_nodes(zone_movable_pfn);
3883 3895
3884 /* Print out the zone ranges */ 3896 /* Print out the zone ranges */
3885 printk("Zone PFN ranges:\n"); 3897 printk("Zone PFN ranges:\n");
3886 for (i = 0; i < MAX_NR_ZONES; i++) { 3898 for (i = 0; i < MAX_NR_ZONES; i++) {
3887 if (i == ZONE_MOVABLE) 3899 if (i == ZONE_MOVABLE)
3888 continue; 3900 continue;
3889 printk(" %-8s %8lu -> %8lu\n", 3901 printk(" %-8s %8lu -> %8lu\n",
3890 zone_names[i], 3902 zone_names[i],
3891 arch_zone_lowest_possible_pfn[i], 3903 arch_zone_lowest_possible_pfn[i],
3892 arch_zone_highest_possible_pfn[i]); 3904 arch_zone_highest_possible_pfn[i]);
3893 } 3905 }
3894 3906
3895 /* Print out the PFNs ZONE_MOVABLE begins at in each node */ 3907 /* Print out the PFNs ZONE_MOVABLE begins at in each node */
3896 printk("Movable zone start PFN for each node\n"); 3908 printk("Movable zone start PFN for each node\n");
3897 for (i = 0; i < MAX_NUMNODES; i++) { 3909 for (i = 0; i < MAX_NUMNODES; i++) {
3898 if (zone_movable_pfn[i]) 3910 if (zone_movable_pfn[i])
3899 printk(" Node %d: %lu\n", i, zone_movable_pfn[i]); 3911 printk(" Node %d: %lu\n", i, zone_movable_pfn[i]);
3900 } 3912 }
3901 3913
3902 /* Print out the early_node_map[] */ 3914 /* Print out the early_node_map[] */
3903 printk("early_node_map[%d] active PFN ranges\n", nr_nodemap_entries); 3915 printk("early_node_map[%d] active PFN ranges\n", nr_nodemap_entries);
3904 for (i = 0; i < nr_nodemap_entries; i++) 3916 for (i = 0; i < nr_nodemap_entries; i++)
3905 printk(" %3d: %8lu -> %8lu\n", early_node_map[i].nid, 3917 printk(" %3d: %8lu -> %8lu\n", early_node_map[i].nid,
3906 early_node_map[i].start_pfn, 3918 early_node_map[i].start_pfn,
3907 early_node_map[i].end_pfn); 3919 early_node_map[i].end_pfn);
3908 3920
3909 /* Initialise every node */ 3921 /* Initialise every node */
3910 setup_nr_node_ids(); 3922 setup_nr_node_ids();
3911 for_each_online_node(nid) { 3923 for_each_online_node(nid) {
3912 pg_data_t *pgdat = NODE_DATA(nid); 3924 pg_data_t *pgdat = NODE_DATA(nid);
3913 free_area_init_node(nid, pgdat, NULL, 3925 free_area_init_node(nid, pgdat, NULL,
3914 find_min_pfn_for_node(nid), NULL); 3926 find_min_pfn_for_node(nid), NULL);
3915 3927
3916 /* Any memory on that node */ 3928 /* Any memory on that node */
3917 if (pgdat->node_present_pages) 3929 if (pgdat->node_present_pages)
3918 node_set_state(nid, N_HIGH_MEMORY); 3930 node_set_state(nid, N_HIGH_MEMORY);
3919 check_for_regular_memory(pgdat); 3931 check_for_regular_memory(pgdat);
3920 } 3932 }
3921 } 3933 }
3922 3934
3923 static int __init cmdline_parse_core(char *p, unsigned long *core) 3935 static int __init cmdline_parse_core(char *p, unsigned long *core)
3924 { 3936 {
3925 unsigned long long coremem; 3937 unsigned long long coremem;
3926 if (!p) 3938 if (!p)
3927 return -EINVAL; 3939 return -EINVAL;
3928 3940
3929 coremem = memparse(p, &p); 3941 coremem = memparse(p, &p);
3930 *core = coremem >> PAGE_SHIFT; 3942 *core = coremem >> PAGE_SHIFT;
3931 3943
3932 /* Paranoid check that UL is enough for the coremem value */ 3944 /* Paranoid check that UL is enough for the coremem value */
3933 WARN_ON((coremem >> PAGE_SHIFT) > ULONG_MAX); 3945 WARN_ON((coremem >> PAGE_SHIFT) > ULONG_MAX);
3934 3946
3935 return 0; 3947 return 0;
3936 } 3948 }
3937 3949
3938 /* 3950 /*
3939 * kernelcore=size sets the amount of memory to be used for allocations that 3951 * kernelcore=size sets the amount of memory to be used for allocations that
3940 * cannot be reclaimed or migrated. 3952 * cannot be reclaimed or migrated.
3941 */ 3953 */
3942 static int __init cmdline_parse_kernelcore(char *p) 3954 static int __init cmdline_parse_kernelcore(char *p)
3943 { 3955 {
3944 return cmdline_parse_core(p, &required_kernelcore); 3956 return cmdline_parse_core(p, &required_kernelcore);
3945 } 3957 }
3946 3958
3947 /* 3959 /*
3948 * movablecore=size sets the amount of memory to be used for allocations that 3960 * movablecore=size sets the amount of memory to be used for allocations that
3949 * can be reclaimed or migrated. 3961 * can be reclaimed or migrated.
3950 */ 3962 */
3951 static int __init cmdline_parse_movablecore(char *p) 3963 static int __init cmdline_parse_movablecore(char *p)
3952 { 3964 {
3953 return cmdline_parse_core(p, &required_movablecore); 3965 return cmdline_parse_core(p, &required_movablecore);
3954 } 3966 }
3955 3967
3956 early_param("kernelcore", cmdline_parse_kernelcore); 3968 early_param("kernelcore", cmdline_parse_kernelcore);
3957 early_param("movablecore", cmdline_parse_movablecore); 3969 early_param("movablecore", cmdline_parse_movablecore);
3958 3970
3959 #endif /* CONFIG_ARCH_POPULATES_NODE_MAP */ 3971 #endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
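For illustration only (the values below are hypothetical, not taken from this commit): booting with

    kernelcore=512M

makes cmdline_parse_kernelcore() set required_kernelcore to 512M >> PAGE_SHIFT pages (memparse() accepts K/M/G suffixes), while

    movablecore=2G

sets required_movablecore, which find_zone_movable_pfns_for_nodes() above rounds up to MAX_ORDER_NR_PAGES and converts into an equivalent required_kernelcore before spreading kernel memory evenly across the nodes that have memory.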
3960 3972
3961 /** 3973 /**
3962 * set_dma_reserve - set the specified number of pages reserved in the first zone 3974 * set_dma_reserve - set the specified number of pages reserved in the first zone
3963 * @new_dma_reserve: The number of pages to mark reserved 3975 * @new_dma_reserve: The number of pages to mark reserved
3964 * 3976 *
3965 * The per-cpu batchsize and zone watermarks are determined by present_pages. 3977 * The per-cpu batchsize and zone watermarks are determined by present_pages.
3966 * In the DMA zone, a significant percentage may be consumed by kernel image 3978 * In the DMA zone, a significant percentage may be consumed by kernel image
3967 * and other unfreeable allocations which can skew the watermarks badly. This 3979 * and other unfreeable allocations which can skew the watermarks badly. This
3968 * function may optionally be used to account for unfreeable pages in the 3980 * function may optionally be used to account for unfreeable pages in the
3969 * first zone (e.g., ZONE_DMA). The effect will be lower watermarks and 3981 * first zone (e.g., ZONE_DMA). The effect will be lower watermarks and
3970 * smaller per-cpu batchsize. 3982 * smaller per-cpu batchsize.
3971 */ 3983 */
3972 void __init set_dma_reserve(unsigned long new_dma_reserve) 3984 void __init set_dma_reserve(unsigned long new_dma_reserve)
3973 { 3985 {
3974 dma_reserve = new_dma_reserve; 3986 dma_reserve = new_dma_reserve;
3975 } 3987 }
3976 3988
3977 #ifndef CONFIG_NEED_MULTIPLE_NODES 3989 #ifndef CONFIG_NEED_MULTIPLE_NODES
3978 static bootmem_data_t contig_bootmem_data; 3990 static bootmem_data_t contig_bootmem_data;
3979 struct pglist_data contig_page_data = { .bdata = &contig_bootmem_data }; 3991 struct pglist_data contig_page_data = { .bdata = &contig_bootmem_data };
3980 3992
3981 EXPORT_SYMBOL(contig_page_data); 3993 EXPORT_SYMBOL(contig_page_data);
3982 #endif 3994 #endif
3983 3995
3984 void __init free_area_init(unsigned long *zones_size) 3996 void __init free_area_init(unsigned long *zones_size)
3985 { 3997 {
3986 free_area_init_node(0, NODE_DATA(0), zones_size, 3998 free_area_init_node(0, NODE_DATA(0), zones_size,
3987 __pa(PAGE_OFFSET) >> PAGE_SHIFT, NULL); 3999 __pa(PAGE_OFFSET) >> PAGE_SHIFT, NULL);
3988 } 4000 }
3989 4001
3990 static int page_alloc_cpu_notify(struct notifier_block *self, 4002 static int page_alloc_cpu_notify(struct notifier_block *self,
3991 unsigned long action, void *hcpu) 4003 unsigned long action, void *hcpu)
3992 { 4004 {
3993 int cpu = (unsigned long)hcpu; 4005 int cpu = (unsigned long)hcpu;
3994 4006
3995 if (action == CPU_DEAD || action == CPU_DEAD_FROZEN) { 4007 if (action == CPU_DEAD || action == CPU_DEAD_FROZEN) {
3996 drain_pages(cpu); 4008 drain_pages(cpu);
3997 4009
3998 /* 4010 /*
3999 * Spill the event counters of the dead processor 4011 * Spill the event counters of the dead processor
4000 * into the current processors event counters. 4012 * into the current processors event counters.
4001 * This artificially elevates the count of the current 4013 * This artificially elevates the count of the current
4002 * processor. 4014 * processor.
4003 */ 4015 */
4004 vm_events_fold_cpu(cpu); 4016 vm_events_fold_cpu(cpu);
4005 4017
4006 /* 4018 /*
4007 * Zero the differential counters of the dead processor 4019 * Zero the differential counters of the dead processor
4008 * so that the vm statistics are consistent. 4020 * so that the vm statistics are consistent.
4009 * 4021 *
4010 * This is only okay since the processor is dead and cannot 4022 * This is only okay since the processor is dead and cannot
4011 * race with what we are doing. 4023 * race with what we are doing.
4012 */ 4024 */
4013 refresh_cpu_vm_stats(cpu); 4025 refresh_cpu_vm_stats(cpu);
4014 } 4026 }
4015 return NOTIFY_OK; 4027 return NOTIFY_OK;
4016 } 4028 }
4017 4029
4018 void __init page_alloc_init(void) 4030 void __init page_alloc_init(void)
4019 { 4031 {
4020 hotcpu_notifier(page_alloc_cpu_notify, 0); 4032 hotcpu_notifier(page_alloc_cpu_notify, 0);
4021 } 4033 }
4022 4034
4023 /* 4035 /*
4024 * calculate_totalreserve_pages - called when sysctl_lowmem_reserve_ratio 4036 * calculate_totalreserve_pages - called when sysctl_lowmem_reserve_ratio
4025 * or min_free_kbytes changes. 4037 * or min_free_kbytes changes.
4026 */ 4038 */
4027 static void calculate_totalreserve_pages(void) 4039 static void calculate_totalreserve_pages(void)
4028 { 4040 {
4029 struct pglist_data *pgdat; 4041 struct pglist_data *pgdat;
4030 unsigned long reserve_pages = 0; 4042 unsigned long reserve_pages = 0;
4031 enum zone_type i, j; 4043 enum zone_type i, j;
4032 4044
4033 for_each_online_pgdat(pgdat) { 4045 for_each_online_pgdat(pgdat) {
4034 for (i = 0; i < MAX_NR_ZONES; i++) { 4046 for (i = 0; i < MAX_NR_ZONES; i++) {
4035 struct zone *zone = pgdat->node_zones + i; 4047 struct zone *zone = pgdat->node_zones + i;
4036 unsigned long max = 0; 4048 unsigned long max = 0;
4037 4049
4038 /* Find valid and maximum lowmem_reserve in the zone */ 4050 /* Find valid and maximum lowmem_reserve in the zone */
4039 for (j = i; j < MAX_NR_ZONES; j++) { 4051 for (j = i; j < MAX_NR_ZONES; j++) {
4040 if (zone->lowmem_reserve[j] > max) 4052 if (zone->lowmem_reserve[j] > max)
4041 max = zone->lowmem_reserve[j]; 4053 max = zone->lowmem_reserve[j];
4042 } 4054 }
4043 4055
4044 /* we treat pages_high as reserved pages. */ 4056 /* we treat pages_high as reserved pages. */
4045 max += zone->pages_high; 4057 max += zone->pages_high;
4046 4058
4047 if (max > zone->present_pages) 4059 if (max > zone->present_pages)
4048 max = zone->present_pages; 4060 max = zone->present_pages;
4049 reserve_pages += max; 4061 reserve_pages += max;
4050 } 4062 }
4051 } 4063 }
4052 totalreserve_pages = reserve_pages; 4064 totalreserve_pages = reserve_pages;
4053 } 4065 }
4054 4066
4055 /* 4067 /*
4056 * setup_per_zone_lowmem_reserve - called whenever 4068 * setup_per_zone_lowmem_reserve - called whenever
4057 * sysctl_lowmem_reserve_ratio changes. Ensures that each zone 4069 * sysctl_lowmem_reserve_ratio changes. Ensures that each zone
4058 * has a correct pages reserved value, so an adequate number of 4070 * has a correct pages reserved value, so an adequate number of
4059 * pages are left in the zone after a successful __alloc_pages(). 4071 * pages are left in the zone after a successful __alloc_pages().
4060 */ 4072 */
4061 static void setup_per_zone_lowmem_reserve(void) 4073 static void setup_per_zone_lowmem_reserve(void)
4062 { 4074 {
4063 struct pglist_data *pgdat; 4075 struct pglist_data *pgdat;
4064 enum zone_type j, idx; 4076 enum zone_type j, idx;
4065 4077
4066 for_each_online_pgdat(pgdat) { 4078 for_each_online_pgdat(pgdat) {
4067 for (j = 0; j < MAX_NR_ZONES; j++) { 4079 for (j = 0; j < MAX_NR_ZONES; j++) {
4068 struct zone *zone = pgdat->node_zones + j; 4080 struct zone *zone = pgdat->node_zones + j;
4069 unsigned long present_pages = zone->present_pages; 4081 unsigned long present_pages = zone->present_pages;
4070 4082
4071 zone->lowmem_reserve[j] = 0; 4083 zone->lowmem_reserve[j] = 0;
4072 4084
4073 idx = j; 4085 idx = j;
4074 while (idx) { 4086 while (idx) {
4075 struct zone *lower_zone; 4087 struct zone *lower_zone;
4076 4088
4077 idx--; 4089 idx--;
4078 4090
4079 if (sysctl_lowmem_reserve_ratio[idx] < 1) 4091 if (sysctl_lowmem_reserve_ratio[idx] < 1)
4080 sysctl_lowmem_reserve_ratio[idx] = 1; 4092 sysctl_lowmem_reserve_ratio[idx] = 1;
4081 4093
4082 lower_zone = pgdat->node_zones + idx; 4094 lower_zone = pgdat->node_zones + idx;
4083 lower_zone->lowmem_reserve[j] = present_pages / 4095 lower_zone->lowmem_reserve[j] = present_pages /
4084 sysctl_lowmem_reserve_ratio[idx]; 4096 sysctl_lowmem_reserve_ratio[idx];
4085 present_pages += lower_zone->present_pages; 4097 present_pages += lower_zone->present_pages;
4086 } 4098 }
4087 } 4099 }
4088 } 4100 }
4089 4101
4090 /* update totalreserve_pages */ 4102 /* update totalreserve_pages */
4091 calculate_totalreserve_pages(); 4103 calculate_totalreserve_pages();
4092 } 4104 }
4093 4105
4094 /** 4106 /**
4095 * setup_per_zone_pages_min - called when min_free_kbytes changes. 4107 * setup_per_zone_pages_min - called when min_free_kbytes changes.
4096 * 4108 *
4097 * Ensures that the pages_{min,low,high} values for each zone are set correctly 4109 * Ensures that the pages_{min,low,high} values for each zone are set correctly
4098 * with respect to min_free_kbytes. 4110 * with respect to min_free_kbytes.
4099 */ 4111 */
4100 void setup_per_zone_pages_min(void) 4112 void setup_per_zone_pages_min(void)
4101 { 4113 {
4102 unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10); 4114 unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
4103 unsigned long lowmem_pages = 0; 4115 unsigned long lowmem_pages = 0;
4104 struct zone *zone; 4116 struct zone *zone;
4105 unsigned long flags; 4117 unsigned long flags;
4106 4118
4107 /* Calculate total number of !ZONE_HIGHMEM pages */ 4119 /* Calculate total number of !ZONE_HIGHMEM pages */
4108 for_each_zone(zone) { 4120 for_each_zone(zone) {
4109 if (!is_highmem(zone)) 4121 if (!is_highmem(zone))
4110 lowmem_pages += zone->present_pages; 4122 lowmem_pages += zone->present_pages;
4111 } 4123 }
4112 4124
4113 for_each_zone(zone) { 4125 for_each_zone(zone) {
4114 u64 tmp; 4126 u64 tmp;
4115 4127
4116 spin_lock_irqsave(&zone->lru_lock, flags); 4128 spin_lock_irqsave(&zone->lru_lock, flags);
4117 tmp = (u64)pages_min * zone->present_pages; 4129 tmp = (u64)pages_min * zone->present_pages;
4118 do_div(tmp, lowmem_pages); 4130 do_div(tmp, lowmem_pages);
4119 if (is_highmem(zone)) { 4131 if (is_highmem(zone)) {
4120 /* 4132 /*
4121 * __GFP_HIGH and PF_MEMALLOC allocations usually don't 4133 * __GFP_HIGH and PF_MEMALLOC allocations usually don't
4122 * need highmem pages, so cap pages_min to a small 4134 * need highmem pages, so cap pages_min to a small
4123 * value here. 4135 * value here.
4124 * 4136 *
4125 * The (pages_high-pages_low) and (pages_low-pages_min) 4137 * The (pages_high-pages_low) and (pages_low-pages_min)
4126 * deltas control asynchronous page reclaim, and so should 4138 * deltas control asynchronous page reclaim, and so should
4127 * not be capped for highmem. 4139 * not be capped for highmem.
4128 */ 4140 */
4129 int min_pages; 4141 int min_pages;
4130 4142
4131 min_pages = zone->present_pages / 1024; 4143 min_pages = zone->present_pages / 1024;
4132 if (min_pages < SWAP_CLUSTER_MAX) 4144 if (min_pages < SWAP_CLUSTER_MAX)
4133 min_pages = SWAP_CLUSTER_MAX; 4145 min_pages = SWAP_CLUSTER_MAX;
4134 if (min_pages > 128) 4146 if (min_pages > 128)
4135 min_pages = 128; 4147 min_pages = 128;
4136 zone->pages_min = min_pages; 4148 zone->pages_min = min_pages;
4137 } else { 4149 } else {
4138 /* 4150 /*
4139 * If it's a lowmem zone, reserve a number of pages 4151 * If it's a lowmem zone, reserve a number of pages
4140 * proportionate to the zone's size. 4152 * proportionate to the zone's size.
4141 */ 4153 */
4142 zone->pages_min = tmp; 4154 zone->pages_min = tmp;
4143 } 4155 }
4144 4156
4145 zone->pages_low = zone->pages_min + (tmp >> 2); 4157 zone->pages_low = zone->pages_min + (tmp >> 2);
4146 zone->pages_high = zone->pages_min + (tmp >> 1); 4158 zone->pages_high = zone->pages_min + (tmp >> 1);
4147 setup_zone_migrate_reserve(zone); 4159 setup_zone_migrate_reserve(zone);
4148 spin_unlock_irqrestore(&zone->lru_lock, flags); 4160 spin_unlock_irqrestore(&zone->lru_lock, flags);
4149 } 4161 }
4150 4162
4151 /* update totalreserve_pages */ 4163 /* update totalreserve_pages */
4152 calculate_totalreserve_pages(); 4164 calculate_totalreserve_pages();
4153 } 4165 }
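Worked example with hypothetical numbers: with min_free_kbytes = 4096k and 4KB pages, pages_min above is 4096 >> 2 = 1024 pages system-wide. A lowmem zone holding half of all lowmem gets tmp = 1024 * (lowmem/2) / lowmem = 512, so that zone ends up with pages_min = 512, pages_low = 512 + (512 >> 2) = 640 and pages_high = 512 + (512 >> 1) = 768.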
4154 4166
4155 /* 4167 /*
4156 * Initialise min_free_kbytes. 4168 * Initialise min_free_kbytes.
4157 * 4169 *
4158 * For small machines we want it small (128k min). For large machines 4170 * For small machines we want it small (128k min). For large machines
4159 * we want it large (64MB max). But it is not linear, because network 4171 * we want it large (64MB max). But it is not linear, because network
4160 * bandwidth does not increase linearly with machine size. We use 4172 * bandwidth does not increase linearly with machine size. We use
4161 * 4173 *
4162 * min_free_kbytes = 4 * sqrt(lowmem_kbytes), for better accuracy: 4174 * min_free_kbytes = 4 * sqrt(lowmem_kbytes), for better accuracy:
4163 * min_free_kbytes = sqrt(lowmem_kbytes * 16) 4175 * min_free_kbytes = sqrt(lowmem_kbytes * 16)
4164 * 4176 *
4165 * which yields 4177 * which yields
4166 * 4178 *
4167 * 16MB: 512k 4179 * 16MB: 512k
4168 * 32MB: 724k 4180 * 32MB: 724k
4169 * 64MB: 1024k 4181 * 64MB: 1024k
4170 * 128MB: 1448k 4182 * 128MB: 1448k
4171 * 256MB: 2048k 4183 * 256MB: 2048k
4172 * 512MB: 2896k 4184 * 512MB: 2896k
4173 * 1024MB: 4096k 4185 * 1024MB: 4096k
4174 * 2048MB: 5792k 4186 * 2048MB: 5792k
4175 * 4096MB: 8192k 4187 * 4096MB: 8192k
4176 * 8192MB: 11584k 4188 * 8192MB: 11584k
4177 * 16384MB: 16384k 4189 * 16384MB: 16384k
4178 */ 4190 */
4179 static int __init init_per_zone_pages_min(void) 4191 static int __init init_per_zone_pages_min(void)
4180 { 4192 {
4181 unsigned long lowmem_kbytes; 4193 unsigned long lowmem_kbytes;
4182 4194
4183 lowmem_kbytes = nr_free_buffer_pages() * (PAGE_SIZE >> 10); 4195 lowmem_kbytes = nr_free_buffer_pages() * (PAGE_SIZE >> 10);
4184 4196
4185 min_free_kbytes = int_sqrt(lowmem_kbytes * 16); 4197 min_free_kbytes = int_sqrt(lowmem_kbytes * 16);
4186 if (min_free_kbytes < 128) 4198 if (min_free_kbytes < 128)
4187 min_free_kbytes = 128; 4199 min_free_kbytes = 128;
4188 if (min_free_kbytes > 65536) 4200 if (min_free_kbytes > 65536)
4189 min_free_kbytes = 65536; 4201 min_free_kbytes = 65536;
4190 setup_per_zone_pages_min(); 4202 setup_per_zone_pages_min();
4191 setup_per_zone_lowmem_reserve(); 4203 setup_per_zone_lowmem_reserve();
4192 return 0; 4204 return 0;
4193 } 4205 }
4194 module_init(init_per_zone_pages_min) 4206 module_init(init_per_zone_pages_min)
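The table in the comment above can be reproduced with a small userspace sketch of the same formula (illustrative only; the kernel uses int_sqrt() on nr_free_buffer_pages(), and the helper name below is made up):

#include <math.h>
#include <stdio.h>

/* Same clamps as init_per_zone_pages_min(): 128k minimum, 64MB maximum */
static unsigned long min_free_kbytes_for(unsigned long lowmem_kbytes)
{
	unsigned long v = (unsigned long)sqrt((double)lowmem_kbytes * 16);

	if (v < 128)
		v = 128;
	if (v > 65536)
		v = 65536;
	return v;
}

int main(void)
{
	/* 16MB, 1024MB and 16384MB of lowmem, as in the table above */
	unsigned long mb[] = { 16, 1024, 16384 };
	int i;

	for (i = 0; i < 3; i++)	/* prints 512k, 4096k, 16384k */
		printf("%6luMB: %luk\n", mb[i],
		       min_free_kbytes_for(mb[i] * 1024));
	return 0;
}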
4195 4207
4196 /* 4208 /*
4197 * min_free_kbytes_sysctl_handler - just a wrapper around proc_dointvec() so 4209 * min_free_kbytes_sysctl_handler - just a wrapper around proc_dointvec() so
4198 * that we can call two helper functions whenever min_free_kbytes 4210 * that we can call two helper functions whenever min_free_kbytes
4199 * changes. 4211 * changes.
4200 */ 4212 */
4201 int min_free_kbytes_sysctl_handler(ctl_table *table, int write, 4213 int min_free_kbytes_sysctl_handler(ctl_table *table, int write,
4202 struct file *file, void __user *buffer, size_t *length, loff_t *ppos) 4214 struct file *file, void __user *buffer, size_t *length, loff_t *ppos)
4203 { 4215 {
4204 proc_dointvec(table, write, file, buffer, length, ppos); 4216 proc_dointvec(table, write, file, buffer, length, ppos);
4205 if (write) 4217 if (write)
4206 setup_per_zone_pages_min(); 4218 setup_per_zone_pages_min();
4207 return 0; 4219 return 0;
4208 } 4220 }
4209 4221
4210 #ifdef CONFIG_NUMA 4222 #ifdef CONFIG_NUMA
4211 int sysctl_min_unmapped_ratio_sysctl_handler(ctl_table *table, int write, 4223 int sysctl_min_unmapped_ratio_sysctl_handler(ctl_table *table, int write,
4212 struct file *file, void __user *buffer, size_t *length, loff_t *ppos) 4224 struct file *file, void __user *buffer, size_t *length, loff_t *ppos)
4213 { 4225 {
4214 struct zone *zone; 4226 struct zone *zone;
4215 int rc; 4227 int rc;
4216 4228
4217 rc = proc_dointvec_minmax(table, write, file, buffer, length, ppos); 4229 rc = proc_dointvec_minmax(table, write, file, buffer, length, ppos);
4218 if (rc) 4230 if (rc)
4219 return rc; 4231 return rc;
4220 4232
4221 for_each_zone(zone) 4233 for_each_zone(zone)
4222 zone->min_unmapped_pages = (zone->present_pages * 4234 zone->min_unmapped_pages = (zone->present_pages *
4223 sysctl_min_unmapped_ratio) / 100; 4235 sysctl_min_unmapped_ratio) / 100;
4224 return 0; 4236 return 0;
4225 } 4237 }
4226 4238
4227 int sysctl_min_slab_ratio_sysctl_handler(ctl_table *table, int write, 4239 int sysctl_min_slab_ratio_sysctl_handler(ctl_table *table, int write,
4228 struct file *file, void __user *buffer, size_t *length, loff_t *ppos) 4240 struct file *file, void __user *buffer, size_t *length, loff_t *ppos)
4229 { 4241 {
4230 struct zone *zone; 4242 struct zone *zone;
4231 int rc; 4243 int rc;
4232 4244
4233 rc = proc_dointvec_minmax(table, write, file, buffer, length, ppos); 4245 rc = proc_dointvec_minmax(table, write, file, buffer, length, ppos);
4234 if (rc) 4246 if (rc)
4235 return rc; 4247 return rc;
4236 4248
4237 for_each_zone(zone) 4249 for_each_zone(zone)
4238 zone->min_slab_pages = (zone->present_pages * 4250 zone->min_slab_pages = (zone->present_pages *
4239 sysctl_min_slab_ratio) / 100; 4251 sysctl_min_slab_ratio) / 100;
4240 return 0; 4252 return 0;
4241 } 4253 }
4242 #endif 4254 #endif
4243 4255
4244 /* 4256 /*
4245 * lowmem_reserve_ratio_sysctl_handler - just a wrapper around 4257 * lowmem_reserve_ratio_sysctl_handler - just a wrapper around
4246 * proc_dointvec() so that we can call setup_per_zone_lowmem_reserve() 4258 * proc_dointvec() so that we can call setup_per_zone_lowmem_reserve()
4247 * whenever sysctl_lowmem_reserve_ratio changes. 4259 * whenever sysctl_lowmem_reserve_ratio changes.
4248 * 4260 *
4249 * The reserve ratio obviously has absolutely no relation with the 4261 * The reserve ratio obviously has absolutely no relation with the
4250 * pages_min watermarks. The lowmem reserve ratio can only make sense 4262 * pages_min watermarks. The lowmem reserve ratio can only make sense
4251 * as a function of the boot-time zone sizes. 4263 * as a function of the boot-time zone sizes.
4252 */ 4264 */
4253 int lowmem_reserve_ratio_sysctl_handler(ctl_table *table, int write, 4265 int lowmem_reserve_ratio_sysctl_handler(ctl_table *table, int write,
4254 struct file *file, void __user *buffer, size_t *length, loff_t *ppos) 4266 struct file *file, void __user *buffer, size_t *length, loff_t *ppos)
4255 { 4267 {
4256 proc_dointvec_minmax(table, write, file, buffer, length, ppos); 4268 proc_dointvec_minmax(table, write, file, buffer, length, ppos);
4257 setup_per_zone_lowmem_reserve(); 4269 setup_per_zone_lowmem_reserve();
4258 return 0; 4270 return 0;
4259 } 4271 }
4260 4272
4261 /* 4273 /*
4262 * percpu_pagelist_fraction - changes the pcp->high for each zone on each 4274 * percpu_pagelist_fraction - changes the pcp->high for each zone on each
4263 * cpu. It is the fraction of total pages in each zone that a hot per cpu pagelist 4275 * cpu. It is the fraction of total pages in each zone that a hot per cpu pagelist
4264 * can have before it gets flushed back to the buddy allocator. 4276 * can have before it gets flushed back to the buddy allocator.
4265 */ 4277 */
4266 4278
4267 int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write, 4279 int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write,
4268 struct file *file, void __user *buffer, size_t *length, loff_t *ppos) 4280 struct file *file, void __user *buffer, size_t *length, loff_t *ppos)
4269 { 4281 {
4270 struct zone *zone; 4282 struct zone *zone;
4271 unsigned int cpu; 4283 unsigned int cpu;
4272 int ret; 4284 int ret;
4273 4285
4274 ret = proc_dointvec_minmax(table, write, file, buffer, length, ppos); 4286 ret = proc_dointvec_minmax(table, write, file, buffer, length, ppos);
4275 if (!write || (ret == -EINVAL)) 4287 if (!write || (ret == -EINVAL))
4276 return ret; 4288 return ret;
4277 for_each_zone(zone) { 4289 for_each_zone(zone) {
4278 for_each_online_cpu(cpu) { 4290 for_each_online_cpu(cpu) {
4279 unsigned long high; 4291 unsigned long high;
4280 high = zone->present_pages / percpu_pagelist_fraction; 4292 high = zone->present_pages / percpu_pagelist_fraction;
4281 setup_pagelist_highmark(zone_pcp(zone, cpu), high); 4293 setup_pagelist_highmark(zone_pcp(zone, cpu), high);
4282 } 4294 }
4283 } 4295 }
4284 return 0; 4296 return 0;
4285 } 4297 }
4286 4298
4287 int hashdist = HASHDIST_DEFAULT; 4299 int hashdist = HASHDIST_DEFAULT;
4288 4300
4289 #ifdef CONFIG_NUMA 4301 #ifdef CONFIG_NUMA
4290 static int __init set_hashdist(char *str) 4302 static int __init set_hashdist(char *str)
4291 { 4303 {
4292 if (!str) 4304 if (!str)
4293 return 0; 4305 return 0;
4294 hashdist = simple_strtoul(str, &str, 0); 4306 hashdist = simple_strtoul(str, &str, 0);
4295 return 1; 4307 return 1;
4296 } 4308 }
4297 __setup("hashdist=", set_hashdist); 4309 __setup("hashdist=", set_hashdist);
4298 #endif 4310 #endif
4299 4311
4300 /* 4312 /*
4301 * allocate a large system hash table from bootmem 4313 * allocate a large system hash table from bootmem
4302 * - it is assumed that the hash table must contain an exact power-of-2 4314 * - it is assumed that the hash table must contain an exact power-of-2
4303 * quantity of entries 4315 * quantity of entries
4304 * - limit is the number of hash buckets, not the total allocation size 4316 * - limit is the number of hash buckets, not the total allocation size
4305 */ 4317 */
4306 void *__init alloc_large_system_hash(const char *tablename, 4318 void *__init alloc_large_system_hash(const char *tablename,
4307 unsigned long bucketsize, 4319 unsigned long bucketsize,
4308 unsigned long numentries, 4320 unsigned long numentries,
4309 int scale, 4321 int scale,
4310 int flags, 4322 int flags,
4311 unsigned int *_hash_shift, 4323 unsigned int *_hash_shift,
4312 unsigned int *_hash_mask, 4324 unsigned int *_hash_mask,
4313 unsigned long limit) 4325 unsigned long limit)
4314 { 4326 {
4315 unsigned long long max = limit; 4327 unsigned long long max = limit;
4316 unsigned long log2qty, size; 4328 unsigned long log2qty, size;
4317 void *table = NULL; 4329 void *table = NULL;
4318 4330
4319 /* allow the kernel cmdline to have a say */ 4331 /* allow the kernel cmdline to have a say */
4320 if (!numentries) { 4332 if (!numentries) {
4321 /* round applicable memory size up to nearest megabyte */ 4333 /* round applicable memory size up to nearest megabyte */
4322 numentries = nr_kernel_pages; 4334 numentries = nr_kernel_pages;
4323 numentries += (1UL << (20 - PAGE_SHIFT)) - 1; 4335 numentries += (1UL << (20 - PAGE_SHIFT)) - 1;
4324 numentries >>= 20 - PAGE_SHIFT; 4336 numentries >>= 20 - PAGE_SHIFT;
4325 numentries <<= 20 - PAGE_SHIFT; 4337 numentries <<= 20 - PAGE_SHIFT;
4326 4338
4327 /* limit to 1 bucket per 2^scale bytes of low memory */ 4339 /* limit to 1 bucket per 2^scale bytes of low memory */
4328 if (scale > PAGE_SHIFT) 4340 if (scale > PAGE_SHIFT)
4329 numentries >>= (scale - PAGE_SHIFT); 4341 numentries >>= (scale - PAGE_SHIFT);
4330 else 4342 else
4331 numentries <<= (PAGE_SHIFT - scale); 4343 numentries <<= (PAGE_SHIFT - scale);
4332 4344
4333 /* Make sure we've got at least a 0-order allocation.. */ 4345 /* Make sure we've got at least a 0-order allocation.. */
4334 if (unlikely((numentries * bucketsize) < PAGE_SIZE)) 4346 if (unlikely((numentries * bucketsize) < PAGE_SIZE))
4335 numentries = PAGE_SIZE / bucketsize; 4347 numentries = PAGE_SIZE / bucketsize;
4336 } 4348 }
4337 numentries = roundup_pow_of_two(numentries); 4349 numentries = roundup_pow_of_two(numentries);
4338 4350
4339 /* limit allocation size to 1/16 total memory by default */ 4351 /* limit allocation size to 1/16 total memory by default */
4340 if (max == 0) { 4352 if (max == 0) {
4341 max = ((unsigned long long)nr_all_pages << PAGE_SHIFT) >> 4; 4353 max = ((unsigned long long)nr_all_pages << PAGE_SHIFT) >> 4;
4342 do_div(max, bucketsize); 4354 do_div(max, bucketsize);
4343 } 4355 }
4344 4356
4345 if (numentries > max) 4357 if (numentries > max)
4346 numentries = max; 4358 numentries = max;
4347 4359
4348 log2qty = ilog2(numentries); 4360 log2qty = ilog2(numentries);
4349 4361
4350 do { 4362 do {
4351 size = bucketsize << log2qty; 4363 size = bucketsize << log2qty;
4352 if (flags & HASH_EARLY) 4364 if (flags & HASH_EARLY)
4353 table = alloc_bootmem(size); 4365 table = alloc_bootmem(size);
4354 else if (hashdist) 4366 else if (hashdist)
4355 table = __vmalloc(size, GFP_ATOMIC, PAGE_KERNEL); 4367 table = __vmalloc(size, GFP_ATOMIC, PAGE_KERNEL);
4356 else { 4368 else {
4357 unsigned long order = get_order(size); 4369 unsigned long order = get_order(size);
4358 table = (void*) __get_free_pages(GFP_ATOMIC, order); 4370 table = (void*) __get_free_pages(GFP_ATOMIC, order);
4359 /* 4371 /*
4360 * If bucketsize is not a power-of-two, we may free 4372 * If bucketsize is not a power-of-two, we may free
4361 * some pages at the end of hash table. 4373 * some pages at the end of hash table.
4362 */ 4374 */
4363 if (table) { 4375 if (table) {
4364 unsigned long alloc_end = (unsigned long)table + 4376 unsigned long alloc_end = (unsigned long)table +
4365 (PAGE_SIZE << order); 4377 (PAGE_SIZE << order);
4366 unsigned long used = (unsigned long)table + 4378 unsigned long used = (unsigned long)table +
4367 PAGE_ALIGN(size); 4379 PAGE_ALIGN(size);
4368 split_page(virt_to_page(table), order); 4380 split_page(virt_to_page(table), order);
4369 while (used < alloc_end) { 4381 while (used < alloc_end) {
4370 free_page(used); 4382 free_page(used);
4371 used += PAGE_SIZE; 4383 used += PAGE_SIZE;
4372 } 4384 }
4373 } 4385 }
4374 } 4386 }
4375 } while (!table && size > PAGE_SIZE && --log2qty); 4387 } while (!table && size > PAGE_SIZE && --log2qty);
4376 4388
4377 if (!table) 4389 if (!table)
4378 panic("Failed to allocate %s hash table\n", tablename); 4390 panic("Failed to allocate %s hash table\n", tablename);
4379 4391
4380 printk(KERN_INFO "%s hash table entries: %d (order: %d, %lu bytes)\n", 4392 printk(KERN_INFO "%s hash table entries: %d (order: %d, %lu bytes)\n",
4381 tablename, 4393 tablename,
4382 (1U << log2qty), 4394 (1U << log2qty),
4383 ilog2(size) - PAGE_SHIFT, 4395 ilog2(size) - PAGE_SHIFT,
4384 size); 4396 size);
4385 4397
4386 if (_hash_shift) 4398 if (_hash_shift)
4387 *_hash_shift = log2qty; 4399 *_hash_shift = log2qty;
4388 if (_hash_mask) 4400 if (_hash_mask)
4389 *_hash_mask = (1 << log2qty) - 1; 4401 *_hash_mask = (1 << log2qty) - 1;
4390 4402
4391 return table; 4403 return table;
4392 } 4404 }
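For context, a typical caller looks roughly like the inode-cache setup in fs/inode.c (a sketch only, not part of this diff; the wrapper function name is made up, the other identifiers are modelled on that caller):

static struct hlist_head *inode_hashtable;
static unsigned int i_hash_shift;
static unsigned int i_hash_mask;

static void __init example_inode_hash_init(unsigned long ihash_entries)
{
	inode_hashtable =
		alloc_large_system_hash("Inode-cache",
					sizeof(struct hlist_head),
					ihash_entries,	/* 0 means auto-size from memory */
					14,		/* scale: one bucket per 2^14 bytes of lowmem */
					HASH_EARLY,	/* allocate from bootmem */
					&i_hash_shift,
					&i_hash_mask,
					0);		/* limit: keep the default 1/16-of-memory cap */
}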
4393 4405
4394 #ifdef CONFIG_OUT_OF_LINE_PFN_TO_PAGE 4406 #ifdef CONFIG_OUT_OF_LINE_PFN_TO_PAGE
4395 struct page *pfn_to_page(unsigned long pfn) 4407 struct page *pfn_to_page(unsigned long pfn)
4396 { 4408 {
4397 return __pfn_to_page(pfn); 4409 return __pfn_to_page(pfn);
4398 } 4410 }
4399 unsigned long page_to_pfn(struct page *page) 4411 unsigned long page_to_pfn(struct page *page)
4400 { 4412 {
4401 return __page_to_pfn(page); 4413 return __page_to_pfn(page);
4402 } 4414 }
4403 EXPORT_SYMBOL(pfn_to_page); 4415 EXPORT_SYMBOL(pfn_to_page);
4404 EXPORT_SYMBOL(page_to_pfn); 4416 EXPORT_SYMBOL(page_to_pfn);
4405 #endif /* CONFIG_OUT_OF_LINE_PFN_TO_PAGE */ 4417 #endif /* CONFIG_OUT_OF_LINE_PFN_TO_PAGE */
4406 4418
4407 /* Return a pointer to the bitmap storing bits affecting a block of pages */ 4419 /* Return a pointer to the bitmap storing bits affecting a block of pages */
4408 static inline unsigned long *get_pageblock_bitmap(struct zone *zone, 4420 static inline unsigned long *get_pageblock_bitmap(struct zone *zone,
4409 unsigned long pfn) 4421 unsigned long pfn)
4410 { 4422 {
4411 #ifdef CONFIG_SPARSEMEM 4423 #ifdef CONFIG_SPARSEMEM
4412 return __pfn_to_section(pfn)->pageblock_flags; 4424 return __pfn_to_section(pfn)->pageblock_flags;
4413 #else 4425 #else
4414 return zone->pageblock_flags; 4426 return zone->pageblock_flags;
4415 #endif /* CONFIG_SPARSEMEM */ 4427 #endif /* CONFIG_SPARSEMEM */
4416 } 4428 }
4417 4429
4418 static inline int pfn_to_bitidx(struct zone *zone, unsigned long pfn) 4430 static inline int pfn_to_bitidx(struct zone *zone, unsigned long pfn)
4419 { 4431 {
4420 #ifdef CONFIG_SPARSEMEM 4432 #ifdef CONFIG_SPARSEMEM
4421 pfn &= (PAGES_PER_SECTION-1); 4433 pfn &= (PAGES_PER_SECTION-1);
4422 return (pfn >> pageblock_order) * NR_PAGEBLOCK_BITS; 4434 return (pfn >> pageblock_order) * NR_PAGEBLOCK_BITS;
4423 #else 4435 #else
4424 pfn = pfn - zone->zone_start_pfn; 4436 pfn = pfn - zone->zone_start_pfn;
4425 return (pfn >> pageblock_order) * NR_PAGEBLOCK_BITS; 4437 return (pfn >> pageblock_order) * NR_PAGEBLOCK_BITS;
4426 #endif /* CONFIG_SPARSEMEM */ 4438 #endif /* CONFIG_SPARSEMEM */
4427 } 4439 }
4428 4440
4429 /** 4441 /**
4430 * get_pageblock_flags_group - Return the requested group of flags for the pageblock_nr_pages block of pages 4442 * get_pageblock_flags_group - Return the requested group of flags for the pageblock_nr_pages block of pages
4431 * @page: The page within the block of interest 4443 * @page: The page within the block of interest
4432 * @start_bitidx: The first bit of interest to retrieve 4444 * @start_bitidx: The first bit of interest to retrieve
4433 * @end_bitidx: The last bit of interest 4445 * @end_bitidx: The last bit of interest
4434 * returns pageblock_bits flags 4446 * returns pageblock_bits flags
4435 */ 4447 */
4436 unsigned long get_pageblock_flags_group(struct page *page, 4448 unsigned long get_pageblock_flags_group(struct page *page,
4437 int start_bitidx, int end_bitidx) 4449 int start_bitidx, int end_bitidx)
4438 { 4450 {
4439 struct zone *zone; 4451 struct zone *zone;
4440 unsigned long *bitmap; 4452 unsigned long *bitmap;
4441 unsigned long pfn, bitidx; 4453 unsigned long pfn, bitidx;
4442 unsigned long flags = 0; 4454 unsigned long flags = 0;
4443 unsigned long value = 1; 4455 unsigned long value = 1;
4444 4456
4445 zone = page_zone(page); 4457 zone = page_zone(page);
4446 pfn = page_to_pfn(page); 4458 pfn = page_to_pfn(page);
4447 bitmap = get_pageblock_bitmap(zone, pfn); 4459 bitmap = get_pageblock_bitmap(zone, pfn);
4448 bitidx = pfn_to_bitidx(zone, pfn); 4460 bitidx = pfn_to_bitidx(zone, pfn);
4449 4461
4450 for (; start_bitidx <= end_bitidx; start_bitidx++, value <<= 1) 4462 for (; start_bitidx <= end_bitidx; start_bitidx++, value <<= 1)
4451 if (test_bit(bitidx + start_bitidx, bitmap)) 4463 if (test_bit(bitidx + start_bitidx, bitmap))
4452 flags |= value; 4464 flags |= value;
4453 4465
4454 return flags; 4466 return flags;
4455 } 4467 }
4456 4468
4457 /** 4469 /**
4458 * set_pageblock_flags_group - Set the requested group of flags for a pageblock_nr_pages block of pages 4470 * set_pageblock_flags_group - Set the requested group of flags for a pageblock_nr_pages block of pages
4459 * @page: The page within the block of interest 4471 * @page: The page within the block of interest
4460 * @start_bitidx: The first bit of interest 4472 * @start_bitidx: The first bit of interest
4461 * @end_bitidx: The last bit of interest 4473 * @end_bitidx: The last bit of interest
4462 * @flags: The flags to set 4474 * @flags: The flags to set
4463 */ 4475 */
4464 void set_pageblock_flags_group(struct page *page, unsigned long flags, 4476 void set_pageblock_flags_group(struct page *page, unsigned long flags,
4465 int start_bitidx, int end_bitidx) 4477 int start_bitidx, int end_bitidx)
4466 { 4478 {
4467 struct zone *zone; 4479 struct zone *zone;
4468 unsigned long *bitmap; 4480 unsigned long *bitmap;
4469 unsigned long pfn, bitidx; 4481 unsigned long pfn, bitidx;
4470 unsigned long value = 1; 4482 unsigned long value = 1;
4471 4483
4472 zone = page_zone(page); 4484 zone = page_zone(page);
4473 pfn = page_to_pfn(page); 4485 pfn = page_to_pfn(page);
4474 bitmap = get_pageblock_bitmap(zone, pfn); 4486 bitmap = get_pageblock_bitmap(zone, pfn);
4475 bitidx = pfn_to_bitidx(zone, pfn); 4487 bitidx = pfn_to_bitidx(zone, pfn);
4476 VM_BUG_ON(pfn < zone->zone_start_pfn); 4488 VM_BUG_ON(pfn < zone->zone_start_pfn);
4477 VM_BUG_ON(pfn >= zone->zone_start_pfn + zone->spanned_pages); 4489 VM_BUG_ON(pfn >= zone->zone_start_pfn + zone->spanned_pages);
4478 4490
4479 for (; start_bitidx <= end_bitidx; start_bitidx++, value <<= 1) 4491 for (; start_bitidx <= end_bitidx; start_bitidx++, value <<= 1)
4480 if (flags & value) 4492 if (flags & value)
4481 __set_bit(bitidx + start_bitidx, bitmap); 4493 __set_bit(bitidx + start_bitidx, bitmap);
4482 else 4494 else
4483 __clear_bit(bitidx + start_bitidx, bitmap); 4495 __clear_bit(bitidx + start_bitidx, bitmap);
4484 } 4496 }
4485 4497
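For illustration, a minimal user-space sketch of the LSB-first bit-group encoding that get_pageblock_flags_group()/set_pageblock_flags_group() implement above; a plain unsigned long stands in for the kernel's pageblock bitmap, and the helper names are made up for the example.

#include <stdio.h>

/* Read and write bits [start, end] of a word, building the group value
 * LSB-first, mirroring the loops above.  Plain shifts stand in for
 * test_bit()/__set_bit()/__clear_bit(). */
static unsigned long get_flags_group(unsigned long bitmap, int start, int end)
{
	unsigned long flags = 0, value = 1;

	for (; start <= end; start++, value <<= 1)
		if (bitmap & (1UL << start))
			flags |= value;
	return flags;
}

static unsigned long set_flags_group(unsigned long bitmap, unsigned long flags,
				     int start, int end)
{
	unsigned long value = 1;

	for (; start <= end; start++, value <<= 1) {
		if (flags & value)
			bitmap |= 1UL << start;
		else
			bitmap &= ~(1UL << start);
	}
	return bitmap;
}

int main(void)
{
	/* Store the 3-bit group 0b101 at bits 4..6, then read it back. */
	unsigned long word = set_flags_group(0, 0x5, 4, 6);

	printf("word=%#lx group=%#lx\n", word, get_flags_group(word, 4, 6));
	return 0;
}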
4486 /* 4498 /*
4487 * This is designed as a helper function; please see page_isolation.c as well. 4499 * This is designed as a helper function; please see page_isolation.c as well.
4488 * It sets/clears a page block's type to ISOLATE. 4500 * It sets/clears a page block's type to ISOLATE.
4489 * The page allocator never allocates memory from an ISOLATE block. 4501 * The page allocator never allocates memory from an ISOLATE block.
4490 */ 4502 */
4491 4503
4492 int set_migratetype_isolate(struct page *page) 4504 int set_migratetype_isolate(struct page *page)
4493 { 4505 {
4494 struct zone *zone; 4506 struct zone *zone;
4495 unsigned long flags; 4507 unsigned long flags;
4496 int ret = -EBUSY; 4508 int ret = -EBUSY;
4497 4509
4498 zone = page_zone(page); 4510 zone = page_zone(page);
4499 spin_lock_irqsave(&zone->lock, flags); 4511 spin_lock_irqsave(&zone->lock, flags);
4500 /* 4512 /*
4501 * In the future, more migrate types will be able to be isolation targets. 4513 * In the future, more migrate types will be able to be isolation targets.
4502 */ 4514 */
4503 if (get_pageblock_migratetype(page) != MIGRATE_MOVABLE) 4515 if (get_pageblock_migratetype(page) != MIGRATE_MOVABLE)
4504 goto out; 4516 goto out;
4505 set_pageblock_migratetype(page, MIGRATE_ISOLATE); 4517 set_pageblock_migratetype(page, MIGRATE_ISOLATE);
4506 move_freepages_block(zone, page, MIGRATE_ISOLATE); 4518 move_freepages_block(zone, page, MIGRATE_ISOLATE);
4507 ret = 0; 4519 ret = 0;
4508 out: 4520 out:
4509 spin_unlock_irqrestore(&zone->lock, flags); 4521 spin_unlock_irqrestore(&zone->lock, flags);
4510 if (!ret) 4522 if (!ret)
4511 drain_all_pages(); 4523 drain_all_pages();
4512 return ret; 4524 return ret;
4513 } 4525 }
4514 4526
4515 void unset_migratetype_isolate(struct page *page) 4527 void unset_migratetype_isolate(struct page *page)
4516 { 4528 {
4517 struct zone *zone; 4529 struct zone *zone;
4518 unsigned long flags; 4530 unsigned long flags;
4519 zone = page_zone(page); 4531 zone = page_zone(page);
4520 spin_lock_irqsave(&zone->lock, flags); 4532 spin_lock_irqsave(&zone->lock, flags);
4521 if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE) 4533 if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE)
4522 goto out; 4534 goto out;
4523 set_pageblock_migratetype(page, MIGRATE_MOVABLE); 4535 set_pageblock_migratetype(page, MIGRATE_MOVABLE);
4524 move_freepages_block(zone, page, MIGRATE_MOVABLE); 4536 move_freepages_block(zone, page, MIGRATE_MOVABLE);
4525 out: 4537 out:
4526 spin_unlock_irqrestore(&zone->lock, flags); 4538 spin_unlock_irqrestore(&zone->lock, flags);
4527 } 4539 }
4528 4540
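The pair above is intended to bracket work on a single pageblock; a hedged sketch of a hypothetical caller follows (the real users live in page_isolation.c and the memory-hotremove path, and do_something_with_block() is a made-up placeholder, not a kernel API).

/* Hypothetical caller sketch: isolate one MIGRATE_MOVABLE pageblock,
 * attempt some operation on it, and restore the migratetype if that
 * operation fails. */
static int operate_on_pageblock(struct page *page)
{
	int ret;

	if (set_migratetype_isolate(page))
		return -EBUSY;		/* block was not MIGRATE_MOVABLE */

	ret = do_something_with_block(page);	/* placeholder */
	if (ret)
		unset_migratetype_isolate(page);
	return ret;
}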
4529 #ifdef CONFIG_MEMORY_HOTREMOVE 4541 #ifdef CONFIG_MEMORY_HOTREMOVE
4530 /* 4542 /*
4531 * All pages in the range must be isolated before calling this. 4543 * All pages in the range must be isolated before calling this.
4532 */ 4544 */
4533 void 4545 void
4534 __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn) 4546 __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
4535 { 4547 {
4536 struct page *page; 4548 struct page *page;
4537 struct zone *zone; 4549 struct zone *zone;
4538 int order, i; 4550 int order, i;
4539 unsigned long pfn; 4551 unsigned long pfn;
4540 unsigned long flags; 4552 unsigned long flags;
4541 /* find the first valid pfn */ 4553 /* find the first valid pfn */
4542 for (pfn = start_pfn; pfn < end_pfn; pfn++) 4554 for (pfn = start_pfn; pfn < end_pfn; pfn++)
4543 if (pfn_valid(pfn)) 4555 if (pfn_valid(pfn))
4544 break; 4556 break;
4545 if (pfn == end_pfn) 4557 if (pfn == end_pfn)
4546 return; 4558 return;
4547 zone = page_zone(pfn_to_page(pfn)); 4559 zone = page_zone(pfn_to_page(pfn));
4548 spin_lock_irqsave(&zone->lock, flags); 4560 spin_lock_irqsave(&zone->lock, flags);
4549 pfn = start_pfn; 4561 pfn = start_pfn;
4550 while (pfn < end_pfn) { 4562 while (pfn < end_pfn) {
4551 if (!pfn_valid(pfn)) { 4563 if (!pfn_valid(pfn)) {
4552 pfn++; 4564 pfn++;
4553 continue; 4565 continue;
4554 } 4566 }
4555 page = pfn_to_page(pfn); 4567 page = pfn_to_page(pfn);
4556 BUG_ON(page_count(page)); 4568 BUG_ON(page_count(page));
4557 BUG_ON(!PageBuddy(page)); 4569 BUG_ON(!PageBuddy(page));
4558 order = page_order(page); 4570 order = page_order(page);
4559 #ifdef CONFIG_DEBUG_VM 4571 #ifdef CONFIG_DEBUG_VM
4560 printk(KERN_INFO "remove from free list %lx %d %lx\n", 4572 printk(KERN_INFO "remove from free list %lx %d %lx\n",
4561 pfn, 1 << order, end_pfn); 4573 pfn, 1 << order, end_pfn);
4562 #endif 4574 #endif
4563 list_del(&page->lru); 4575 list_del(&page->lru);
4564 rmv_page_order(page); 4576 rmv_page_order(page);
4565 zone->free_area[order].nr_free--; 4577 zone->free_area[order].nr_free--;
4566 __mod_zone_page_state(zone, NR_FREE_PAGES, 4578 __mod_zone_page_state(zone, NR_FREE_PAGES,
4567 - (1UL << order)); 4579 - (1UL << order));
4568 for (i = 0; i < (1 << order); i++) 4580 for (i = 0; i < (1 << order); i++)
4569 SetPageReserved((page+i)); 4581 SetPageReserved((page+i));
4570 pfn += (1 << order); 4582 pfn += (1 << order);
4571 } 4583 }
4572 spin_unlock_irqrestore(&zone->lock, flags); 4584 spin_unlock_irqrestore(&zone->lock, flags);
4573 } 4585 }
4574 #endif 4586 #endif
1 /* 1 /*
2 * linux/mm/vmscan.c 2 * linux/mm/vmscan.c
3 * 3 *
4 * Copyright (C) 1991, 1992, 1993, 1994 Linus Torvalds 4 * Copyright (C) 1991, 1992, 1993, 1994 Linus Torvalds
5 * 5 *
6 * Swap reorganised 29.12.95, Stephen Tweedie. 6 * Swap reorganised 29.12.95, Stephen Tweedie.
7 * kswapd added: 7.1.96 sct 7 * kswapd added: 7.1.96 sct
8 * Removed kswapd_ctl limits, and swap out as many pages as needed 8 * Removed kswapd_ctl limits, and swap out as many pages as needed
9 * to bring the system back to freepages.high: 2.4.97, Rik van Riel. 9 * to bring the system back to freepages.high: 2.4.97, Rik van Riel.
10 * Zone aware kswapd started 02/00, Kanoj Sarcar (kanoj@sgi.com). 10 * Zone aware kswapd started 02/00, Kanoj Sarcar (kanoj@sgi.com).
11 * Multiqueue VM started 5.8.00, Rik van Riel. 11 * Multiqueue VM started 5.8.00, Rik van Riel.
12 */ 12 */
13 13
14 #include <linux/mm.h> 14 #include <linux/mm.h>
15 #include <linux/module.h> 15 #include <linux/module.h>
16 #include <linux/slab.h> 16 #include <linux/slab.h>
17 #include <linux/kernel_stat.h> 17 #include <linux/kernel_stat.h>
18 #include <linux/swap.h> 18 #include <linux/swap.h>
19 #include <linux/pagemap.h> 19 #include <linux/pagemap.h>
20 #include <linux/init.h> 20 #include <linux/init.h>
21 #include <linux/highmem.h> 21 #include <linux/highmem.h>
22 #include <linux/vmstat.h> 22 #include <linux/vmstat.h>
23 #include <linux/file.h> 23 #include <linux/file.h>
24 #include <linux/writeback.h> 24 #include <linux/writeback.h>
25 #include <linux/blkdev.h> 25 #include <linux/blkdev.h>
26 #include <linux/buffer_head.h> /* for try_to_release_page(), 26 #include <linux/buffer_head.h> /* for try_to_release_page(),
27 buffer_heads_over_limit */ 27 buffer_heads_over_limit */
28 #include <linux/mm_inline.h> 28 #include <linux/mm_inline.h>
29 #include <linux/pagevec.h> 29 #include <linux/pagevec.h>
30 #include <linux/backing-dev.h> 30 #include <linux/backing-dev.h>
31 #include <linux/rmap.h> 31 #include <linux/rmap.h>
32 #include <linux/topology.h> 32 #include <linux/topology.h>
33 #include <linux/cpu.h> 33 #include <linux/cpu.h>
34 #include <linux/cpuset.h> 34 #include <linux/cpuset.h>
35 #include <linux/notifier.h> 35 #include <linux/notifier.h>
36 #include <linux/rwsem.h> 36 #include <linux/rwsem.h>
37 #include <linux/delay.h> 37 #include <linux/delay.h>
38 #include <linux/kthread.h> 38 #include <linux/kthread.h>
39 #include <linux/freezer.h> 39 #include <linux/freezer.h>
40 #include <linux/memcontrol.h> 40 #include <linux/memcontrol.h>
41 41
42 #include <asm/tlbflush.h> 42 #include <asm/tlbflush.h>
43 #include <asm/div64.h> 43 #include <asm/div64.h>
44 44
45 #include <linux/swapops.h> 45 #include <linux/swapops.h>
46 46
47 #include "internal.h" 47 #include "internal.h"
48 48
49 struct scan_control { 49 struct scan_control {
50 /* Incremented by the number of inactive pages that were scanned */ 50 /* Incremented by the number of inactive pages that were scanned */
51 unsigned long nr_scanned; 51 unsigned long nr_scanned;
52 52
53 /* This context's GFP mask */ 53 /* This context's GFP mask */
54 gfp_t gfp_mask; 54 gfp_t gfp_mask;
55 55
56 int may_writepage; 56 int may_writepage;
57 57
58 /* Can pages be swapped as part of reclaim? */ 58 /* Can pages be swapped as part of reclaim? */
59 int may_swap; 59 int may_swap;
60 60
61 /* This context's SWAP_CLUSTER_MAX. If freeing memory for 61 /* This context's SWAP_CLUSTER_MAX. If freeing memory for
62 * suspend, we effectively ignore SWAP_CLUSTER_MAX. 62 * suspend, we effectively ignore SWAP_CLUSTER_MAX.
63 * In this context, it doesn't matter that we scan the 63 * In this context, it doesn't matter that we scan the
64 * whole list at once. */ 64 * whole list at once. */
65 int swap_cluster_max; 65 int swap_cluster_max;
66 66
67 int swappiness; 67 int swappiness;
68 68
69 int all_unreclaimable; 69 int all_unreclaimable;
70 70
71 int order; 71 int order;
72 72
73 /* Which cgroup do we reclaim from */ 73 /* Which cgroup do we reclaim from */
74 struct mem_cgroup *mem_cgroup; 74 struct mem_cgroup *mem_cgroup;
75 75
76 /* Pluggable isolate pages callback */ 76 /* Pluggable isolate pages callback */
77 unsigned long (*isolate_pages)(unsigned long nr, struct list_head *dst, 77 unsigned long (*isolate_pages)(unsigned long nr, struct list_head *dst,
78 unsigned long *scanned, int order, int mode, 78 unsigned long *scanned, int order, int mode,
79 struct zone *z, struct mem_cgroup *mem_cont, 79 struct zone *z, struct mem_cgroup *mem_cont,
80 int active); 80 int active);
81 }; 81 };
82 82
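A hedged sketch of how a direct-reclaim entry point might populate this structure; the values are illustrative, not a copy of the real initialization elsewhere in this file (isolate_pages_global is the global-LRU callback defined further down).

/* Illustrative scan_control setup for a hypothetical direct reclaim of a
 * high-order allocation; the exact values used by the real entry points
 * may differ. */
struct scan_control sc = {
	.gfp_mask         = GFP_KERNEL,
	.may_writepage    = 1,
	.may_swap         = 1,
	.swap_cluster_max = SWAP_CLUSTER_MAX,
	.swappiness       = vm_swappiness,
	.order            = 4,			/* > PAGE_ALLOC_COSTLY_ORDER */
	.mem_cgroup       = NULL,		/* global LRU scan */
	.isolate_pages    = isolate_pages_global,
};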
83 #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru)) 83 #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
84 84
85 #ifdef ARCH_HAS_PREFETCH 85 #ifdef ARCH_HAS_PREFETCH
86 #define prefetch_prev_lru_page(_page, _base, _field) \ 86 #define prefetch_prev_lru_page(_page, _base, _field) \
87 do { \ 87 do { \
88 if ((_page)->lru.prev != _base) { \ 88 if ((_page)->lru.prev != _base) { \
89 struct page *prev; \ 89 struct page *prev; \
90 \ 90 \
91 prev = lru_to_page(&(_page->lru)); \ 91 prev = lru_to_page(&(_page->lru)); \
92 prefetch(&prev->_field); \ 92 prefetch(&prev->_field); \
93 } \ 93 } \
94 } while (0) 94 } while (0)
95 #else 95 #else
96 #define prefetch_prev_lru_page(_page, _base, _field) do { } while (0) 96 #define prefetch_prev_lru_page(_page, _base, _field) do { } while (0)
97 #endif 97 #endif
98 98
99 #ifdef ARCH_HAS_PREFETCHW 99 #ifdef ARCH_HAS_PREFETCHW
100 #define prefetchw_prev_lru_page(_page, _base, _field) \ 100 #define prefetchw_prev_lru_page(_page, _base, _field) \
101 do { \ 101 do { \
102 if ((_page)->lru.prev != _base) { \ 102 if ((_page)->lru.prev != _base) { \
103 struct page *prev; \ 103 struct page *prev; \
104 \ 104 \
105 prev = lru_to_page(&(_page->lru)); \ 105 prev = lru_to_page(&(_page->lru)); \
106 prefetchw(&prev->_field); \ 106 prefetchw(&prev->_field); \
107 } \ 107 } \
108 } while (0) 108 } while (0)
109 #else 109 #else
110 #define prefetchw_prev_lru_page(_page, _base, _field) do { } while (0) 110 #define prefetchw_prev_lru_page(_page, _base, _field) do { } while (0)
111 #endif 111 #endif
112 112
113 /* 113 /*
114 * From 0 .. 100. Higher means more swappy. 114 * From 0 .. 100. Higher means more swappy.
115 */ 115 */
116 int vm_swappiness = 60; 116 int vm_swappiness = 60;
117 long vm_total_pages; /* The total number of pages which the VM controls */ 117 long vm_total_pages; /* The total number of pages which the VM controls */
118 118
119 static LIST_HEAD(shrinker_list); 119 static LIST_HEAD(shrinker_list);
120 static DECLARE_RWSEM(shrinker_rwsem); 120 static DECLARE_RWSEM(shrinker_rwsem);
121 121
122 #ifdef CONFIG_CGROUP_MEM_RES_CTLR 122 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
123 #define scan_global_lru(sc) (!(sc)->mem_cgroup) 123 #define scan_global_lru(sc) (!(sc)->mem_cgroup)
124 #else 124 #else
125 #define scan_global_lru(sc) (1) 125 #define scan_global_lru(sc) (1)
126 #endif 126 #endif
127 127
128 /* 128 /*
129 * Add a shrinker callback to be called from the vm 129 * Add a shrinker callback to be called from the vm
130 */ 130 */
131 void register_shrinker(struct shrinker *shrinker) 131 void register_shrinker(struct shrinker *shrinker)
132 { 132 {
133 shrinker->nr = 0; 133 shrinker->nr = 0;
134 down_write(&shrinker_rwsem); 134 down_write(&shrinker_rwsem);
135 list_add_tail(&shrinker->list, &shrinker_list); 135 list_add_tail(&shrinker->list, &shrinker_list);
136 up_write(&shrinker_rwsem); 136 up_write(&shrinker_rwsem);
137 } 137 }
138 EXPORT_SYMBOL(register_shrinker); 138 EXPORT_SYMBOL(register_shrinker);
139 139
140 /* 140 /*
141 * Remove one 141 * Remove one
142 */ 142 */
143 void unregister_shrinker(struct shrinker *shrinker) 143 void unregister_shrinker(struct shrinker *shrinker)
144 { 144 {
145 down_write(&shrinker_rwsem); 145 down_write(&shrinker_rwsem);
146 list_del(&shrinker->list); 146 list_del(&shrinker->list);
147 up_write(&shrinker_rwsem); 147 up_write(&shrinker_rwsem);
148 } 148 }
149 EXPORT_SYMBOL(unregister_shrinker); 149 EXPORT_SYMBOL(unregister_shrinker);
150 150
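A hedged sketch of a shrinker that fits the (*shrink)(nr_to_scan, gfp_mask) convention shrink_slab() relies on below; my_cache_prune() and my_cache_count() are hypothetical helpers, and DEFAULT_SEEKS is assumed to be the stock seeks value from linux/mm.h.

/* With nr_to_scan == 0 the callback only reports how many objects could
 * be freed; otherwise it prunes up to nr_to_scan objects and reports what
 * remains, or -1 if it cannot make progress right now. */
static int my_cache_shrink(int nr_to_scan, gfp_t gfp_mask)
{
	if (nr_to_scan) {
		if (!(gfp_mask & __GFP_FS))
			return -1;
		my_cache_prune(nr_to_scan);	/* hypothetical */
	}
	return my_cache_count();		/* hypothetical */
}

static struct shrinker my_cache_shrinker = {
	.shrink = my_cache_shrink,
	.seeks	= DEFAULT_SEEKS,	/* assumed stock value */
};

/* Registered once at init time: register_shrinker(&my_cache_shrinker); */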
151 #define SHRINK_BATCH 128 151 #define SHRINK_BATCH 128
152 /* 152 /*
153 * Call the shrink functions to age shrinkable caches 153 * Call the shrink functions to age shrinkable caches
154 * 154 *
155 * Here we assume it costs one seek to replace a lru page and that it also 155 * Here we assume it costs one seek to replace a lru page and that it also
156 * takes a seek to recreate a cache object. With this in mind we age equal 156 * takes a seek to recreate a cache object. With this in mind we age equal
157 * percentages of the lru and ageable caches. This should balance the seeks 157 * percentages of the lru and ageable caches. This should balance the seeks
158 * generated by these structures. 158 * generated by these structures.
159 * 159 *
160 * If the vm encountered mapped pages on the LRU it increases the pressure on 160 * If the vm encountered mapped pages on the LRU it increases the pressure on
161 * slab to avoid swapping. 161 * slab to avoid swapping.
162 * 162 *
163 * We do weird things to avoid (scanned*seeks*entries) overflowing 32 bits. 163 * We do weird things to avoid (scanned*seeks*entries) overflowing 32 bits.
164 * 164 *
165 * `lru_pages' represents the number of on-LRU pages in all the zones which 165 * `lru_pages' represents the number of on-LRU pages in all the zones which
166 * are eligible for the caller's allocation attempt. It is used for balancing 166 * are eligible for the caller's allocation attempt. It is used for balancing
167 * slab reclaim versus page reclaim. 167 * slab reclaim versus page reclaim.
168 * 168 *
169 * Returns the number of slab objects which we shrunk. 169 * Returns the number of slab objects which we shrunk.
170 */ 170 */
171 unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask, 171 unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
172 unsigned long lru_pages) 172 unsigned long lru_pages)
173 { 173 {
174 struct shrinker *shrinker; 174 struct shrinker *shrinker;
175 unsigned long ret = 0; 175 unsigned long ret = 0;
176 176
177 if (scanned == 0) 177 if (scanned == 0)
178 scanned = SWAP_CLUSTER_MAX; 178 scanned = SWAP_CLUSTER_MAX;
179 179
180 if (!down_read_trylock(&shrinker_rwsem)) 180 if (!down_read_trylock(&shrinker_rwsem))
181 return 1; /* Assume we'll be able to shrink next time */ 181 return 1; /* Assume we'll be able to shrink next time */
182 182
183 list_for_each_entry(shrinker, &shrinker_list, list) { 183 list_for_each_entry(shrinker, &shrinker_list, list) {
184 unsigned long long delta; 184 unsigned long long delta;
185 unsigned long total_scan; 185 unsigned long total_scan;
186 unsigned long max_pass = (*shrinker->shrink)(0, gfp_mask); 186 unsigned long max_pass = (*shrinker->shrink)(0, gfp_mask);
187 187
188 delta = (4 * scanned) / shrinker->seeks; 188 delta = (4 * scanned) / shrinker->seeks;
189 delta *= max_pass; 189 delta *= max_pass;
190 do_div(delta, lru_pages + 1); 190 do_div(delta, lru_pages + 1);
191 shrinker->nr += delta; 191 shrinker->nr += delta;
192 if (shrinker->nr < 0) { 192 if (shrinker->nr < 0) {
193 printk(KERN_ERR "%s: nr=%ld\n", 193 printk(KERN_ERR "%s: nr=%ld\n",
194 __FUNCTION__, shrinker->nr); 194 __FUNCTION__, shrinker->nr);
195 shrinker->nr = max_pass; 195 shrinker->nr = max_pass;
196 } 196 }
197 197
198 /* 198 /*
199 * Avoid the risk of looping forever due to a too-large nr value: 199 * Avoid the risk of looping forever due to a too-large nr value:
200 * never try to free more than twice the estimated number of 200 * never try to free more than twice the estimated number of
201 * freeable entries. 201 * freeable entries.
202 */ 202 */
203 if (shrinker->nr > max_pass * 2) 203 if (shrinker->nr > max_pass * 2)
204 shrinker->nr = max_pass * 2; 204 shrinker->nr = max_pass * 2;
205 205
206 total_scan = shrinker->nr; 206 total_scan = shrinker->nr;
207 shrinker->nr = 0; 207 shrinker->nr = 0;
208 208
209 while (total_scan >= SHRINK_BATCH) { 209 while (total_scan >= SHRINK_BATCH) {
210 long this_scan = SHRINK_BATCH; 210 long this_scan = SHRINK_BATCH;
211 int shrink_ret; 211 int shrink_ret;
212 int nr_before; 212 int nr_before;
213 213
214 nr_before = (*shrinker->shrink)(0, gfp_mask); 214 nr_before = (*shrinker->shrink)(0, gfp_mask);
215 shrink_ret = (*shrinker->shrink)(this_scan, gfp_mask); 215 shrink_ret = (*shrinker->shrink)(this_scan, gfp_mask);
216 if (shrink_ret == -1) 216 if (shrink_ret == -1)
217 break; 217 break;
218 if (shrink_ret < nr_before) 218 if (shrink_ret < nr_before)
219 ret += nr_before - shrink_ret; 219 ret += nr_before - shrink_ret;
220 count_vm_events(SLABS_SCANNED, this_scan); 220 count_vm_events(SLABS_SCANNED, this_scan);
221 total_scan -= this_scan; 221 total_scan -= this_scan;
222 222
223 cond_resched(); 223 cond_resched();
224 } 224 }
225 225
226 shrinker->nr += total_scan; 226 shrinker->nr += total_scan;
227 } 227 }
228 up_read(&shrinker_rwsem); 228 up_read(&shrinker_rwsem);
229 return ret; 229 return ret;
230 } 230 }
231 231
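A worked example of the pressure calculation above, with illustrative numbers rather than measured ones.

#include <stdio.h>

int main(void)
{
	unsigned long scanned   = 128;		/* LRU pages scanned this round */
	unsigned long seeks     = 2;		/* shrinker->seeks */
	unsigned long max_pass  = 10000;	/* freeable objects reported */
	unsigned long lru_pages = 100000;

	unsigned long long delta = (4 * scanned) / seeks;	/* 256 */
	delta *= max_pass;					/* 2,560,000 */
	delta /= lru_pages + 1;					/* ~25 objects */

	/* ~25 is below SHRINK_BATCH (128), so it simply accumulates in
	 * shrinker->nr until enough pressure builds to scan a batch. */
	printf("delta = %llu\n", delta);
	return 0;
}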
232 /* Called without lock on whether page is mapped, so answer is unstable */ 232 /* Called without lock on whether page is mapped, so answer is unstable */
233 static inline int page_mapping_inuse(struct page *page) 233 static inline int page_mapping_inuse(struct page *page)
234 { 234 {
235 struct address_space *mapping; 235 struct address_space *mapping;
236 236
237 /* Page is in somebody's page tables. */ 237 /* Page is in somebody's page tables. */
238 if (page_mapped(page)) 238 if (page_mapped(page))
239 return 1; 239 return 1;
240 240
241 /* Be more reluctant to reclaim swapcache than pagecache */ 241 /* Be more reluctant to reclaim swapcache than pagecache */
242 if (PageSwapCache(page)) 242 if (PageSwapCache(page))
243 return 1; 243 return 1;
244 244
245 mapping = page_mapping(page); 245 mapping = page_mapping(page);
246 if (!mapping) 246 if (!mapping)
247 return 0; 247 return 0;
248 248
249 /* File is mmap'd by somebody? */ 249 /* File is mmap'd by somebody? */
250 return mapping_mapped(mapping); 250 return mapping_mapped(mapping);
251 } 251 }
252 252
253 static inline int is_page_cache_freeable(struct page *page) 253 static inline int is_page_cache_freeable(struct page *page)
254 { 254 {
255 return page_count(page) - !!PagePrivate(page) == 2; 255 return page_count(page) - !!PagePrivate(page) == 2;
256 } 256 }
257 257
258 static int may_write_to_queue(struct backing_dev_info *bdi) 258 static int may_write_to_queue(struct backing_dev_info *bdi)
259 { 259 {
260 if (current->flags & PF_SWAPWRITE) 260 if (current->flags & PF_SWAPWRITE)
261 return 1; 261 return 1;
262 if (!bdi_write_congested(bdi)) 262 if (!bdi_write_congested(bdi))
263 return 1; 263 return 1;
264 if (bdi == current->backing_dev_info) 264 if (bdi == current->backing_dev_info)
265 return 1; 265 return 1;
266 return 0; 266 return 0;
267 } 267 }
268 268
269 /* 269 /*
270 * We detected a synchronous write error writing a page out. Probably 270 * We detected a synchronous write error writing a page out. Probably
271 * -ENOSPC. We need to propagate that into the address_space for a subsequent 271 * -ENOSPC. We need to propagate that into the address_space for a subsequent
272 * fsync(), msync() or close(). 272 * fsync(), msync() or close().
273 * 273 *
274 * The tricky part is that after writepage we cannot touch the mapping: nothing 274 * The tricky part is that after writepage we cannot touch the mapping: nothing
275 * prevents it from being freed up. But we have a ref on the page and once 275 * prevents it from being freed up. But we have a ref on the page and once
276 * that page is locked, the mapping is pinned. 276 * that page is locked, the mapping is pinned.
277 * 277 *
278 * We're allowed to run sleeping lock_page() here because we know the caller has 278 * We're allowed to run sleeping lock_page() here because we know the caller has
279 * __GFP_FS. 279 * __GFP_FS.
280 */ 280 */
281 static void handle_write_error(struct address_space *mapping, 281 static void handle_write_error(struct address_space *mapping,
282 struct page *page, int error) 282 struct page *page, int error)
283 { 283 {
284 lock_page(page); 284 lock_page(page);
285 if (page_mapping(page) == mapping) 285 if (page_mapping(page) == mapping)
286 mapping_set_error(mapping, error); 286 mapping_set_error(mapping, error);
287 unlock_page(page); 287 unlock_page(page);
288 } 288 }
289 289
290 /* Request for sync pageout. */ 290 /* Request for sync pageout. */
291 enum pageout_io { 291 enum pageout_io {
292 PAGEOUT_IO_ASYNC, 292 PAGEOUT_IO_ASYNC,
293 PAGEOUT_IO_SYNC, 293 PAGEOUT_IO_SYNC,
294 }; 294 };
295 295
296 /* possible outcome of pageout() */ 296 /* possible outcome of pageout() */
297 typedef enum { 297 typedef enum {
298 /* failed to write page out, page is locked */ 298 /* failed to write page out, page is locked */
299 PAGE_KEEP, 299 PAGE_KEEP,
300 /* move page to the active list, page is locked */ 300 /* move page to the active list, page is locked */
301 PAGE_ACTIVATE, 301 PAGE_ACTIVATE,
302 /* page has been sent to the disk successfully, page is unlocked */ 302 /* page has been sent to the disk successfully, page is unlocked */
303 PAGE_SUCCESS, 303 PAGE_SUCCESS,
304 /* page is clean and locked */ 304 /* page is clean and locked */
305 PAGE_CLEAN, 305 PAGE_CLEAN,
306 } pageout_t; 306 } pageout_t;
307 307
308 /* 308 /*
309 * pageout is called by shrink_page_list() for each dirty page. 309 * pageout is called by shrink_page_list() for each dirty page.
310 * Calls ->writepage(). 310 * Calls ->writepage().
311 */ 311 */
312 static pageout_t pageout(struct page *page, struct address_space *mapping, 312 static pageout_t pageout(struct page *page, struct address_space *mapping,
313 enum pageout_io sync_writeback) 313 enum pageout_io sync_writeback)
314 { 314 {
315 /* 315 /*
316 * If the page is dirty, only perform writeback if that write 316 * If the page is dirty, only perform writeback if that write
317 * will be non-blocking. To prevent this allocation from being 317 * will be non-blocking. To prevent this allocation from being
318 * stalled by pagecache activity. But note that there may be 318 * stalled by pagecache activity. But note that there may be
319 * stalls if we need to run get_block(). We could test 319 * stalls if we need to run get_block(). We could test
320 * PagePrivate for that. 320 * PagePrivate for that.
321 * 321 *
322 * If this process is currently in generic_file_write() against 322 * If this process is currently in generic_file_write() against
323 * this page's queue, we can perform writeback even if that 323 * this page's queue, we can perform writeback even if that
324 * will block. 324 * will block.
325 * 325 *
326 * If the page is swapcache, write it back even if that would 326 * If the page is swapcache, write it back even if that would
327 * block, for some throttling. This happens by accident, because 327 * block, for some throttling. This happens by accident, because
328 * swap_backing_dev_info is bust: it doesn't reflect the 328 * swap_backing_dev_info is bust: it doesn't reflect the
329 * congestion state of the swapdevs. Easy to fix, if needed. 329 * congestion state of the swapdevs. Easy to fix, if needed.
330 * See swapfile.c:page_queue_congested(). 330 * See swapfile.c:page_queue_congested().
331 */ 331 */
332 if (!is_page_cache_freeable(page)) 332 if (!is_page_cache_freeable(page))
333 return PAGE_KEEP; 333 return PAGE_KEEP;
334 if (!mapping) { 334 if (!mapping) {
335 /* 335 /*
336 * Some data journaling orphaned pages can have 336 * Some data journaling orphaned pages can have
337 * page->mapping == NULL while being dirty with clean buffers. 337 * page->mapping == NULL while being dirty with clean buffers.
338 */ 338 */
339 if (PagePrivate(page)) { 339 if (PagePrivate(page)) {
340 if (try_to_free_buffers(page)) { 340 if (try_to_free_buffers(page)) {
341 ClearPageDirty(page); 341 ClearPageDirty(page);
342 printk("%s: orphaned page\n", __FUNCTION__); 342 printk("%s: orphaned page\n", __FUNCTION__);
343 return PAGE_CLEAN; 343 return PAGE_CLEAN;
344 } 344 }
345 } 345 }
346 return PAGE_KEEP; 346 return PAGE_KEEP;
347 } 347 }
348 if (mapping->a_ops->writepage == NULL) 348 if (mapping->a_ops->writepage == NULL)
349 return PAGE_ACTIVATE; 349 return PAGE_ACTIVATE;
350 if (!may_write_to_queue(mapping->backing_dev_info)) 350 if (!may_write_to_queue(mapping->backing_dev_info))
351 return PAGE_KEEP; 351 return PAGE_KEEP;
352 352
353 if (clear_page_dirty_for_io(page)) { 353 if (clear_page_dirty_for_io(page)) {
354 int res; 354 int res;
355 struct writeback_control wbc = { 355 struct writeback_control wbc = {
356 .sync_mode = WB_SYNC_NONE, 356 .sync_mode = WB_SYNC_NONE,
357 .nr_to_write = SWAP_CLUSTER_MAX, 357 .nr_to_write = SWAP_CLUSTER_MAX,
358 .range_start = 0, 358 .range_start = 0,
359 .range_end = LLONG_MAX, 359 .range_end = LLONG_MAX,
360 .nonblocking = 1, 360 .nonblocking = 1,
361 .for_reclaim = 1, 361 .for_reclaim = 1,
362 }; 362 };
363 363
364 SetPageReclaim(page); 364 SetPageReclaim(page);
365 res = mapping->a_ops->writepage(page, &wbc); 365 res = mapping->a_ops->writepage(page, &wbc);
366 if (res < 0) 366 if (res < 0)
367 handle_write_error(mapping, page, res); 367 handle_write_error(mapping, page, res);
368 if (res == AOP_WRITEPAGE_ACTIVATE) { 368 if (res == AOP_WRITEPAGE_ACTIVATE) {
369 ClearPageReclaim(page); 369 ClearPageReclaim(page);
370 return PAGE_ACTIVATE; 370 return PAGE_ACTIVATE;
371 } 371 }
372 372
373 /* 373 /*
374 * Wait on writeback if requested to. This happens when 374 * Wait on writeback if requested to. This happens when
375 * direct reclaiming a large contiguous area and the 375 * direct reclaiming a large contiguous area and the
376 * first attempt to free a range of pages fails. 376 * first attempt to free a range of pages fails.
377 */ 377 */
378 if (PageWriteback(page) && sync_writeback == PAGEOUT_IO_SYNC) 378 if (PageWriteback(page) && sync_writeback == PAGEOUT_IO_SYNC)
379 wait_on_page_writeback(page); 379 wait_on_page_writeback(page);
380 380
381 if (!PageWriteback(page)) { 381 if (!PageWriteback(page)) {
382 /* synchronous write or broken a_ops? */ 382 /* synchronous write or broken a_ops? */
383 ClearPageReclaim(page); 383 ClearPageReclaim(page);
384 } 384 }
385 inc_zone_page_state(page, NR_VMSCAN_WRITE); 385 inc_zone_page_state(page, NR_VMSCAN_WRITE);
386 return PAGE_SUCCESS; 386 return PAGE_SUCCESS;
387 } 387 }
388 388
389 return PAGE_CLEAN; 389 return PAGE_CLEAN;
390 } 390 }
391 391
392 /* 392 /*
393 * Attempt to detach a locked page from its ->mapping. If it is dirty or if 393 * Attempt to detach a locked page from its ->mapping. If it is dirty or if
394 * someone else has a ref on the page, abort and return 0. If it was 394 * someone else has a ref on the page, abort and return 0. If it was
395 * successfully detached, return 1. Assumes the caller has a single ref on 395 * successfully detached, return 1. Assumes the caller has a single ref on
396 * this page. 396 * this page.
397 */ 397 */
398 int remove_mapping(struct address_space *mapping, struct page *page) 398 int remove_mapping(struct address_space *mapping, struct page *page)
399 { 399 {
400 BUG_ON(!PageLocked(page)); 400 BUG_ON(!PageLocked(page));
401 BUG_ON(mapping != page_mapping(page)); 401 BUG_ON(mapping != page_mapping(page));
402 402
403 write_lock_irq(&mapping->tree_lock); 403 write_lock_irq(&mapping->tree_lock);
404 /* 404 /*
405 * The non-racy check for a busy page. 405 * The non-racy check for a busy page.
406 * 406 *
407 * Must be careful with the order of the tests. When someone has 407 * Must be careful with the order of the tests. When someone has
408 * a ref to the page, it may be possible that they dirty it then 408 * a ref to the page, it may be possible that they dirty it then
409 * drop the reference. So if PageDirty is tested before page_count 409 * drop the reference. So if PageDirty is tested before page_count
410 * here, then the following race may occur: 410 * here, then the following race may occur:
411 * 411 *
412 * get_user_pages(&page); 412 * get_user_pages(&page);
413 * [user mapping goes away] 413 * [user mapping goes away]
414 * write_to(page); 414 * write_to(page);
415 * !PageDirty(page) [good] 415 * !PageDirty(page) [good]
416 * SetPageDirty(page); 416 * SetPageDirty(page);
417 * put_page(page); 417 * put_page(page);
418 * !page_count(page) [good, discard it] 418 * !page_count(page) [good, discard it]
419 * 419 *
420 * [oops, our write_to data is lost] 420 * [oops, our write_to data is lost]
421 * 421 *
422 * Reversing the order of the tests ensures such a situation cannot 422 * Reversing the order of the tests ensures such a situation cannot
423 * escape unnoticed. The smp_rmb is needed to ensure the page->flags 423 * escape unnoticed. The smp_rmb is needed to ensure the page->flags
424 * load is not satisfied before that of page->_count. 424 * load is not satisfied before that of page->_count.
425 * 425 *
426 * Note that if SetPageDirty is always performed via set_page_dirty, 426 * Note that if SetPageDirty is always performed via set_page_dirty,
427 * and thus under tree_lock, then this ordering is not required. 427 * and thus under tree_lock, then this ordering is not required.
428 */ 428 */
429 if (unlikely(page_count(page) != 2)) 429 if (unlikely(page_count(page) != 2))
430 goto cannot_free; 430 goto cannot_free;
431 smp_rmb(); 431 smp_rmb();
432 if (unlikely(PageDirty(page))) 432 if (unlikely(PageDirty(page)))
433 goto cannot_free; 433 goto cannot_free;
434 434
435 if (PageSwapCache(page)) { 435 if (PageSwapCache(page)) {
436 swp_entry_t swap = { .val = page_private(page) }; 436 swp_entry_t swap = { .val = page_private(page) };
437 __delete_from_swap_cache(page); 437 __delete_from_swap_cache(page);
438 write_unlock_irq(&mapping->tree_lock); 438 write_unlock_irq(&mapping->tree_lock);
439 swap_free(swap); 439 swap_free(swap);
440 __put_page(page); /* The pagecache ref */ 440 __put_page(page); /* The pagecache ref */
441 return 1; 441 return 1;
442 } 442 }
443 443
444 __remove_from_page_cache(page); 444 __remove_from_page_cache(page);
445 write_unlock_irq(&mapping->tree_lock); 445 write_unlock_irq(&mapping->tree_lock);
446 __put_page(page); 446 __put_page(page);
447 return 1; 447 return 1;
448 448
449 cannot_free: 449 cannot_free:
450 write_unlock_irq(&mapping->tree_lock); 450 write_unlock_irq(&mapping->tree_lock);
451 return 0; 451 return 0;
452 } 452 }
453 453
454 /* 454 /*
455 * shrink_page_list() returns the number of reclaimed pages 455 * shrink_page_list() returns the number of reclaimed pages
456 */ 456 */
457 static unsigned long shrink_page_list(struct list_head *page_list, 457 static unsigned long shrink_page_list(struct list_head *page_list,
458 struct scan_control *sc, 458 struct scan_control *sc,
459 enum pageout_io sync_writeback) 459 enum pageout_io sync_writeback)
460 { 460 {
461 LIST_HEAD(ret_pages); 461 LIST_HEAD(ret_pages);
462 struct pagevec freed_pvec; 462 struct pagevec freed_pvec;
463 int pgactivate = 0; 463 int pgactivate = 0;
464 unsigned long nr_reclaimed = 0; 464 unsigned long nr_reclaimed = 0;
465 465
466 cond_resched(); 466 cond_resched();
467 467
468 pagevec_init(&freed_pvec, 1); 468 pagevec_init(&freed_pvec, 1);
469 while (!list_empty(page_list)) { 469 while (!list_empty(page_list)) {
470 struct address_space *mapping; 470 struct address_space *mapping;
471 struct page *page; 471 struct page *page;
472 int may_enter_fs; 472 int may_enter_fs;
473 int referenced; 473 int referenced;
474 474
475 cond_resched(); 475 cond_resched();
476 476
477 page = lru_to_page(page_list); 477 page = lru_to_page(page_list);
478 list_del(&page->lru); 478 list_del(&page->lru);
479 479
480 if (TestSetPageLocked(page)) 480 if (TestSetPageLocked(page))
481 goto keep; 481 goto keep;
482 482
483 VM_BUG_ON(PageActive(page)); 483 VM_BUG_ON(PageActive(page));
484 484
485 sc->nr_scanned++; 485 sc->nr_scanned++;
486 486
487 if (!sc->may_swap && page_mapped(page)) 487 if (!sc->may_swap && page_mapped(page))
488 goto keep_locked; 488 goto keep_locked;
489 489
490 /* Double the slab pressure for mapped and swapcache pages */ 490 /* Double the slab pressure for mapped and swapcache pages */
491 if (page_mapped(page) || PageSwapCache(page)) 491 if (page_mapped(page) || PageSwapCache(page))
492 sc->nr_scanned++; 492 sc->nr_scanned++;
493 493
494 may_enter_fs = (sc->gfp_mask & __GFP_FS) || 494 may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
495 (PageSwapCache(page) && (sc->gfp_mask & __GFP_IO)); 495 (PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
496 496
497 if (PageWriteback(page)) { 497 if (PageWriteback(page)) {
498 /* 498 /*
499 * Synchronous reclaim is performed in two passes, 499 * Synchronous reclaim is performed in two passes,
500 * first an asynchronous pass over the list to 500 * first an asynchronous pass over the list to
501 * start parallel writeback, and a second synchronous 501 * start parallel writeback, and a second synchronous
502 * pass to wait for the IO to complete. Wait here 502 * pass to wait for the IO to complete. Wait here
503 * for any page for which writeback has already 503 * for any page for which writeback has already
504 * started. 504 * started.
505 */ 505 */
506 if (sync_writeback == PAGEOUT_IO_SYNC && may_enter_fs) 506 if (sync_writeback == PAGEOUT_IO_SYNC && may_enter_fs)
507 wait_on_page_writeback(page); 507 wait_on_page_writeback(page);
508 else 508 else
509 goto keep_locked; 509 goto keep_locked;
510 } 510 }
511 511
512 referenced = page_referenced(page, 1, sc->mem_cgroup); 512 referenced = page_referenced(page, 1, sc->mem_cgroup);
513 /* In active use or really unfreeable? Activate it. */ 513 /* In active use or really unfreeable? Activate it. */
514 if (sc->order <= PAGE_ALLOC_COSTLY_ORDER && 514 if (sc->order <= PAGE_ALLOC_COSTLY_ORDER &&
515 referenced && page_mapping_inuse(page)) 515 referenced && page_mapping_inuse(page))
516 goto activate_locked; 516 goto activate_locked;
517 517
518 #ifdef CONFIG_SWAP 518 #ifdef CONFIG_SWAP
519 /* 519 /*
520 * Anonymous process memory has backing store? 520 * Anonymous process memory has backing store?
521 * Try to allocate it some swap space here. 521 * Try to allocate it some swap space here.
522 */ 522 */
523 if (PageAnon(page) && !PageSwapCache(page)) 523 if (PageAnon(page) && !PageSwapCache(page))
524 if (!add_to_swap(page, GFP_ATOMIC)) 524 if (!add_to_swap(page, GFP_ATOMIC))
525 goto activate_locked; 525 goto activate_locked;
526 #endif /* CONFIG_SWAP */ 526 #endif /* CONFIG_SWAP */
527 527
528 mapping = page_mapping(page); 528 mapping = page_mapping(page);
529 529
530 /* 530 /*
531 * The page is mapped into the page tables of one or more 531 * The page is mapped into the page tables of one or more
532 * processes. Try to unmap it here. 532 * processes. Try to unmap it here.
533 */ 533 */
534 if (page_mapped(page) && mapping) { 534 if (page_mapped(page) && mapping) {
535 switch (try_to_unmap(page, 0)) { 535 switch (try_to_unmap(page, 0)) {
536 case SWAP_FAIL: 536 case SWAP_FAIL:
537 goto activate_locked; 537 goto activate_locked;
538 case SWAP_AGAIN: 538 case SWAP_AGAIN:
539 goto keep_locked; 539 goto keep_locked;
540 case SWAP_SUCCESS: 540 case SWAP_SUCCESS:
541 ; /* try to free the page below */ 541 ; /* try to free the page below */
542 } 542 }
543 } 543 }
544 544
545 if (PageDirty(page)) { 545 if (PageDirty(page)) {
546 if (sc->order <= PAGE_ALLOC_COSTLY_ORDER && referenced) 546 if (sc->order <= PAGE_ALLOC_COSTLY_ORDER && referenced)
547 goto keep_locked; 547 goto keep_locked;
548 if (!may_enter_fs) 548 if (!may_enter_fs)
549 goto keep_locked; 549 goto keep_locked;
550 if (!sc->may_writepage) 550 if (!sc->may_writepage)
551 goto keep_locked; 551 goto keep_locked;
552 552
553 /* Page is dirty, try to write it out here */ 553 /* Page is dirty, try to write it out here */
554 switch (pageout(page, mapping, sync_writeback)) { 554 switch (pageout(page, mapping, sync_writeback)) {
555 case PAGE_KEEP: 555 case PAGE_KEEP:
556 goto keep_locked; 556 goto keep_locked;
557 case PAGE_ACTIVATE: 557 case PAGE_ACTIVATE:
558 goto activate_locked; 558 goto activate_locked;
559 case PAGE_SUCCESS: 559 case PAGE_SUCCESS:
560 if (PageWriteback(page) || PageDirty(page)) 560 if (PageWriteback(page) || PageDirty(page))
561 goto keep; 561 goto keep;
562 /* 562 /*
563 * A synchronous write - probably a ramdisk. Go 563 * A synchronous write - probably a ramdisk. Go
564 * ahead and try to reclaim the page. 564 * ahead and try to reclaim the page.
565 */ 565 */
566 if (TestSetPageLocked(page)) 566 if (TestSetPageLocked(page))
567 goto keep; 567 goto keep;
568 if (PageDirty(page) || PageWriteback(page)) 568 if (PageDirty(page) || PageWriteback(page))
569 goto keep_locked; 569 goto keep_locked;
570 mapping = page_mapping(page); 570 mapping = page_mapping(page);
571 case PAGE_CLEAN: 571 case PAGE_CLEAN:
572 ; /* try to free the page below */ 572 ; /* try to free the page below */
573 } 573 }
574 } 574 }
575 575
576 /* 576 /*
577 * If the page has buffers, try to free the buffer mappings 577 * If the page has buffers, try to free the buffer mappings
578 * associated with this page. If we succeed we try to free 578 * associated with this page. If we succeed we try to free
579 * the page as well. 579 * the page as well.
580 * 580 *
581 * We do this even if the page is PageDirty(). 581 * We do this even if the page is PageDirty().
582 * try_to_release_page() does not perform I/O, but it is 582 * try_to_release_page() does not perform I/O, but it is
583 * possible for a page to have PageDirty set, but it is actually 583 * possible for a page to have PageDirty set, but it is actually
584 * clean (all its buffers are clean). This happens if the 584 * clean (all its buffers are clean). This happens if the
585 * buffers were written out directly, with submit_bh(). ext3 585 * buffers were written out directly, with submit_bh(). ext3
586 * will do this, as well as the blockdev mapping. 586 * will do this, as well as the blockdev mapping.
587 * try_to_release_page() will discover that cleanness and will 587 * try_to_release_page() will discover that cleanness and will
588 * drop the buffers and mark the page clean - it can be freed. 588 * drop the buffers and mark the page clean - it can be freed.
589 * 589 *
590 * Rarely, pages can have buffers and no ->mapping. These are 590 * Rarely, pages can have buffers and no ->mapping. These are
591 * the pages which were not successfully invalidated in 591 * the pages which were not successfully invalidated in
592 * truncate_complete_page(). We try to drop those buffers here 592 * truncate_complete_page(). We try to drop those buffers here
593 * and if that worked, and the page is no longer mapped into 593 * and if that worked, and the page is no longer mapped into
594 * process address space (page_count == 1) it can be freed. 594 * process address space (page_count == 1) it can be freed.
595 * Otherwise, leave the page on the LRU so it is swappable. 595 * Otherwise, leave the page on the LRU so it is swappable.
596 */ 596 */
597 if (PagePrivate(page)) { 597 if (PagePrivate(page)) {
598 if (!try_to_release_page(page, sc->gfp_mask)) 598 if (!try_to_release_page(page, sc->gfp_mask))
599 goto activate_locked; 599 goto activate_locked;
600 if (!mapping && page_count(page) == 1) 600 if (!mapping && page_count(page) == 1)
601 goto free_it; 601 goto free_it;
602 } 602 }
603 603
604 if (!mapping || !remove_mapping(mapping, page)) 604 if (!mapping || !remove_mapping(mapping, page))
605 goto keep_locked; 605 goto keep_locked;
606 606
607 free_it: 607 free_it:
608 unlock_page(page); 608 unlock_page(page);
609 nr_reclaimed++; 609 nr_reclaimed++;
610 if (!pagevec_add(&freed_pvec, page)) 610 if (!pagevec_add(&freed_pvec, page))
611 __pagevec_release_nonlru(&freed_pvec); 611 __pagevec_release_nonlru(&freed_pvec);
612 continue; 612 continue;
613 613
614 activate_locked: 614 activate_locked:
615 SetPageActive(page); 615 SetPageActive(page);
616 pgactivate++; 616 pgactivate++;
617 keep_locked: 617 keep_locked:
618 unlock_page(page); 618 unlock_page(page);
619 keep: 619 keep:
620 list_add(&page->lru, &ret_pages); 620 list_add(&page->lru, &ret_pages);
621 VM_BUG_ON(PageLRU(page)); 621 VM_BUG_ON(PageLRU(page));
622 } 622 }
623 list_splice(&ret_pages, page_list); 623 list_splice(&ret_pages, page_list);
624 if (pagevec_count(&freed_pvec)) 624 if (pagevec_count(&freed_pvec))
625 __pagevec_release_nonlru(&freed_pvec); 625 __pagevec_release_nonlru(&freed_pvec);
626 count_vm_events(PGACTIVATE, pgactivate); 626 count_vm_events(PGACTIVATE, pgactivate);
627 return nr_reclaimed; 627 return nr_reclaimed;
628 } 628 }
629 629
630 /* LRU Isolation modes. */ 630 /* LRU Isolation modes. */
631 #define ISOLATE_INACTIVE 0 /* Isolate inactive pages. */ 631 #define ISOLATE_INACTIVE 0 /* Isolate inactive pages. */
632 #define ISOLATE_ACTIVE 1 /* Isolate active pages. */ 632 #define ISOLATE_ACTIVE 1 /* Isolate active pages. */
633 #define ISOLATE_BOTH 2 /* Isolate both active and inactive pages. */ 633 #define ISOLATE_BOTH 2 /* Isolate both active and inactive pages. */
634 634
635 /* 635 /*
636 * Attempt to remove the specified page from its LRU. Only take this page 636 * Attempt to remove the specified page from its LRU. Only take this page
637 * if it is of the appropriate PageActive status. Pages which are being 637 * if it is of the appropriate PageActive status. Pages which are being
638 * freed elsewhere are also ignored. 638 * freed elsewhere are also ignored.
639 * 639 *
640 * page: page to consider 640 * page: page to consider
641 * mode: one of the LRU isolation modes defined above 641 * mode: one of the LRU isolation modes defined above
642 * 642 *
643 * returns 0 on success, -ve errno on failure. 643 * returns 0 on success, -ve errno on failure.
644 */ 644 */
645 int __isolate_lru_page(struct page *page, int mode) 645 int __isolate_lru_page(struct page *page, int mode)
646 { 646 {
647 int ret = -EINVAL; 647 int ret = -EINVAL;
648 648
649 /* Only take pages on the LRU. */ 649 /* Only take pages on the LRU. */
650 if (!PageLRU(page)) 650 if (!PageLRU(page))
651 return ret; 651 return ret;
652 652
653 /* 653 /*
654 * When checking the active state, we need to be sure we are 654 * When checking the active state, we need to be sure we are
655 * dealing with comparable boolean values. Take the logical not 655 * dealing with comparable boolean values. Take the logical not
656 * of each. 656 * of each.
657 */ 657 */
658 if (mode != ISOLATE_BOTH && (!PageActive(page) != !mode)) 658 if (mode != ISOLATE_BOTH && (!PageActive(page) != !mode))
659 return ret; 659 return ret;
660 660
661 ret = -EBUSY; 661 ret = -EBUSY;
662 if (likely(get_page_unless_zero(page))) { 662 if (likely(get_page_unless_zero(page))) {
663 /* 663 /*
664 * Be careful not to clear PageLRU until after we're 664 * Be careful not to clear PageLRU until after we're
665 * sure the page is not being freed elsewhere -- the 665 * sure the page is not being freed elsewhere -- the
666 * page release code relies on it. 666 * page release code relies on it.
667 */ 667 */
668 ClearPageLRU(page); 668 ClearPageLRU(page);
669 ret = 0; 669 ret = 0;
670 } 670 }
671 671
672 return ret; 672 return ret;
673 } 673 }
674 674
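A tiny user-space demonstration of why the comparison above is written as !PageActive(page) != !mode rather than a direct comparison; 0x40 is just an arbitrary stand-in for a non-boolean flag-test result.

#include <stdio.h>

int main(void)
{
	int active = 0x40;	/* "page is active": non-zero, but not 1 */
	int mode = 1;		/* ISOLATE_ACTIVE */

	printf("naive mismatch : %d\n", active != mode);	/* 1: would wrongly skip the page */
	printf("normalised     : %d\n", !active != !mode);	/* 0: states match, page is taken */
	return 0;
}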
675 /* 675 /*
676 * zone->lru_lock is heavily contended. Some of the functions that 676 * zone->lru_lock is heavily contended. Some of the functions that
677 * shrink the lists perform better by taking out a batch of pages 677 * shrink the lists perform better by taking out a batch of pages
678 * and working on them outside the LRU lock. 678 * and working on them outside the LRU lock.
679 * 679 *
680 * For pagecache intensive workloads, this function is the hottest 680 * For pagecache intensive workloads, this function is the hottest
681 * spot in the kernel (apart from copy_*_user functions). 681 * spot in the kernel (apart from copy_*_user functions).
682 * 682 *
683 * Appropriate locks must be held before calling this function. 683 * Appropriate locks must be held before calling this function.
684 * 684 *
685 * @nr_to_scan: The number of pages to look through on the list. 685 * @nr_to_scan: The number of pages to look through on the list.
686 * @src: The LRU list to pull pages off. 686 * @src: The LRU list to pull pages off.
687 * @dst: The temp list to put pages on to. 687 * @dst: The temp list to put pages on to.
688 * @scanned: The number of pages that were scanned. 688 * @scanned: The number of pages that were scanned.
689 * @order: The caller's attempted allocation order 689 * @order: The caller's attempted allocation order
690 * @mode: One of the LRU isolation modes 690 * @mode: One of the LRU isolation modes
691 * 691 *
692 * returns how many pages were moved onto *@dst. 692 * returns how many pages were moved onto *@dst.
693 */ 693 */
694 static unsigned long isolate_lru_pages(unsigned long nr_to_scan, 694 static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
695 struct list_head *src, struct list_head *dst, 695 struct list_head *src, struct list_head *dst,
696 unsigned long *scanned, int order, int mode) 696 unsigned long *scanned, int order, int mode)
697 { 697 {
698 unsigned long nr_taken = 0; 698 unsigned long nr_taken = 0;
699 unsigned long scan; 699 unsigned long scan;
700 700
701 for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) { 701 for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
702 struct page *page; 702 struct page *page;
703 unsigned long pfn; 703 unsigned long pfn;
704 unsigned long end_pfn; 704 unsigned long end_pfn;
705 unsigned long page_pfn; 705 unsigned long page_pfn;
706 int zone_id; 706 int zone_id;
707 707
708 page = lru_to_page(src); 708 page = lru_to_page(src);
709 prefetchw_prev_lru_page(page, src, flags); 709 prefetchw_prev_lru_page(page, src, flags);
710 710
711 VM_BUG_ON(!PageLRU(page)); 711 VM_BUG_ON(!PageLRU(page));
712 712
713 switch (__isolate_lru_page(page, mode)) { 713 switch (__isolate_lru_page(page, mode)) {
714 case 0: 714 case 0:
715 list_move(&page->lru, dst); 715 list_move(&page->lru, dst);
716 nr_taken++; 716 nr_taken++;
717 break; 717 break;
718 718
719 case -EBUSY: 719 case -EBUSY:
720 /* else it is being freed elsewhere */ 720 /* else it is being freed elsewhere */
721 list_move(&page->lru, src); 721 list_move(&page->lru, src);
722 continue; 722 continue;
723 723
724 default: 724 default:
725 BUG(); 725 BUG();
726 } 726 }
727 727
728 if (!order) 728 if (!order)
729 continue; 729 continue;
730 730
731 /* 731 /*
732 * Attempt to take all pages in the order aligned region 732 * Attempt to take all pages in the order aligned region
733 * surrounding the tag page. Only take those pages of 733 * surrounding the tag page. Only take those pages of
734 * the same active state as that tag page. We may safely 734 * the same active state as that tag page. We may safely
735 * round the target page pfn down to the requested order 735 * round the target page pfn down to the requested order
736 * as the mem_map is guaranteed valid out to MAX_ORDER. 736 * as the mem_map is guaranteed valid out to MAX_ORDER.
737 * If that page is in a different zone we will detect 737 * If that page is in a different zone we will detect
738 * it from its zone id and abort this block scan. 738 * it from its zone id and abort this block scan.
739 */ 739 */
740 zone_id = page_zone_id(page); 740 zone_id = page_zone_id(page);
741 page_pfn = page_to_pfn(page); 741 page_pfn = page_to_pfn(page);
742 pfn = page_pfn & ~((1 << order) - 1); 742 pfn = page_pfn & ~((1 << order) - 1);
743 end_pfn = pfn + (1 << order); 743 end_pfn = pfn + (1 << order);
744 for (; pfn < end_pfn; pfn++) { 744 for (; pfn < end_pfn; pfn++) {
745 struct page *cursor_page; 745 struct page *cursor_page;
746 746
747 /* The target page is in the block, ignore it. */ 747 /* The target page is in the block, ignore it. */
748 if (unlikely(pfn == page_pfn)) 748 if (unlikely(pfn == page_pfn))
749 continue; 749 continue;
750 750
751 /* Avoid holes within the zone. */ 751 /* Avoid holes within the zone. */
752 if (unlikely(!pfn_valid_within(pfn))) 752 if (unlikely(!pfn_valid_within(pfn)))
753 break; 753 break;
754 754
755 cursor_page = pfn_to_page(pfn); 755 cursor_page = pfn_to_page(pfn);
756 /* Check that we have not crossed a zone boundary. */ 756 /* Check that we have not crossed a zone boundary. */
757 if (unlikely(page_zone_id(cursor_page) != zone_id)) 757 if (unlikely(page_zone_id(cursor_page) != zone_id))
758 continue; 758 continue;
759 switch (__isolate_lru_page(cursor_page, mode)) { 759 switch (__isolate_lru_page(cursor_page, mode)) {
760 case 0: 760 case 0:
761 list_move(&cursor_page->lru, dst); 761 list_move(&cursor_page->lru, dst);
762 nr_taken++; 762 nr_taken++;
763 scan++; 763 scan++;
764 break; 764 break;
765 765
766 case -EBUSY: 766 case -EBUSY:
767 /* else it is being freed elsewhere */ 767 /* else it is being freed elsewhere */
768 list_move(&cursor_page->lru, src); 768 list_move(&cursor_page->lru, src);
769 default: 769 default:
770 break; 770 break;
771 } 771 }
772 } 772 }
773 } 773 }
774 774
775 *scanned = scan; 775 *scanned = scan;
776 return nr_taken; 776 return nr_taken;
777 } 777 }
778 778
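The order-aligned block scan above reduces to simple pfn arithmetic; a standalone sketch with illustrative numbers.

#include <stdio.h>

int main(void)
{
	unsigned long page_pfn = 0x12345;	/* arbitrary tag page */
	int order = 4;				/* 16-page block */

	unsigned long pfn = page_pfn & ~((1UL << order) - 1);
	unsigned long end_pfn = pfn + (1UL << order);

	/* prints: block 0x12340 .. 0x12350 (end exclusive) */
	printf("block %#lx .. %#lx\n", pfn, end_pfn);
	return 0;
}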
779 static unsigned long isolate_pages_global(unsigned long nr, 779 static unsigned long isolate_pages_global(unsigned long nr,
780 struct list_head *dst, 780 struct list_head *dst,
781 unsigned long *scanned, int order, 781 unsigned long *scanned, int order,
782 int mode, struct zone *z, 782 int mode, struct zone *z,
783 struct mem_cgroup *mem_cont, 783 struct mem_cgroup *mem_cont,
784 int active) 784 int active)
785 { 785 {
786 if (active) 786 if (active)
787 return isolate_lru_pages(nr, &z->active_list, dst, 787 return isolate_lru_pages(nr, &z->active_list, dst,
788 scanned, order, mode); 788 scanned, order, mode);
789 else 789 else
790 return isolate_lru_pages(nr, &z->inactive_list, dst, 790 return isolate_lru_pages(nr, &z->inactive_list, dst,
791 scanned, order, mode); 791 scanned, order, mode);
792 } 792 }
793 793
794 /* 794 /*
795 * clear_active_flags() is a helper for shrink_active_list(), clearing 795 * clear_active_flags() is a helper for shrink_active_list(), clearing
796 * any active bits from the pages in the list. 796 * any active bits from the pages in the list.
797 */ 797 */
798 static unsigned long clear_active_flags(struct list_head *page_list) 798 static unsigned long clear_active_flags(struct list_head *page_list)
799 { 799 {
800 int nr_active = 0; 800 int nr_active = 0;
801 struct page *page; 801 struct page *page;
802 802
803 list_for_each_entry(page, page_list, lru) 803 list_for_each_entry(page, page_list, lru)
804 if (PageActive(page)) { 804 if (PageActive(page)) {
805 ClearPageActive(page); 805 ClearPageActive(page);
806 nr_active++; 806 nr_active++;
807 } 807 }
808 808
809 return nr_active; 809 return nr_active;
810 } 810 }
811 811
812 /* 812 /*
813 * shrink_inactive_list() is a helper for shrink_zone(). It returns the number 813 * shrink_inactive_list() is a helper for shrink_zone(). It returns the number
814 * of reclaimed pages 814 * of reclaimed pages
815 */ 815 */
816 static unsigned long shrink_inactive_list(unsigned long max_scan, 816 static unsigned long shrink_inactive_list(unsigned long max_scan,
817 struct zone *zone, struct scan_control *sc) 817 struct zone *zone, struct scan_control *sc)
818 { 818 {
819 LIST_HEAD(page_list); 819 LIST_HEAD(page_list);
820 struct pagevec pvec; 820 struct pagevec pvec;
821 unsigned long nr_scanned = 0; 821 unsigned long nr_scanned = 0;
822 unsigned long nr_reclaimed = 0; 822 unsigned long nr_reclaimed = 0;
823 823
824 pagevec_init(&pvec, 1); 824 pagevec_init(&pvec, 1);
825 825
826 lru_add_drain(); 826 lru_add_drain();
827 spin_lock_irq(&zone->lru_lock); 827 spin_lock_irq(&zone->lru_lock);
828 do { 828 do {
829 struct page *page; 829 struct page *page;
830 unsigned long nr_taken; 830 unsigned long nr_taken;
831 unsigned long nr_scan; 831 unsigned long nr_scan;
832 unsigned long nr_freed; 832 unsigned long nr_freed;
833 unsigned long nr_active; 833 unsigned long nr_active;
834 834
835 nr_taken = sc->isolate_pages(sc->swap_cluster_max, 835 nr_taken = sc->isolate_pages(sc->swap_cluster_max,
836 &page_list, &nr_scan, sc->order, 836 &page_list, &nr_scan, sc->order,
837 (sc->order > PAGE_ALLOC_COSTLY_ORDER)? 837 (sc->order > PAGE_ALLOC_COSTLY_ORDER)?
838 ISOLATE_BOTH : ISOLATE_INACTIVE, 838 ISOLATE_BOTH : ISOLATE_INACTIVE,
839 zone, sc->mem_cgroup, 0); 839 zone, sc->mem_cgroup, 0);
840 nr_active = clear_active_flags(&page_list); 840 nr_active = clear_active_flags(&page_list);
841 __count_vm_events(PGDEACTIVATE, nr_active); 841 __count_vm_events(PGDEACTIVATE, nr_active);
842 842
843 __mod_zone_page_state(zone, NR_ACTIVE, -nr_active); 843 __mod_zone_page_state(zone, NR_ACTIVE, -nr_active);
844 __mod_zone_page_state(zone, NR_INACTIVE, 844 __mod_zone_page_state(zone, NR_INACTIVE,
845 -(nr_taken - nr_active)); 845 -(nr_taken - nr_active));
846 if (scan_global_lru(sc)) 846 if (scan_global_lru(sc))
847 zone->pages_scanned += nr_scan; 847 zone->pages_scanned += nr_scan;
848 spin_unlock_irq(&zone->lru_lock); 848 spin_unlock_irq(&zone->lru_lock);
849 849
850 nr_scanned += nr_scan; 850 nr_scanned += nr_scan;
851 nr_freed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC); 851 nr_freed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
852 852
853 /* 853 /*
854 * If we are direct reclaiming for contiguous pages and we do 854 * If we are direct reclaiming for contiguous pages and we do
855 * not reclaim everything in the list, try again and wait 855 * not reclaim everything in the list, try again and wait
856 * for IO to complete. This will stall high-order allocations 856 * for IO to complete. This will stall high-order allocations
857 * but that should be acceptable to the caller 857 * but that should be acceptable to the caller
858 */ 858 */
859 if (nr_freed < nr_taken && !current_is_kswapd() && 859 if (nr_freed < nr_taken && !current_is_kswapd() &&
860 sc->order > PAGE_ALLOC_COSTLY_ORDER) { 860 sc->order > PAGE_ALLOC_COSTLY_ORDER) {
861 congestion_wait(WRITE, HZ/10); 861 congestion_wait(WRITE, HZ/10);
862 862
863 /* 863 /*
864 * The attempt at page out may have made some 864 * The attempt at page out may have made some
865 * of the pages active, mark them inactive again. 865 * of the pages active, mark them inactive again.
866 */ 866 */
867 nr_active = clear_active_flags(&page_list); 867 nr_active = clear_active_flags(&page_list);
868 count_vm_events(PGDEACTIVATE, nr_active); 868 count_vm_events(PGDEACTIVATE, nr_active);
869 869
870 nr_freed += shrink_page_list(&page_list, sc, 870 nr_freed += shrink_page_list(&page_list, sc,
871 PAGEOUT_IO_SYNC); 871 PAGEOUT_IO_SYNC);
872 } 872 }
873 873
874 nr_reclaimed += nr_freed; 874 nr_reclaimed += nr_freed;
875 local_irq_disable(); 875 local_irq_disable();
876 if (current_is_kswapd()) { 876 if (current_is_kswapd()) {
877 __count_zone_vm_events(PGSCAN_KSWAPD, zone, nr_scan); 877 __count_zone_vm_events(PGSCAN_KSWAPD, zone, nr_scan);
878 __count_vm_events(KSWAPD_STEAL, nr_freed); 878 __count_vm_events(KSWAPD_STEAL, nr_freed);
879 } else if (scan_global_lru(sc)) 879 } else if (scan_global_lru(sc))
880 __count_zone_vm_events(PGSCAN_DIRECT, zone, nr_scan); 880 __count_zone_vm_events(PGSCAN_DIRECT, zone, nr_scan);
881 881
882 __count_zone_vm_events(PGSTEAL, zone, nr_freed); 882 __count_zone_vm_events(PGSTEAL, zone, nr_freed);
883 883
884 if (nr_taken == 0) 884 if (nr_taken == 0)
885 goto done; 885 goto done;
886 886
887 spin_lock(&zone->lru_lock); 887 spin_lock(&zone->lru_lock);
888 /* 888 /*
889 * Put back any unfreeable pages. 889 * Put back any unfreeable pages.
890 */ 890 */
891 while (!list_empty(&page_list)) { 891 while (!list_empty(&page_list)) {
892 page = lru_to_page(&page_list); 892 page = lru_to_page(&page_list);
893 VM_BUG_ON(PageLRU(page)); 893 VM_BUG_ON(PageLRU(page));
894 SetPageLRU(page); 894 SetPageLRU(page);
895 list_del(&page->lru); 895 list_del(&page->lru);
896 if (PageActive(page)) 896 if (PageActive(page))
897 add_page_to_active_list(zone, page); 897 add_page_to_active_list(zone, page);
898 else 898 else
899 add_page_to_inactive_list(zone, page); 899 add_page_to_inactive_list(zone, page);
900 if (!pagevec_add(&pvec, page)) { 900 if (!pagevec_add(&pvec, page)) {
901 spin_unlock_irq(&zone->lru_lock); 901 spin_unlock_irq(&zone->lru_lock);
902 __pagevec_release(&pvec); 902 __pagevec_release(&pvec);
903 spin_lock_irq(&zone->lru_lock); 903 spin_lock_irq(&zone->lru_lock);
904 } 904 }
905 } 905 }
906 } while (nr_scanned < max_scan); 906 } while (nr_scanned < max_scan);
907 spin_unlock(&zone->lru_lock); 907 spin_unlock(&zone->lru_lock);
908 done: 908 done:
909 local_irq_enable(); 909 local_irq_enable();
910 pagevec_release(&pvec); 910 pagevec_release(&pvec);
911 return nr_reclaimed; 911 return nr_reclaimed;
912 } 912 }
913 913
914 /* 914 /*
915 * We are about to scan this zone at a certain priority level. If that priority 915 * We are about to scan this zone at a certain priority level. If that priority
916 * level is smaller (ie: more urgent) than the previous priority, then note 916 * level is smaller (ie: more urgent) than the previous priority, then note
917 * that priority level within the zone. This is done so that when the next 917 * that priority level within the zone. This is done so that when the next
918 * process comes in to scan this zone, it will immediately start out at this 918 * process comes in to scan this zone, it will immediately start out at this
919 * priority level rather than having to build up its own scanning priority. 919 * priority level rather than having to build up its own scanning priority.
920 * Here, this priority affects only the reclaim-mapped threshold. 920 * Here, this priority affects only the reclaim-mapped threshold.
921 */ 921 */
922 static inline void note_zone_scanning_priority(struct zone *zone, int priority) 922 static inline void note_zone_scanning_priority(struct zone *zone, int priority)
923 { 923 {
924 if (priority < zone->prev_priority) 924 if (priority < zone->prev_priority)
925 zone->prev_priority = priority; 925 zone->prev_priority = priority;
926 } 926 }
927 927
928 static inline int zone_is_near_oom(struct zone *zone) 928 static inline int zone_is_near_oom(struct zone *zone)
929 { 929 {
930 return zone->pages_scanned >= (zone_page_state(zone, NR_ACTIVE) 930 return zone->pages_scanned >= (zone_page_state(zone, NR_ACTIVE)
931 + zone_page_state(zone, NR_INACTIVE))*3; 931 + zone_page_state(zone, NR_INACTIVE))*3;
932 } 932 }
933 933
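For illustration, the near-OOM test above treats a zone as being in trouble once it has been scanned roughly three times over without the scan counter being reset by successful reclaim. Below is a minimal userspace sketch of that threshold; the struct fields and numbers are stand-ins for the zone statistics, not kernel code.

#include <stdio.h>

/* Hypothetical stand-in for the zone counters used by zone_is_near_oom(). */
struct zone_stats {
	unsigned long pages_scanned;	/* pages scanned since last reclaim success */
	unsigned long nr_active;
	unsigned long nr_inactive;
};

static int zone_is_near_oom(const struct zone_stats *z)
{
	/* near OOM once we have scanned 3x the LRU size without progress */
	return z->pages_scanned >= (z->nr_active + z->nr_inactive) * 3;
}

int main(void)
{
	struct zone_stats z = { .pages_scanned = 70000,
				.nr_active = 10000, .nr_inactive = 12000 };
	printf("near oom: %d\n", zone_is_near_oom(&z));	/* 70000 >= 66000 -> 1 */
	return 0;
}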
934 /* 934 /*
935 * Determine whether we should try to reclaim mapped pages. 935 * Determine whether we should try to reclaim mapped pages.
936 * This is called only when sc->mem_cgroup is NULL. 936 * This is called only when sc->mem_cgroup is NULL.
937 */ 937 */
938 static int calc_reclaim_mapped(struct scan_control *sc, struct zone *zone, 938 static int calc_reclaim_mapped(struct scan_control *sc, struct zone *zone,
939 int priority) 939 int priority)
940 { 940 {
941 long mapped_ratio; 941 long mapped_ratio;
942 long distress; 942 long distress;
943 long swap_tendency; 943 long swap_tendency;
944 long imbalance; 944 long imbalance;
945 int reclaim_mapped = 0; 945 int reclaim_mapped = 0;
946 int prev_priority; 946 int prev_priority;
947 947
948 if (scan_global_lru(sc) && zone_is_near_oom(zone)) 948 if (scan_global_lru(sc) && zone_is_near_oom(zone))
949 return 1; 949 return 1;
950 /* 950 /*
951 * `distress' is a measure of how much trouble we're having 951 * `distress' is a measure of how much trouble we're having
952 * reclaiming pages. 0 -> no problems. 100 -> great trouble. 952 * reclaiming pages. 0 -> no problems. 100 -> great trouble.
953 */ 953 */
954 if (scan_global_lru(sc)) 954 if (scan_global_lru(sc))
955 prev_priority = zone->prev_priority; 955 prev_priority = zone->prev_priority;
956 else 956 else
957 prev_priority = mem_cgroup_get_reclaim_priority(sc->mem_cgroup); 957 prev_priority = mem_cgroup_get_reclaim_priority(sc->mem_cgroup);
958 958
959 distress = 100 >> min(prev_priority, priority); 959 distress = 100 >> min(prev_priority, priority);
960 960
961 /* 961 /*
962 * The point of this algorithm is to decide when to start 962 * The point of this algorithm is to decide when to start
963 * reclaiming mapped memory instead of just pagecache. Work out 963 * reclaiming mapped memory instead of just pagecache. Work out
964 * how much memory 964 * how much memory
965 * is mapped. 965 * is mapped.
966 */ 966 */
967 if (scan_global_lru(sc)) 967 if (scan_global_lru(sc))
968 mapped_ratio = ((global_page_state(NR_FILE_MAPPED) + 968 mapped_ratio = ((global_page_state(NR_FILE_MAPPED) +
969 global_page_state(NR_ANON_PAGES)) * 100) / 969 global_page_state(NR_ANON_PAGES)) * 100) /
970 vm_total_pages; 970 vm_total_pages;
971 else 971 else
972 mapped_ratio = mem_cgroup_calc_mapped_ratio(sc->mem_cgroup); 972 mapped_ratio = mem_cgroup_calc_mapped_ratio(sc->mem_cgroup);
973 973
974 /* 974 /*
975 * Now decide how much we really want to unmap some pages. The 975 * Now decide how much we really want to unmap some pages. The
976 * mapped ratio is downgraded - just because there's a lot of 976 * mapped ratio is downgraded - just because there's a lot of
977 * mapped memory doesn't necessarily mean that page reclaim 977 * mapped memory doesn't necessarily mean that page reclaim
978 * isn't succeeding. 978 * isn't succeeding.
979 * 979 *
980 * The distress ratio is important - we don't want to start 980 * The distress ratio is important - we don't want to start
981 * going oom. 981 * going oom.
982 * 982 *
983 * A 100% value of vm_swappiness overrides this algorithm 983 * A 100% value of vm_swappiness overrides this algorithm
984 * altogether. 984 * altogether.
985 */ 985 */
986 swap_tendency = mapped_ratio / 2 + distress + sc->swappiness; 986 swap_tendency = mapped_ratio / 2 + distress + sc->swappiness;
987 987
988 /* 988 /*
989 * If there's huge imbalance between active and inactive 989 * If there's huge imbalance between active and inactive
990 * (think active 100 times larger than inactive) we should 990 * (think active 100 times larger than inactive) we should
991 * become more permissive, or the system will take too much 991 * become more permissive, or the system will take too much
992 * cpu before it starts swapping during memory pressure. 992 * cpu before it starts swapping during memory pressure.
993 * Distress is about avoiding early-oom, this is about 993 * Distress is about avoiding early-oom, this is about
994 * making swappiness graceful despite setting it to low 994 * making swappiness graceful despite setting it to low
995 * values. 995 * values.
996 * 996 *
997 * Avoid div by zero with nr_inactive+1, and max resulting 997 * Avoid div by zero with nr_inactive+1, and max resulting
998 * value is vm_total_pages. 998 * value is vm_total_pages.
999 */ 999 */
1000 if (scan_global_lru(sc)) { 1000 if (scan_global_lru(sc)) {
1001 imbalance = zone_page_state(zone, NR_ACTIVE); 1001 imbalance = zone_page_state(zone, NR_ACTIVE);
1002 imbalance /= zone_page_state(zone, NR_INACTIVE) + 1; 1002 imbalance /= zone_page_state(zone, NR_INACTIVE) + 1;
1003 } else 1003 } else
1004 imbalance = mem_cgroup_reclaim_imbalance(sc->mem_cgroup); 1004 imbalance = mem_cgroup_reclaim_imbalance(sc->mem_cgroup);
1005 1005
1006 /* 1006 /*
1007 * Reduce the effect of imbalance if swappiness is low, 1007 * Reduce the effect of imbalance if swappiness is low,
1008 * this means for a swappiness very low, the imbalance 1008 * this means for a swappiness very low, the imbalance
1009 * must be much higher than 100 for this logic to make 1009 * must be much higher than 100 for this logic to make
1010 * the difference. 1010 * the difference.
1011 * 1011 *
1012 * Max temporary value is vm_total_pages*100. 1012 * Max temporary value is vm_total_pages*100.
1013 */ 1013 */
1014 imbalance *= (vm_swappiness + 1); 1014 imbalance *= (vm_swappiness + 1);
1015 imbalance /= 100; 1015 imbalance /= 100;
1016 1016
1017 /* 1017 /*
1018 * If not much of the RAM is mapped, make the imbalance 1018 * If not much of the RAM is mapped, make the imbalance
1019 * less relevant; refilling the inactive list with mapped 1019 * less relevant; refilling the inactive list with mapped
1020 * pages is a high priority only when the ratio of mapped 1020 * pages is a high priority only when the ratio of mapped
1021 * pages is high. 1021 * pages is high.
1022 * 1022 *
1023 * Max temporary value is vm_total_pages*100. 1023 * Max temporary value is vm_total_pages*100.
1024 */ 1024 */
1025 imbalance *= mapped_ratio; 1025 imbalance *= mapped_ratio;
1026 imbalance /= 100; 1026 imbalance /= 100;
1027 1027
1028 /* apply imbalance feedback to swap_tendency */ 1028 /* apply imbalance feedback to swap_tendency */
1029 swap_tendency += imbalance; 1029 swap_tendency += imbalance;
1030 1030
1031 /* 1031 /*
1032 * Now use this metric to decide whether to start moving mapped 1032 * Now use this metric to decide whether to start moving mapped
1033 * memory onto the inactive list. 1033 * memory onto the inactive list.
1034 */ 1034 */
1035 if (swap_tendency >= 100) 1035 if (swap_tendency >= 100)
1036 reclaim_mapped = 1; 1036 reclaim_mapped = 1;
1037 1037
1038 return reclaim_mapped; 1038 return reclaim_mapped;
1039 } 1039 }
1040 1040
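To make the heuristic above concrete, here is a minimal userspace sketch that plugs made-up numbers into the same swap_tendency arithmetic (distress, mapped ratio, swappiness, imbalance feedback). The values and variable names are illustrative only; this is not kernel code.

#include <stdio.h>

int main(void)
{
	long swappiness = 60;		/* the vm_swappiness default */
	long prev_priority = 10, priority = 9;
	long mapped_ratio = 40;		/* say 40% of RAM is mapped */
	long nr_active = 200000, nr_inactive = 50000;

	/* distress: 0 = no trouble reclaiming, 100 = great trouble */
	long distress = 100 >> (prev_priority < priority ? prev_priority : priority);

	long swap_tendency = mapped_ratio / 2 + distress + swappiness;

	/* imbalance feedback: a huge active/inactive ratio pushes toward
	 * reclaiming mapped pages, damped by swappiness and by how much
	 * of RAM is mapped at all */
	long imbalance = nr_active / (nr_inactive + 1);
	imbalance = imbalance * (swappiness + 1) / 100;
	imbalance = imbalance * mapped_ratio / 100;
	swap_tendency += imbalance;

	/* mapped pages become reclaim candidates once this crosses 100 */
	printf("swap_tendency = %ld -> reclaim_mapped = %d\n",
	       swap_tendency, swap_tendency >= 100);
	return 0;
}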
1041 /* 1041 /*
1042 * This moves pages from the active list to the inactive list. 1042 * This moves pages from the active list to the inactive list.
1043 * 1043 *
1044 * We move them the other way if the page is referenced by one or more 1044 * We move them the other way if the page is referenced by one or more
1045 * processes, from rmap. 1045 * processes, from rmap.
1046 * 1046 *
1047 * If the pages are mostly unmapped, the processing is fast and it is 1047 * If the pages are mostly unmapped, the processing is fast and it is
1048 * appropriate to hold zone->lru_lock across the whole operation. But if 1048 * appropriate to hold zone->lru_lock across the whole operation. But if
1049 * the pages are mapped, the processing is slow (page_referenced()) so we 1049 * the pages are mapped, the processing is slow (page_referenced()) so we
1050 * should drop zone->lru_lock around each page. It's impossible to balance 1050 * should drop zone->lru_lock around each page. It's impossible to balance
1051 * this, so instead we remove the pages from the LRU while processing them. 1051 * this, so instead we remove the pages from the LRU while processing them.
1052 * It is safe to rely on PG_active against the non-LRU pages in here because 1052 * It is safe to rely on PG_active against the non-LRU pages in here because
1053 * nobody will play with that bit on a non-LRU page. 1053 * nobody will play with that bit on a non-LRU page.
1054 * 1054 *
1055 * The downside is that we have to touch page->_count against each page. 1055 * The downside is that we have to touch page->_count against each page.
1056 * But we had to alter page->flags anyway. 1056 * But we had to alter page->flags anyway.
1057 */ 1057 */
1058 1058
1059 1059
1060 static void shrink_active_list(unsigned long nr_pages, struct zone *zone, 1060 static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
1061 struct scan_control *sc, int priority) 1061 struct scan_control *sc, int priority)
1062 { 1062 {
1063 unsigned long pgmoved; 1063 unsigned long pgmoved;
1064 int pgdeactivate = 0; 1064 int pgdeactivate = 0;
1065 unsigned long pgscanned; 1065 unsigned long pgscanned;
1066 LIST_HEAD(l_hold); /* The pages which were snipped off */ 1066 LIST_HEAD(l_hold); /* The pages which were snipped off */
1067 LIST_HEAD(l_inactive); /* Pages to go onto the inactive_list */ 1067 LIST_HEAD(l_inactive); /* Pages to go onto the inactive_list */
1068 LIST_HEAD(l_active); /* Pages to go onto the active_list */ 1068 LIST_HEAD(l_active); /* Pages to go onto the active_list */
1069 struct page *page; 1069 struct page *page;
1070 struct pagevec pvec; 1070 struct pagevec pvec;
1071 int reclaim_mapped = 0; 1071 int reclaim_mapped = 0;
1072 1072
1073 if (sc->may_swap) 1073 if (sc->may_swap)
1074 reclaim_mapped = calc_reclaim_mapped(sc, zone, priority); 1074 reclaim_mapped = calc_reclaim_mapped(sc, zone, priority);
1075 1075
1076 lru_add_drain(); 1076 lru_add_drain();
1077 spin_lock_irq(&zone->lru_lock); 1077 spin_lock_irq(&zone->lru_lock);
1078 pgmoved = sc->isolate_pages(nr_pages, &l_hold, &pgscanned, sc->order, 1078 pgmoved = sc->isolate_pages(nr_pages, &l_hold, &pgscanned, sc->order,
1079 ISOLATE_ACTIVE, zone, 1079 ISOLATE_ACTIVE, zone,
1080 sc->mem_cgroup, 1); 1080 sc->mem_cgroup, 1);
1081 /* 1081 /*
1082 * zone->pages_scanned is used to detect a zone's oom state; 1082 * zone->pages_scanned is used to detect a zone's oom state;
1083 * mem_cgroup remembers nr_scan by itself. 1083 * mem_cgroup remembers nr_scan by itself.
1084 */ 1084 */
1085 if (scan_global_lru(sc)) 1085 if (scan_global_lru(sc))
1086 zone->pages_scanned += pgscanned; 1086 zone->pages_scanned += pgscanned;
1087 1087
1088 __mod_zone_page_state(zone, NR_ACTIVE, -pgmoved); 1088 __mod_zone_page_state(zone, NR_ACTIVE, -pgmoved);
1089 spin_unlock_irq(&zone->lru_lock); 1089 spin_unlock_irq(&zone->lru_lock);
1090 1090
1091 while (!list_empty(&l_hold)) { 1091 while (!list_empty(&l_hold)) {
1092 cond_resched(); 1092 cond_resched();
1093 page = lru_to_page(&l_hold); 1093 page = lru_to_page(&l_hold);
1094 list_del(&page->lru); 1094 list_del(&page->lru);
1095 if (page_mapped(page)) { 1095 if (page_mapped(page)) {
1096 if (!reclaim_mapped || 1096 if (!reclaim_mapped ||
1097 (total_swap_pages == 0 && PageAnon(page)) || 1097 (total_swap_pages == 0 && PageAnon(page)) ||
1098 page_referenced(page, 0, sc->mem_cgroup)) { 1098 page_referenced(page, 0, sc->mem_cgroup)) {
1099 list_add(&page->lru, &l_active); 1099 list_add(&page->lru, &l_active);
1100 continue; 1100 continue;
1101 } 1101 }
1102 } 1102 }
1103 list_add(&page->lru, &l_inactive); 1103 list_add(&page->lru, &l_inactive);
1104 } 1104 }
1105 1105
1106 pagevec_init(&pvec, 1); 1106 pagevec_init(&pvec, 1);
1107 pgmoved = 0; 1107 pgmoved = 0;
1108 spin_lock_irq(&zone->lru_lock); 1108 spin_lock_irq(&zone->lru_lock);
1109 while (!list_empty(&l_inactive)) { 1109 while (!list_empty(&l_inactive)) {
1110 page = lru_to_page(&l_inactive); 1110 page = lru_to_page(&l_inactive);
1111 prefetchw_prev_lru_page(page, &l_inactive, flags); 1111 prefetchw_prev_lru_page(page, &l_inactive, flags);
1112 VM_BUG_ON(PageLRU(page)); 1112 VM_BUG_ON(PageLRU(page));
1113 SetPageLRU(page); 1113 SetPageLRU(page);
1114 VM_BUG_ON(!PageActive(page)); 1114 VM_BUG_ON(!PageActive(page));
1115 ClearPageActive(page); 1115 ClearPageActive(page);
1116 1116
1117 list_move(&page->lru, &zone->inactive_list); 1117 list_move(&page->lru, &zone->inactive_list);
1118 mem_cgroup_move_lists(page, false); 1118 mem_cgroup_move_lists(page, false);
1119 pgmoved++; 1119 pgmoved++;
1120 if (!pagevec_add(&pvec, page)) { 1120 if (!pagevec_add(&pvec, page)) {
1121 __mod_zone_page_state(zone, NR_INACTIVE, pgmoved); 1121 __mod_zone_page_state(zone, NR_INACTIVE, pgmoved);
1122 spin_unlock_irq(&zone->lru_lock); 1122 spin_unlock_irq(&zone->lru_lock);
1123 pgdeactivate += pgmoved; 1123 pgdeactivate += pgmoved;
1124 pgmoved = 0; 1124 pgmoved = 0;
1125 if (buffer_heads_over_limit) 1125 if (buffer_heads_over_limit)
1126 pagevec_strip(&pvec); 1126 pagevec_strip(&pvec);
1127 __pagevec_release(&pvec); 1127 __pagevec_release(&pvec);
1128 spin_lock_irq(&zone->lru_lock); 1128 spin_lock_irq(&zone->lru_lock);
1129 } 1129 }
1130 } 1130 }
1131 __mod_zone_page_state(zone, NR_INACTIVE, pgmoved); 1131 __mod_zone_page_state(zone, NR_INACTIVE, pgmoved);
1132 pgdeactivate += pgmoved; 1132 pgdeactivate += pgmoved;
1133 if (buffer_heads_over_limit) { 1133 if (buffer_heads_over_limit) {
1134 spin_unlock_irq(&zone->lru_lock); 1134 spin_unlock_irq(&zone->lru_lock);
1135 pagevec_strip(&pvec); 1135 pagevec_strip(&pvec);
1136 spin_lock_irq(&zone->lru_lock); 1136 spin_lock_irq(&zone->lru_lock);
1137 } 1137 }
1138 1138
1139 pgmoved = 0; 1139 pgmoved = 0;
1140 while (!list_empty(&l_active)) { 1140 while (!list_empty(&l_active)) {
1141 page = lru_to_page(&l_active); 1141 page = lru_to_page(&l_active);
1142 prefetchw_prev_lru_page(page, &l_active, flags); 1142 prefetchw_prev_lru_page(page, &l_active, flags);
1143 VM_BUG_ON(PageLRU(page)); 1143 VM_BUG_ON(PageLRU(page));
1144 SetPageLRU(page); 1144 SetPageLRU(page);
1145 VM_BUG_ON(!PageActive(page)); 1145 VM_BUG_ON(!PageActive(page));
1146 1146
1147 list_move(&page->lru, &zone->active_list); 1147 list_move(&page->lru, &zone->active_list);
1148 mem_cgroup_move_lists(page, true); 1148 mem_cgroup_move_lists(page, true);
1149 pgmoved++; 1149 pgmoved++;
1150 if (!pagevec_add(&pvec, page)) { 1150 if (!pagevec_add(&pvec, page)) {
1151 __mod_zone_page_state(zone, NR_ACTIVE, pgmoved); 1151 __mod_zone_page_state(zone, NR_ACTIVE, pgmoved);
1152 pgmoved = 0; 1152 pgmoved = 0;
1153 spin_unlock_irq(&zone->lru_lock); 1153 spin_unlock_irq(&zone->lru_lock);
1154 __pagevec_release(&pvec); 1154 __pagevec_release(&pvec);
1155 spin_lock_irq(&zone->lru_lock); 1155 spin_lock_irq(&zone->lru_lock);
1156 } 1156 }
1157 } 1157 }
1158 __mod_zone_page_state(zone, NR_ACTIVE, pgmoved); 1158 __mod_zone_page_state(zone, NR_ACTIVE, pgmoved);
1159 1159
1160 __count_zone_vm_events(PGREFILL, zone, pgscanned); 1160 __count_zone_vm_events(PGREFILL, zone, pgscanned);
1161 __count_vm_events(PGDEACTIVATE, pgdeactivate); 1161 __count_vm_events(PGDEACTIVATE, pgdeactivate);
1162 spin_unlock_irq(&zone->lru_lock); 1162 spin_unlock_irq(&zone->lru_lock);
1163 1163
1164 pagevec_release(&pvec); 1164 pagevec_release(&pvec);
1165 } 1165 }
1166 1166
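The per-page decision in the loop above can be summarised as: unmapped pages are always deactivated, while a mapped page stays active if we have not yet decided to reclaim mapped memory, if it is an anonymous page with no swap to go to, or if it was recently referenced. Below is a small standalone sketch of that triage; the struct and flags are stand-ins, not the real page flags.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for page_mapped(), PageAnon(), page_referenced(). */
struct page_info {
	bool mapped;
	bool anon;
	bool referenced;
};

/* Returns true if the page should go back onto the active list. */
static bool keep_active(const struct page_info *p,
			bool reclaim_mapped, bool have_swap)
{
	if (!p->mapped)
		return false;	/* unmapped pages are always deactivated */
	if (!reclaim_mapped)
		return true;	/* not reclaiming mapped memory yet */
	if (!have_swap && p->anon)
		return true;	/* nowhere to put anonymous memory */
	return p->referenced;	/* recently used mapped pages stay active */
}

int main(void)
{
	struct page_info p = { .mapped = true, .anon = true, .referenced = false };
	printf("keep active: %d\n", keep_active(&p, true, false));	/* 1 */
	return 0;
}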
1167 /* 1167 /*
1168 * This is a basic per-zone page freer. Used by both kswapd and direct reclaim. 1168 * This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
1169 */ 1169 */
1170 static unsigned long shrink_zone(int priority, struct zone *zone, 1170 static unsigned long shrink_zone(int priority, struct zone *zone,
1171 struct scan_control *sc) 1171 struct scan_control *sc)
1172 { 1172 {
1173 unsigned long nr_active; 1173 unsigned long nr_active;
1174 unsigned long nr_inactive; 1174 unsigned long nr_inactive;
1175 unsigned long nr_to_scan; 1175 unsigned long nr_to_scan;
1176 unsigned long nr_reclaimed = 0; 1176 unsigned long nr_reclaimed = 0;
1177 1177
1178 if (scan_global_lru(sc)) { 1178 if (scan_global_lru(sc)) {
1179 /* 1179 /*
1180 * Add one to nr_to_scan just to make sure that the kernel 1180 * Add one to nr_to_scan just to make sure that the kernel
1181 * will slowly sift through the active list. 1181 * will slowly sift through the active list.
1182 */ 1182 */
1183 zone->nr_scan_active += 1183 zone->nr_scan_active +=
1184 (zone_page_state(zone, NR_ACTIVE) >> priority) + 1; 1184 (zone_page_state(zone, NR_ACTIVE) >> priority) + 1;
1185 nr_active = zone->nr_scan_active; 1185 nr_active = zone->nr_scan_active;
1186 zone->nr_scan_inactive += 1186 zone->nr_scan_inactive +=
1187 (zone_page_state(zone, NR_INACTIVE) >> priority) + 1; 1187 (zone_page_state(zone, NR_INACTIVE) >> priority) + 1;
1188 nr_inactive = zone->nr_scan_inactive; 1188 nr_inactive = zone->nr_scan_inactive;
1189 if (nr_inactive >= sc->swap_cluster_max) 1189 if (nr_inactive >= sc->swap_cluster_max)
1190 zone->nr_scan_inactive = 0; 1190 zone->nr_scan_inactive = 0;
1191 else 1191 else
1192 nr_inactive = 0; 1192 nr_inactive = 0;
1193 1193
1194 if (nr_active >= sc->swap_cluster_max) 1194 if (nr_active >= sc->swap_cluster_max)
1195 zone->nr_scan_active = 0; 1195 zone->nr_scan_active = 0;
1196 else 1196 else
1197 nr_active = 0; 1197 nr_active = 0;
1198 } else { 1198 } else {
1199 /* 1199 /*
1200 * This reclaim occurs not because of a zone memory shortage but 1200 * This reclaim occurs not because of a zone memory shortage but
1201 * because the memory controller has hit its limit, 1201 * because the memory controller has hit its limit,
1202 * so don't modify zone reclaim related data. 1202 * so don't modify zone reclaim related data.
1203 */ 1203 */
1204 nr_active = mem_cgroup_calc_reclaim_active(sc->mem_cgroup, 1204 nr_active = mem_cgroup_calc_reclaim_active(sc->mem_cgroup,
1205 zone, priority); 1205 zone, priority);
1206 1206
1207 nr_inactive = mem_cgroup_calc_reclaim_inactive(sc->mem_cgroup, 1207 nr_inactive = mem_cgroup_calc_reclaim_inactive(sc->mem_cgroup,
1208 zone, priority); 1208 zone, priority);
1209 } 1209 }
1210 1210
1211 1211
1212 while (nr_active || nr_inactive) { 1212 while (nr_active || nr_inactive) {
1213 if (nr_active) { 1213 if (nr_active) {
1214 nr_to_scan = min(nr_active, 1214 nr_to_scan = min(nr_active,
1215 (unsigned long)sc->swap_cluster_max); 1215 (unsigned long)sc->swap_cluster_max);
1216 nr_active -= nr_to_scan; 1216 nr_active -= nr_to_scan;
1217 shrink_active_list(nr_to_scan, zone, sc, priority); 1217 shrink_active_list(nr_to_scan, zone, sc, priority);
1218 } 1218 }
1219 1219
1220 if (nr_inactive) { 1220 if (nr_inactive) {
1221 nr_to_scan = min(nr_inactive, 1221 nr_to_scan = min(nr_inactive,
1222 (unsigned long)sc->swap_cluster_max); 1222 (unsigned long)sc->swap_cluster_max);
1223 nr_inactive -= nr_to_scan; 1223 nr_inactive -= nr_to_scan;
1224 nr_reclaimed += shrink_inactive_list(nr_to_scan, zone, 1224 nr_reclaimed += shrink_inactive_list(nr_to_scan, zone,
1225 sc); 1225 sc);
1226 } 1226 }
1227 } 1227 }
1228 1228
1229 throttle_vm_writeout(sc->gfp_mask); 1229 throttle_vm_writeout(sc->gfp_mask);
1230 return nr_reclaimed; 1230 return nr_reclaimed;
1231 } 1231 }
1232 1232
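shrink_zone() above works its scan targets off in swap_cluster_max-sized batches, alternating between the active and inactive lists until both counts are exhausted. A minimal sketch of that batching with made-up counts follows; nothing here is kernel code.

#include <stdio.h>

#define SWAP_CLUSTER_MAX 32UL

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

int main(void)
{
	unsigned long nr_active = 100, nr_inactive = 70, batches = 0;

	while (nr_active || nr_inactive) {
		if (nr_active) {
			unsigned long n = min_ul(nr_active, SWAP_CLUSTER_MAX);
			nr_active -= n;		/* would call shrink_active_list(n, ...) */
			batches++;
		}
		if (nr_inactive) {
			unsigned long n = min_ul(nr_inactive, SWAP_CLUSTER_MAX);
			nr_inactive -= n;	/* would call shrink_inactive_list(n, ...) */
			batches++;
		}
	}
	printf("batches: %lu\n", batches);	/* 4 active + 3 inactive = 7 */
	return 0;
}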
1233 /* 1233 /*
1234 * This is the direct reclaim path, for page-allocating processes. We only 1234 * This is the direct reclaim path, for page-allocating processes. We only
1235 * try to reclaim pages from zones which will satisfy the caller's allocation 1235 * try to reclaim pages from zones which will satisfy the caller's allocation
1236 * request. 1236 * request.
1237 * 1237 *
1238 * We reclaim from a zone even if that zone is over pages_high. Because: 1238 * We reclaim from a zone even if that zone is over pages_high. Because:
1239 * a) The caller may be trying to free *extra* pages to satisfy a higher-order 1239 * a) The caller may be trying to free *extra* pages to satisfy a higher-order
1240 * allocation or 1240 * allocation or
1241 * b) The zones may be over pages_high but they must go *over* pages_high to 1241 * b) The zones may be over pages_high but they must go *over* pages_high to
1242 * satisfy the `incremental min' zone defense algorithm. 1242 * satisfy the `incremental min' zone defense algorithm.
1243 * 1243 *
1244 * Returns the number of reclaimed pages. 1244 * Returns the number of reclaimed pages.
1245 * 1245 *
1246 * If a zone is deemed to be full of pinned pages then just give it a light 1246 * If a zone is deemed to be full of pinned pages then just give it a light
1247 * scan then give up on it. 1247 * scan then give up on it.
1248 */ 1248 */
1249 static unsigned long shrink_zones(int priority, struct zonelist *zonelist, 1249 static unsigned long shrink_zones(int priority, struct zonelist *zonelist,
1250 struct scan_control *sc) 1250 struct scan_control *sc)
1251 { 1251 {
1252 enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask); 1252 enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask);
1253 unsigned long nr_reclaimed = 0; 1253 unsigned long nr_reclaimed = 0;
1254 struct zoneref *z; 1254 struct zoneref *z;
1255 struct zone *zone; 1255 struct zone *zone;
1256 1256
1257 sc->all_unreclaimable = 1; 1257 sc->all_unreclaimable = 1;
1258 for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) { 1258 for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
1259 if (!populated_zone(zone)) 1259 if (!populated_zone(zone))
1260 continue; 1260 continue;
1261 /* 1261 /*
1262 * Take care that memory controller reclaim has only a small 1262 * Take care that memory controller reclaim has only a small
1263 * influence on the global LRU. 1263 * influence on the global LRU.
1264 */ 1264 */
1265 if (scan_global_lru(sc)) { 1265 if (scan_global_lru(sc)) {
1266 if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL)) 1266 if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
1267 continue; 1267 continue;
1268 note_zone_scanning_priority(zone, priority); 1268 note_zone_scanning_priority(zone, priority);
1269 1269
1270 if (zone_is_all_unreclaimable(zone) && 1270 if (zone_is_all_unreclaimable(zone) &&
1271 priority != DEF_PRIORITY) 1271 priority != DEF_PRIORITY)
1272 continue; /* Let kswapd poll it */ 1272 continue; /* Let kswapd poll it */
1273 sc->all_unreclaimable = 0; 1273 sc->all_unreclaimable = 0;
1274 } else { 1274 } else {
1275 /* 1275 /*
1276 * Ignore cpuset limitation here. We just want to reduce 1276 * Ignore cpuset limitation here. We just want to reduce
1277 * # of used pages by us regardless of memory shortage. 1277 * # of used pages by us regardless of memory shortage.
1278 */ 1278 */
1279 sc->all_unreclaimable = 0; 1279 sc->all_unreclaimable = 0;
1280 mem_cgroup_note_reclaim_priority(sc->mem_cgroup, 1280 mem_cgroup_note_reclaim_priority(sc->mem_cgroup,
1281 priority); 1281 priority);
1282 } 1282 }
1283 1283
1284 nr_reclaimed += shrink_zone(priority, zone, sc); 1284 nr_reclaimed += shrink_zone(priority, zone, sc);
1285 } 1285 }
1286 1286
1287 return nr_reclaimed; 1287 return nr_reclaimed;
1288 } 1288 }
1289 1289
1290 /* 1290 /*
1291 * This is the main entry point to direct page reclaim. 1291 * This is the main entry point to direct page reclaim.
1292 * 1292 *
1293 * If a full scan of the inactive list fails to free enough memory then we 1293 * If a full scan of the inactive list fails to free enough memory then we
1294 * are "out of memory" and something needs to be killed. 1294 * are "out of memory" and something needs to be killed.
1295 * 1295 *
1296 * If the caller is !__GFP_FS then the probability of a failure is reasonably 1296 * If the caller is !__GFP_FS then the probability of a failure is reasonably
1297 * high - the zone may be full of dirty or under-writeback pages, which this 1297 * high - the zone may be full of dirty or under-writeback pages, which this
1298 * caller can't do much about. We kick pdflush and take explicit naps in the 1298 * caller can't do much about. We kick pdflush and take explicit naps in the
1299 * hope that some of these pages can be written. But if the allocating task 1299 * hope that some of these pages can be written. But if the allocating task
1300 * holds filesystem locks which prevent writeout this might not work, and the 1300 * holds filesystem locks which prevent writeout this might not work, and the
1301 * allocation attempt will fail. 1301 * allocation attempt will fail.
1302 *
1303 * returns: 0, if no pages reclaimed
1304 * else, the number of pages reclaimed
1302 */ 1305 */
1303 static unsigned long do_try_to_free_pages(struct zonelist *zonelist, 1306 static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
1304 struct scan_control *sc) 1307 struct scan_control *sc)
1305 { 1308 {
1306 int priority; 1309 int priority;
1307 int ret = 0; 1310 int ret = 0;
1308 unsigned long total_scanned = 0; 1311 unsigned long total_scanned = 0;
1309 unsigned long nr_reclaimed = 0; 1312 unsigned long nr_reclaimed = 0;
1310 struct reclaim_state *reclaim_state = current->reclaim_state; 1313 struct reclaim_state *reclaim_state = current->reclaim_state;
1311 unsigned long lru_pages = 0; 1314 unsigned long lru_pages = 0;
1312 struct zoneref *z; 1315 struct zoneref *z;
1313 struct zone *zone; 1316 struct zone *zone;
1314 enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask); 1317 enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask);
1315 1318
1316 if (scan_global_lru(sc)) 1319 if (scan_global_lru(sc))
1317 count_vm_event(ALLOCSTALL); 1320 count_vm_event(ALLOCSTALL);
1318 /* 1321 /*
1319 * mem_cgroup will not do shrink_slab. 1322 * mem_cgroup will not do shrink_slab.
1320 */ 1323 */
1321 if (scan_global_lru(sc)) { 1324 if (scan_global_lru(sc)) {
1322 for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) { 1325 for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
1323 1326
1324 if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL)) 1327 if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
1325 continue; 1328 continue;
1326 1329
1327 lru_pages += zone_page_state(zone, NR_ACTIVE) 1330 lru_pages += zone_page_state(zone, NR_ACTIVE)
1328 + zone_page_state(zone, NR_INACTIVE); 1331 + zone_page_state(zone, NR_INACTIVE);
1329 } 1332 }
1330 } 1333 }
1331 1334
1332 for (priority = DEF_PRIORITY; priority >= 0; priority--) { 1335 for (priority = DEF_PRIORITY; priority >= 0; priority--) {
1333 sc->nr_scanned = 0; 1336 sc->nr_scanned = 0;
1334 if (!priority) 1337 if (!priority)
1335 disable_swap_token(); 1338 disable_swap_token();
1336 nr_reclaimed += shrink_zones(priority, zonelist, sc); 1339 nr_reclaimed += shrink_zones(priority, zonelist, sc);
1337 /* 1340 /*
1338 * Don't shrink slabs when reclaiming memory from 1341 * Don't shrink slabs when reclaiming memory from
1339 * over limit cgroups 1342 * over limit cgroups
1340 */ 1343 */
1341 if (scan_global_lru(sc)) { 1344 if (scan_global_lru(sc)) {
1342 shrink_slab(sc->nr_scanned, sc->gfp_mask, lru_pages); 1345 shrink_slab(sc->nr_scanned, sc->gfp_mask, lru_pages);
1343 if (reclaim_state) { 1346 if (reclaim_state) {
1344 nr_reclaimed += reclaim_state->reclaimed_slab; 1347 nr_reclaimed += reclaim_state->reclaimed_slab;
1345 reclaim_state->reclaimed_slab = 0; 1348 reclaim_state->reclaimed_slab = 0;
1346 } 1349 }
1347 } 1350 }
1348 total_scanned += sc->nr_scanned; 1351 total_scanned += sc->nr_scanned;
1349 if (nr_reclaimed >= sc->swap_cluster_max) { 1352 if (nr_reclaimed >= sc->swap_cluster_max) {
1350 ret = 1; 1353 ret = nr_reclaimed;
1351 goto out; 1354 goto out;
1352 } 1355 }
1353 1356
1354 /* 1357 /*
1355 * Try to write back as many pages as we just scanned. This 1358 * Try to write back as many pages as we just scanned. This
1356 * tends to cause slow streaming writers to write data to the 1359 * tends to cause slow streaming writers to write data to the
1357 * disk smoothly, at the dirtying rate, which is nice. But 1360 * disk smoothly, at the dirtying rate, which is nice. But
1358 * that's undesirable in laptop mode, where we *want* lumpy 1361 * that's undesirable in laptop mode, where we *want* lumpy
1359 * writeout. So in laptop mode, write out the whole world. 1362 * writeout. So in laptop mode, write out the whole world.
1360 */ 1363 */
1361 if (total_scanned > sc->swap_cluster_max + 1364 if (total_scanned > sc->swap_cluster_max +
1362 sc->swap_cluster_max / 2) { 1365 sc->swap_cluster_max / 2) {
1363 wakeup_pdflush(laptop_mode ? 0 : total_scanned); 1366 wakeup_pdflush(laptop_mode ? 0 : total_scanned);
1364 sc->may_writepage = 1; 1367 sc->may_writepage = 1;
1365 } 1368 }
1366 1369
1367 /* Take a nap, wait for some writeback to complete */ 1370 /* Take a nap, wait for some writeback to complete */
1368 if (sc->nr_scanned && priority < DEF_PRIORITY - 2) 1371 if (sc->nr_scanned && priority < DEF_PRIORITY - 2)
1369 congestion_wait(WRITE, HZ/10); 1372 congestion_wait(WRITE, HZ/10);
1370 } 1373 }
1371 /* top priority shrink_caches still had more to do? don't OOM, then */ 1374 /* top priority shrink_caches still had more to do? don't OOM, then */
1372 if (!sc->all_unreclaimable && scan_global_lru(sc)) 1375 if (!sc->all_unreclaimable && scan_global_lru(sc))
1373 ret = 1; 1376 ret = nr_reclaimed;
1374 out: 1377 out:
1375 /* 1378 /*
1376 * Now that we've scanned all the zones at this priority level, note 1379 * Now that we've scanned all the zones at this priority level, note
1377 * that level within the zone so that the next thread which performs 1380 * that level within the zone so that the next thread which performs
1378 * scanning of this zone will immediately start out at this priority 1381 * scanning of this zone will immediately start out at this priority
1379 * level. This affects only the decision whether or not to bring 1382 * level. This affects only the decision whether or not to bring
1380 * mapped pages onto the inactive list. 1383 * mapped pages onto the inactive list.
1381 */ 1384 */
1382 if (priority < 0) 1385 if (priority < 0)
1383 priority = 0; 1386 priority = 0;
1384 1387
1385 if (scan_global_lru(sc)) { 1388 if (scan_global_lru(sc)) {
1386 for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) { 1389 for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
1387 1390
1388 if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL)) 1391 if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
1389 continue; 1392 continue;
1390 1393
1391 zone->prev_priority = priority; 1394 zone->prev_priority = priority;
1392 } 1395 }
1393 } else 1396 } else
1394 mem_cgroup_record_reclaim_priority(sc->mem_cgroup, priority); 1397 mem_cgroup_record_reclaim_priority(sc->mem_cgroup, priority);
1395 1398
1396 return ret; 1399 return ret;
1397 } 1400 }
1398 1401
1399 unsigned long try_to_free_pages(struct zonelist *zonelist, int order, 1402 unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
1400 gfp_t gfp_mask) 1403 gfp_t gfp_mask)
1401 { 1404 {
1402 struct scan_control sc = { 1405 struct scan_control sc = {
1403 .gfp_mask = gfp_mask, 1406 .gfp_mask = gfp_mask,
1404 .may_writepage = !laptop_mode, 1407 .may_writepage = !laptop_mode,
1405 .swap_cluster_max = SWAP_CLUSTER_MAX, 1408 .swap_cluster_max = SWAP_CLUSTER_MAX,
1406 .may_swap = 1, 1409 .may_swap = 1,
1407 .swappiness = vm_swappiness, 1410 .swappiness = vm_swappiness,
1408 .order = order, 1411 .order = order,
1409 .mem_cgroup = NULL, 1412 .mem_cgroup = NULL,
1410 .isolate_pages = isolate_pages_global, 1413 .isolate_pages = isolate_pages_global,
1411 }; 1414 };
1412 1415
1413 return do_try_to_free_pages(zonelist, &sc); 1416 return do_try_to_free_pages(zonelist, &sc);
1414 } 1417 }
1415 1418
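Since do_try_to_free_pages() now returns the number of pages reclaimed rather than a bare 0/1, a caller making a costly-order allocation can keep a running total and stop retrying once at least 1 << order pages have been reclaimed. The sketch below only illustrates that caller-side idea with stand-in names and a fake reclaim function; it is not the actual page allocator code.

#include <stdio.h>

#define PAGE_ALLOC_COSTLY_ORDER 3

/* Stand-in for direct reclaim: pretend each call frees 16 pages. */
static unsigned long fake_try_to_free_pages(int order)
{
	(void)order;
	return 16;
}

int main(void)
{
	int order = 5;			/* e.g. a 32-page allocation */
	unsigned long pages_reclaimed = 0;
	int attempts = 0;

	do {
		pages_reclaimed += fake_try_to_free_pages(order);
		attempts++;
		/* For costly orders, stop once we have reclaimed at least
		 * 1 << order pages in total; lumpy reclaim has then had a
		 * fair chance to assemble a contiguous block. */
	} while (order > PAGE_ALLOC_COSTLY_ORDER &&
		 pages_reclaimed < (1UL << order));

	printf("stopped after %d reclaim passes, %lu pages freed\n",
	       attempts, pages_reclaimed);
	return 0;
}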
1416 #ifdef CONFIG_CGROUP_MEM_RES_CTLR 1419 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
1417 1420
1418 unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont, 1421 unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
1419 gfp_t gfp_mask) 1422 gfp_t gfp_mask)
1420 { 1423 {
1421 struct scan_control sc = { 1424 struct scan_control sc = {
1422 .may_writepage = !laptop_mode, 1425 .may_writepage = !laptop_mode,
1423 .may_swap = 1, 1426 .may_swap = 1,
1424 .swap_cluster_max = SWAP_CLUSTER_MAX, 1427 .swap_cluster_max = SWAP_CLUSTER_MAX,
1425 .swappiness = vm_swappiness, 1428 .swappiness = vm_swappiness,
1426 .order = 0, 1429 .order = 0,
1427 .mem_cgroup = mem_cont, 1430 .mem_cgroup = mem_cont,
1428 .isolate_pages = mem_cgroup_isolate_pages, 1431 .isolate_pages = mem_cgroup_isolate_pages,
1429 }; 1432 };
1430 struct zonelist *zonelist; 1433 struct zonelist *zonelist;
1431 1434
1432 sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) | 1435 sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
1433 (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK); 1436 (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
1434 zonelist = NODE_DATA(numa_node_id())->node_zonelists; 1437 zonelist = NODE_DATA(numa_node_id())->node_zonelists;
1435 return do_try_to_free_pages(zonelist, &sc); 1438 return do_try_to_free_pages(zonelist, &sc);
1436 } 1439 }
1437 #endif 1440 #endif
1438 1441
1439 /* 1442 /*
1440 * For kswapd, balance_pgdat() will work across all this node's zones until 1443 * For kswapd, balance_pgdat() will work across all this node's zones until
1441 * they are all at pages_high. 1444 * they are all at pages_high.
1442 * 1445 *
1443 * Returns the number of pages which were actually freed. 1446 * Returns the number of pages which were actually freed.
1444 * 1447 *
1445 * There is special handling here for zones which are full of pinned pages. 1448 * There is special handling here for zones which are full of pinned pages.
1446 * This can happen if the pages are all mlocked, or if they are all used by 1449 * This can happen if the pages are all mlocked, or if they are all used by
1447 * device drivers (say, ZONE_DMA). Or if they are all in use by hugetlb. 1450 * device drivers (say, ZONE_DMA). Or if they are all in use by hugetlb.
1448 * What we do is to detect the case where all pages in the zone have been 1451 * What we do is to detect the case where all pages in the zone have been
1449 * scanned twice and there has been zero successful reclaim. Mark the zone as 1452 * scanned twice and there has been zero successful reclaim. Mark the zone as
1450 * dead and from now on, only perform a short scan. Basically we're polling 1453 * dead and from now on, only perform a short scan. Basically we're polling
1451 * the zone for when the problem goes away. 1454 * the zone for when the problem goes away.
1452 * 1455 *
1453 * kswapd scans the zones in the highmem->normal->dma direction. It skips 1456 * kswapd scans the zones in the highmem->normal->dma direction. It skips
1454 * zones which have free_pages > pages_high, but once a zone is found to have 1457 * zones which have free_pages > pages_high, but once a zone is found to have
1455 * free_pages <= pages_high, we scan that zone and the lower zones regardless 1458 * free_pages <= pages_high, we scan that zone and the lower zones regardless
1456 * of the number of free pages in the lower zones. This interoperates with 1459 * of the number of free pages in the lower zones. This interoperates with
1457 * the page allocator fallback scheme to ensure that aging of pages is balanced 1460 * the page allocator fallback scheme to ensure that aging of pages is balanced
1458 * across the zones. 1461 * across the zones.
1459 */ 1462 */
1460 static unsigned long balance_pgdat(pg_data_t *pgdat, int order) 1463 static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
1461 { 1464 {
1462 int all_zones_ok; 1465 int all_zones_ok;
1463 int priority; 1466 int priority;
1464 int i; 1467 int i;
1465 unsigned long total_scanned; 1468 unsigned long total_scanned;
1466 unsigned long nr_reclaimed; 1469 unsigned long nr_reclaimed;
1467 struct reclaim_state *reclaim_state = current->reclaim_state; 1470 struct reclaim_state *reclaim_state = current->reclaim_state;
1468 struct scan_control sc = { 1471 struct scan_control sc = {
1469 .gfp_mask = GFP_KERNEL, 1472 .gfp_mask = GFP_KERNEL,
1470 .may_swap = 1, 1473 .may_swap = 1,
1471 .swap_cluster_max = SWAP_CLUSTER_MAX, 1474 .swap_cluster_max = SWAP_CLUSTER_MAX,
1472 .swappiness = vm_swappiness, 1475 .swappiness = vm_swappiness,
1473 .order = order, 1476 .order = order,
1474 .mem_cgroup = NULL, 1477 .mem_cgroup = NULL,
1475 .isolate_pages = isolate_pages_global, 1478 .isolate_pages = isolate_pages_global,
1476 }; 1479 };
1477 /* 1480 /*
1478 * temp_priority is used to remember the scanning priority at which 1481 * temp_priority is used to remember the scanning priority at which
1479 * this zone was successfully refilled to free_pages == pages_high. 1482 * this zone was successfully refilled to free_pages == pages_high.
1480 */ 1483 */
1481 int temp_priority[MAX_NR_ZONES]; 1484 int temp_priority[MAX_NR_ZONES];
1482 1485
1483 loop_again: 1486 loop_again:
1484 total_scanned = 0; 1487 total_scanned = 0;
1485 nr_reclaimed = 0; 1488 nr_reclaimed = 0;
1486 sc.may_writepage = !laptop_mode; 1489 sc.may_writepage = !laptop_mode;
1487 count_vm_event(PAGEOUTRUN); 1490 count_vm_event(PAGEOUTRUN);
1488 1491
1489 for (i = 0; i < pgdat->nr_zones; i++) 1492 for (i = 0; i < pgdat->nr_zones; i++)
1490 temp_priority[i] = DEF_PRIORITY; 1493 temp_priority[i] = DEF_PRIORITY;
1491 1494
1492 for (priority = DEF_PRIORITY; priority >= 0; priority--) { 1495 for (priority = DEF_PRIORITY; priority >= 0; priority--) {
1493 int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */ 1496 int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
1494 unsigned long lru_pages = 0; 1497 unsigned long lru_pages = 0;
1495 1498
1496 /* The swap token gets in the way of swapout... */ 1499 /* The swap token gets in the way of swapout... */
1497 if (!priority) 1500 if (!priority)
1498 disable_swap_token(); 1501 disable_swap_token();
1499 1502
1500 all_zones_ok = 1; 1503 all_zones_ok = 1;
1501 1504
1502 /* 1505 /*
1503 * Scan in the highmem->dma direction for the highest 1506 * Scan in the highmem->dma direction for the highest
1504 * zone which needs scanning 1507 * zone which needs scanning
1505 */ 1508 */
1506 for (i = pgdat->nr_zones - 1; i >= 0; i--) { 1509 for (i = pgdat->nr_zones - 1; i >= 0; i--) {
1507 struct zone *zone = pgdat->node_zones + i; 1510 struct zone *zone = pgdat->node_zones + i;
1508 1511
1509 if (!populated_zone(zone)) 1512 if (!populated_zone(zone))
1510 continue; 1513 continue;
1511 1514
1512 if (zone_is_all_unreclaimable(zone) && 1515 if (zone_is_all_unreclaimable(zone) &&
1513 priority != DEF_PRIORITY) 1516 priority != DEF_PRIORITY)
1514 continue; 1517 continue;
1515 1518
1516 if (!zone_watermark_ok(zone, order, zone->pages_high, 1519 if (!zone_watermark_ok(zone, order, zone->pages_high,
1517 0, 0)) { 1520 0, 0)) {
1518 end_zone = i; 1521 end_zone = i;
1519 break; 1522 break;
1520 } 1523 }
1521 } 1524 }
1522 if (i < 0) 1525 if (i < 0)
1523 goto out; 1526 goto out;
1524 1527
1525 for (i = 0; i <= end_zone; i++) { 1528 for (i = 0; i <= end_zone; i++) {
1526 struct zone *zone = pgdat->node_zones + i; 1529 struct zone *zone = pgdat->node_zones + i;
1527 1530
1528 lru_pages += zone_page_state(zone, NR_ACTIVE) 1531 lru_pages += zone_page_state(zone, NR_ACTIVE)
1529 + zone_page_state(zone, NR_INACTIVE); 1532 + zone_page_state(zone, NR_INACTIVE);
1530 } 1533 }
1531 1534
1532 /* 1535 /*
1533 * Now scan the zone in the dma->highmem direction, stopping 1536 * Now scan the zone in the dma->highmem direction, stopping
1534 * at the last zone which needs scanning. 1537 * at the last zone which needs scanning.
1535 * 1538 *
1536 * We do this because the page allocator works in the opposite 1539 * We do this because the page allocator works in the opposite
1537 * direction. This prevents the page allocator from allocating 1540 * direction. This prevents the page allocator from allocating
1538 * pages behind kswapd's direction of progress, which would 1541 * pages behind kswapd's direction of progress, which would
1539 * cause too much scanning of the lower zones. 1542 * cause too much scanning of the lower zones.
1540 */ 1543 */
1541 for (i = 0; i <= end_zone; i++) { 1544 for (i = 0; i <= end_zone; i++) {
1542 struct zone *zone = pgdat->node_zones + i; 1545 struct zone *zone = pgdat->node_zones + i;
1543 int nr_slab; 1546 int nr_slab;
1544 1547
1545 if (!populated_zone(zone)) 1548 if (!populated_zone(zone))
1546 continue; 1549 continue;
1547 1550
1548 if (zone_is_all_unreclaimable(zone) && 1551 if (zone_is_all_unreclaimable(zone) &&
1549 priority != DEF_PRIORITY) 1552 priority != DEF_PRIORITY)
1550 continue; 1553 continue;
1551 1554
1552 if (!zone_watermark_ok(zone, order, zone->pages_high, 1555 if (!zone_watermark_ok(zone, order, zone->pages_high,
1553 end_zone, 0)) 1556 end_zone, 0))
1554 all_zones_ok = 0; 1557 all_zones_ok = 0;
1555 temp_priority[i] = priority; 1558 temp_priority[i] = priority;
1556 sc.nr_scanned = 0; 1559 sc.nr_scanned = 0;
1557 note_zone_scanning_priority(zone, priority); 1560 note_zone_scanning_priority(zone, priority);
1558 /* 1561 /*
1559 * We put equal pressure on every zone, unless one 1562 * We put equal pressure on every zone, unless one
1560 * zone has way too many pages free already. 1563 * zone has way too many pages free already.
1561 */ 1564 */
1562 if (!zone_watermark_ok(zone, order, 8*zone->pages_high, 1565 if (!zone_watermark_ok(zone, order, 8*zone->pages_high,
1563 end_zone, 0)) 1566 end_zone, 0))
1564 nr_reclaimed += shrink_zone(priority, zone, &sc); 1567 nr_reclaimed += shrink_zone(priority, zone, &sc);
1565 reclaim_state->reclaimed_slab = 0; 1568 reclaim_state->reclaimed_slab = 0;
1566 nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL, 1569 nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
1567 lru_pages); 1570 lru_pages);
1568 nr_reclaimed += reclaim_state->reclaimed_slab; 1571 nr_reclaimed += reclaim_state->reclaimed_slab;
1569 total_scanned += sc.nr_scanned; 1572 total_scanned += sc.nr_scanned;
1570 if (zone_is_all_unreclaimable(zone)) 1573 if (zone_is_all_unreclaimable(zone))
1571 continue; 1574 continue;
1572 if (nr_slab == 0 && zone->pages_scanned >= 1575 if (nr_slab == 0 && zone->pages_scanned >=
1573 (zone_page_state(zone, NR_ACTIVE) 1576 (zone_page_state(zone, NR_ACTIVE)
1574 + zone_page_state(zone, NR_INACTIVE)) * 6) 1577 + zone_page_state(zone, NR_INACTIVE)) * 6)
1575 zone_set_flag(zone, 1578 zone_set_flag(zone,
1576 ZONE_ALL_UNRECLAIMABLE); 1579 ZONE_ALL_UNRECLAIMABLE);
1577 /* 1580 /*
1578 * If we've done a decent amount of scanning and 1581 * If we've done a decent amount of scanning and
1579 * the reclaim ratio is low, start doing writepage 1582 * the reclaim ratio is low, start doing writepage
1580 * even in laptop mode 1583 * even in laptop mode
1581 */ 1584 */
1582 if (total_scanned > SWAP_CLUSTER_MAX * 2 && 1585 if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
1583 total_scanned > nr_reclaimed + nr_reclaimed / 2) 1586 total_scanned > nr_reclaimed + nr_reclaimed / 2)
1584 sc.may_writepage = 1; 1587 sc.may_writepage = 1;
1585 } 1588 }
1586 if (all_zones_ok) 1589 if (all_zones_ok)
1587 break; /* kswapd: all done */ 1590 break; /* kswapd: all done */
1588 /* 1591 /*
1589 * OK, kswapd is getting into trouble. Take a nap, then take 1592 * OK, kswapd is getting into trouble. Take a nap, then take
1590 * another pass across the zones. 1593 * another pass across the zones.
1591 */ 1594 */
1592 if (total_scanned && priority < DEF_PRIORITY - 2) 1595 if (total_scanned && priority < DEF_PRIORITY - 2)
1593 congestion_wait(WRITE, HZ/10); 1596 congestion_wait(WRITE, HZ/10);
1594 1597
1595 /* 1598 /*
1596 * We do this so kswapd doesn't build up large priorities for 1599 * We do this so kswapd doesn't build up large priorities for
1597 * example when it is freeing in parallel with allocators. It 1600 * example when it is freeing in parallel with allocators. It
1598 * matches the direct reclaim path behaviour in terms of impact 1601 * matches the direct reclaim path behaviour in terms of impact
1599 * on zone->*_priority. 1602 * on zone->*_priority.
1600 */ 1603 */
1601 if (nr_reclaimed >= SWAP_CLUSTER_MAX) 1604 if (nr_reclaimed >= SWAP_CLUSTER_MAX)
1602 break; 1605 break;
1603 } 1606 }
1604 out: 1607 out:
1605 /* 1608 /*
1606 * Note within each zone the priority level at which this zone was 1609 * Note within each zone the priority level at which this zone was
1607 * brought into a happy state. So that the next thread which scans this 1610 * brought into a happy state. So that the next thread which scans this
1608 * zone will start out at that priority level. 1611 * zone will start out at that priority level.
1609 */ 1612 */
1610 for (i = 0; i < pgdat->nr_zones; i++) { 1613 for (i = 0; i < pgdat->nr_zones; i++) {
1611 struct zone *zone = pgdat->node_zones + i; 1614 struct zone *zone = pgdat->node_zones + i;
1612 1615
1613 zone->prev_priority = temp_priority[i]; 1616 zone->prev_priority = temp_priority[i];
1614 } 1617 }
1615 if (!all_zones_ok) { 1618 if (!all_zones_ok) {
1616 cond_resched(); 1619 cond_resched();
1617 1620
1618 try_to_freeze(); 1621 try_to_freeze();
1619 1622
1620 goto loop_again; 1623 goto loop_again;
1621 } 1624 }
1622 1625
1623 return nr_reclaimed; 1626 return nr_reclaimed;
1624 } 1627 }
1625 1628
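balance_pgdat() above first scans highmem->dma for the highest zone below its pages_high watermark, then applies pressure from the lowest zone up to and including that one, so the allocator's fallback order cannot keep dirtying zones behind kswapd's progress. A toy sketch of that two-direction scan with invented zone counts; the arrays are stand-ins for the real zone state.

#include <stdio.h>

#define NR_ZONES 3	/* say: DMA, NORMAL, HIGHMEM */

int main(void)
{
	unsigned long free_pages[NR_ZONES] = { 900, 300, 5000 };
	unsigned long pages_high[NR_ZONES] = { 500, 1000, 2000 };
	int i, end_zone = -1;

	/* Scan highmem->dma for the highest zone that needs balancing. */
	for (i = NR_ZONES - 1; i >= 0; i--) {
		if (free_pages[i] < pages_high[i]) {
			end_zone = i;
			break;
		}
	}
	if (end_zone < 0) {
		printf("all zones ok\n");
		return 0;
	}

	/* Apply pressure dma->end_zone, matching the allocator's fallback order. */
	for (i = 0; i <= end_zone; i++)
		printf("shrink zone %d\n", i);
	return 0;
}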
1626 /* 1629 /*
1627 * The background pageout daemon, started as a kernel thread 1630 * The background pageout daemon, started as a kernel thread
1628 * from the init process. 1631 * from the init process.
1629 * 1632 *
1630 * This basically trickles out pages so that we have _some_ 1633 * This basically trickles out pages so that we have _some_
1631 * free memory available even if there is no other activity 1634 * free memory available even if there is no other activity
1632 * that frees anything up. This is needed for things like routing 1635 * that frees anything up. This is needed for things like routing
1633 * etc, where we otherwise might have all activity going on in 1636 * etc, where we otherwise might have all activity going on in
1634 * asynchronous contexts that cannot page things out. 1637 * asynchronous contexts that cannot page things out.
1635 * 1638 *
1636 * If there are applications that are active memory-allocators 1639 * If there are applications that are active memory-allocators
1637 * (most normal use), this basically shouldn't matter. 1640 * (most normal use), this basically shouldn't matter.
1638 */ 1641 */
1639 static int kswapd(void *p) 1642 static int kswapd(void *p)
1640 { 1643 {
1641 unsigned long order; 1644 unsigned long order;
1642 pg_data_t *pgdat = (pg_data_t*)p; 1645 pg_data_t *pgdat = (pg_data_t*)p;
1643 struct task_struct *tsk = current; 1646 struct task_struct *tsk = current;
1644 DEFINE_WAIT(wait); 1647 DEFINE_WAIT(wait);
1645 struct reclaim_state reclaim_state = { 1648 struct reclaim_state reclaim_state = {
1646 .reclaimed_slab = 0, 1649 .reclaimed_slab = 0,
1647 }; 1650 };
1648 node_to_cpumask_ptr(cpumask, pgdat->node_id); 1651 node_to_cpumask_ptr(cpumask, pgdat->node_id);
1649 1652
1650 if (!cpus_empty(*cpumask)) 1653 if (!cpus_empty(*cpumask))
1651 set_cpus_allowed_ptr(tsk, cpumask); 1654 set_cpus_allowed_ptr(tsk, cpumask);
1652 current->reclaim_state = &reclaim_state; 1655 current->reclaim_state = &reclaim_state;
1653 1656
1654 /* 1657 /*
1655 * Tell the memory management that we're a "memory allocator", 1658 * Tell the memory management that we're a "memory allocator",
1656 * and that if we need more memory we should get access to it 1659 * and that if we need more memory we should get access to it
1657 * regardless (see "__alloc_pages()"). "kswapd" should 1660 * regardless (see "__alloc_pages()"). "kswapd" should
1658 * never get caught in the normal page freeing logic. 1661 * never get caught in the normal page freeing logic.
1659 * 1662 *
1660 * (Kswapd normally doesn't need memory anyway, but sometimes 1663 * (Kswapd normally doesn't need memory anyway, but sometimes
1661 * you need a small amount of memory in order to be able to 1664 * you need a small amount of memory in order to be able to
1662 * page out something else, and this flag essentially protects 1665 * page out something else, and this flag essentially protects
1663 * us from recursively trying to free more memory as we're 1666 * us from recursively trying to free more memory as we're
1664 * trying to free the first piece of memory in the first place). 1667 * trying to free the first piece of memory in the first place).
1665 */ 1668 */
1666 tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD; 1669 tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
1667 set_freezable(); 1670 set_freezable();
1668 1671
1669 order = 0; 1672 order = 0;
1670 for ( ; ; ) { 1673 for ( ; ; ) {
1671 unsigned long new_order; 1674 unsigned long new_order;
1672 1675
1673 prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE); 1676 prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
1674 new_order = pgdat->kswapd_max_order; 1677 new_order = pgdat->kswapd_max_order;
1675 pgdat->kswapd_max_order = 0; 1678 pgdat->kswapd_max_order = 0;
1676 if (order < new_order) { 1679 if (order < new_order) {
1677 /* 1680 /*
1678 * Don't sleep if someone wants a larger 'order' 1681 * Don't sleep if someone wants a larger 'order'
1679 * allocation 1682 * allocation
1680 */ 1683 */
1681 order = new_order; 1684 order = new_order;
1682 } else { 1685 } else {
1683 if (!freezing(current)) 1686 if (!freezing(current))
1684 schedule(); 1687 schedule();
1685 1688
1686 order = pgdat->kswapd_max_order; 1689 order = pgdat->kswapd_max_order;
1687 } 1690 }
1688 finish_wait(&pgdat->kswapd_wait, &wait); 1691 finish_wait(&pgdat->kswapd_wait, &wait);
1689 1692
1690 if (!try_to_freeze()) { 1693 if (!try_to_freeze()) {
1691 /* We can speed up thawing tasks if we don't call 1694 /* We can speed up thawing tasks if we don't call
1692 * balance_pgdat after returning from the refrigerator 1695 * balance_pgdat after returning from the refrigerator
1693 */ 1696 */
1694 balance_pgdat(pgdat, order); 1697 balance_pgdat(pgdat, order);
1695 } 1698 }
1696 } 1699 }
1697 return 0; 1700 return 0;
1698 } 1701 }
1699 1702
1700 /* 1703 /*
1701 * A zone is low on free memory, so wake its kswapd task to service it. 1704 * A zone is low on free memory, so wake its kswapd task to service it.
1702 */ 1705 */
1703 void wakeup_kswapd(struct zone *zone, int order) 1706 void wakeup_kswapd(struct zone *zone, int order)
1704 { 1707 {
1705 pg_data_t *pgdat; 1708 pg_data_t *pgdat;
1706 1709
1707 if (!populated_zone(zone)) 1710 if (!populated_zone(zone))
1708 return; 1711 return;
1709 1712
1710 pgdat = zone->zone_pgdat; 1713 pgdat = zone->zone_pgdat;
1711 if (zone_watermark_ok(zone, order, zone->pages_low, 0, 0)) 1714 if (zone_watermark_ok(zone, order, zone->pages_low, 0, 0))
1712 return; 1715 return;
1713 if (pgdat->kswapd_max_order < order) 1716 if (pgdat->kswapd_max_order < order)
1714 pgdat->kswapd_max_order = order; 1717 pgdat->kswapd_max_order = order;
1715 if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL)) 1718 if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
1716 return; 1719 return;
1717 if (!waitqueue_active(&pgdat->kswapd_wait)) 1720 if (!waitqueue_active(&pgdat->kswapd_wait))
1718 return; 1721 return;
1719 wake_up_interruptible(&pgdat->kswapd_wait); 1722 wake_up_interruptible(&pgdat->kswapd_wait);
1720 } 1723 }
1721 1724
1722 #ifdef CONFIG_PM 1725 #ifdef CONFIG_PM
1723 /* 1726 /*
1724 * Helper function for shrink_all_memory(). Tries to reclaim 'nr_pages' pages 1727 * Helper function for shrink_all_memory(). Tries to reclaim 'nr_pages' pages
1725 * from LRU lists system-wide, for given pass and priority, and returns the 1728 * from LRU lists system-wide, for given pass and priority, and returns the
1726 * number of reclaimed pages 1729 * number of reclaimed pages
1727 * 1730 *
1728 * For pass > 3 we also try to shrink the LRU lists that contain a few pages 1731 * For pass > 3 we also try to shrink the LRU lists that contain a few pages
1729 */ 1732 */
1730 static unsigned long shrink_all_zones(unsigned long nr_pages, int prio, 1733 static unsigned long shrink_all_zones(unsigned long nr_pages, int prio,
1731 int pass, struct scan_control *sc) 1734 int pass, struct scan_control *sc)
1732 { 1735 {
1733 struct zone *zone; 1736 struct zone *zone;
1734 unsigned long nr_to_scan, ret = 0; 1737 unsigned long nr_to_scan, ret = 0;
1735 1738
1736 for_each_zone(zone) { 1739 for_each_zone(zone) {
1737 1740
1738 if (!populated_zone(zone)) 1741 if (!populated_zone(zone))
1739 continue; 1742 continue;
1740 1743
1741 if (zone_is_all_unreclaimable(zone) && prio != DEF_PRIORITY) 1744 if (zone_is_all_unreclaimable(zone) && prio != DEF_PRIORITY)
1742 continue; 1745 continue;
1743 1746
1744 /* For pass = 0 we don't shrink the active list */ 1747 /* For pass = 0 we don't shrink the active list */
1745 if (pass > 0) { 1748 if (pass > 0) {
1746 zone->nr_scan_active += 1749 zone->nr_scan_active +=
1747 (zone_page_state(zone, NR_ACTIVE) >> prio) + 1; 1750 (zone_page_state(zone, NR_ACTIVE) >> prio) + 1;
1748 if (zone->nr_scan_active >= nr_pages || pass > 3) { 1751 if (zone->nr_scan_active >= nr_pages || pass > 3) {
1749 zone->nr_scan_active = 0; 1752 zone->nr_scan_active = 0;
1750 nr_to_scan = min(nr_pages, 1753 nr_to_scan = min(nr_pages,
1751 zone_page_state(zone, NR_ACTIVE)); 1754 zone_page_state(zone, NR_ACTIVE));
1752 shrink_active_list(nr_to_scan, zone, sc, prio); 1755 shrink_active_list(nr_to_scan, zone, sc, prio);
1753 } 1756 }
1754 } 1757 }
1755 1758
1756 zone->nr_scan_inactive += 1759 zone->nr_scan_inactive +=
1757 (zone_page_state(zone, NR_INACTIVE) >> prio) + 1; 1760 (zone_page_state(zone, NR_INACTIVE) >> prio) + 1;
1758 if (zone->nr_scan_inactive >= nr_pages || pass > 3) { 1761 if (zone->nr_scan_inactive >= nr_pages || pass > 3) {
1759 zone->nr_scan_inactive = 0; 1762 zone->nr_scan_inactive = 0;
1760 nr_to_scan = min(nr_pages, 1763 nr_to_scan = min(nr_pages,
1761 zone_page_state(zone, NR_INACTIVE)); 1764 zone_page_state(zone, NR_INACTIVE));
1762 ret += shrink_inactive_list(nr_to_scan, zone, sc); 1765 ret += shrink_inactive_list(nr_to_scan, zone, sc);
1763 if (ret >= nr_pages) 1766 if (ret >= nr_pages)
1764 return ret; 1767 return ret;
1765 } 1768 }
1766 } 1769 }
1767 1770
1768 return ret; 1771 return ret;
1769 } 1772 }
1770 1773
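The per-zone accumulation above (add (lru_size >> prio) + 1 each priority round, scan only once the counter reaches nr_pages, or unconditionally on passes above 3) can be seen in isolation in this small sketch with invented sizes; it is an illustration only, not kernel code.

#include <stdio.h>

#define DEF_PRIORITY 12

int main(void)
{
	unsigned long nr_inactive = 4096;	/* pages on a zone's inactive list */
	unsigned long nr_pages = 512;		/* the reclaim target */
	unsigned long nr_scan = 0;
	int pass = 1;				/* pretend this is pass 1 */
	int prio;

	for (prio = DEF_PRIORITY; prio >= 0; prio--) {
		nr_scan += (nr_inactive >> prio) + 1;
		if (nr_scan >= nr_pages || pass > 3) {
			printf("prio %2d: scan the inactive list (accumulated %lu)\n",
			       prio, nr_scan);
			nr_scan = 0;
		}
	}
	return 0;
}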
1771 static unsigned long count_lru_pages(void) 1774 static unsigned long count_lru_pages(void)
1772 { 1775 {
1773 return global_page_state(NR_ACTIVE) + global_page_state(NR_INACTIVE); 1776 return global_page_state(NR_ACTIVE) + global_page_state(NR_INACTIVE);
1774 } 1777 }
1775 1778
1776 /* 1779 /*
1777 * Try to free `nr_pages' of memory, system-wide, and return the number of 1780 * Try to free `nr_pages' of memory, system-wide, and return the number of
1778 * freed pages. 1781 * freed pages.
1779 * 1782 *
1780 * Rather than trying to age LRUs the aim is to preserve the overall 1783 * Rather than trying to age LRUs the aim is to preserve the overall
1781 * LRU order by reclaiming preferentially 1784 * LRU order by reclaiming preferentially
1782 * inactive > active > active referenced > active mapped 1785 * inactive > active > active referenced > active mapped
1783 */ 1786 */
unsigned long shrink_all_memory(unsigned long nr_pages)
{
        unsigned long lru_pages, nr_slab;
        unsigned long ret = 0;
        int pass;
        struct reclaim_state reclaim_state;
        struct scan_control sc = {
                .gfp_mask = GFP_KERNEL,
                .may_swap = 0,
                .swap_cluster_max = nr_pages,
                .may_writepage = 1,
                .swappiness = vm_swappiness,
                .isolate_pages = isolate_pages_global,
        };

        current->reclaim_state = &reclaim_state;

        lru_pages = count_lru_pages();
        nr_slab = global_page_state(NR_SLAB_RECLAIMABLE);
        /* If slab caches are huge, it's better to hit them first */
        while (nr_slab >= lru_pages) {
                reclaim_state.reclaimed_slab = 0;
                shrink_slab(nr_pages, sc.gfp_mask, lru_pages);
                if (!reclaim_state.reclaimed_slab)
                        break;

                ret += reclaim_state.reclaimed_slab;
                if (ret >= nr_pages)
                        goto out;

                nr_slab -= reclaim_state.reclaimed_slab;
        }

        /*
         * We try to shrink LRUs in 5 passes:
         * 0 = Reclaim from inactive_list only
         * 1 = Reclaim from active list but don't reclaim mapped
         * 2 = 2nd pass of type 1
         * 3 = Reclaim mapped (normal reclaim)
         * 4 = 2nd pass of type 3
         */
        for (pass = 0; pass < 5; pass++) {
                int prio;

                /* Force reclaiming mapped pages in the passes #3 and #4 */
                if (pass > 2) {
                        sc.may_swap = 1;
                        sc.swappiness = 100;
                }

                for (prio = DEF_PRIORITY; prio >= 0; prio--) {
                        unsigned long nr_to_scan = nr_pages - ret;

                        sc.nr_scanned = 0;
                        ret += shrink_all_zones(nr_to_scan, prio, pass, &sc);
                        if (ret >= nr_pages)
                                goto out;

                        reclaim_state.reclaimed_slab = 0;
                        shrink_slab(sc.nr_scanned, sc.gfp_mask,
                                        count_lru_pages());
                        ret += reclaim_state.reclaimed_slab;
                        if (ret >= nr_pages)
                                goto out;

                        if (sc.nr_scanned && prio < DEF_PRIORITY - 2)
                                congestion_wait(WRITE, HZ / 10);
                }
        }

        /*
         * If ret = 0, we could not shrink LRUs, but there may be something
         * in slab caches
         */
        if (!ret) {
                do {
                        reclaim_state.reclaimed_slab = 0;
                        shrink_slab(nr_pages, sc.gfp_mask, count_lru_pages());
                        ret += reclaim_state.reclaimed_slab;
                } while (ret < nr_pages && reclaim_state.reclaimed_slab > 0);
        }

out:
        current->reclaim_state = NULL;

        return ret;
}
#endif
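As a rough model of the control flow above: five passes, each sweeping priorities from DEF_PRIORITY down to 0, bailing out as soon as the target is met. The sketch below is illustrative only; fake_reclaim() is a hypothetical stand-in for shrink_all_zones()/shrink_slab() and the numbers are invented:

#include <stdio.h>

#define NR_PASSES       5
#define DEF_PRIORITY    12      /* matches the kernel default of this era */

/* Hypothetical: pretend each sweep frees a bit more under higher pressure. */
static unsigned long fake_reclaim(int pass, int prio)
{
        return (unsigned long)(pass + 1) * (DEF_PRIORITY - prio + 1);
}

static unsigned long model_shrink_all_memory(unsigned long nr_pages)
{
        unsigned long ret = 0;
        int pass, prio;

        for (pass = 0; pass < NR_PASSES; pass++) {
                for (prio = DEF_PRIORITY; prio >= 0; prio--) {
                        ret += fake_reclaim(pass, prio);
                        if (ret >= nr_pages)
                                return ret;     /* target met, stop early */
                }
        }
        return ret;             /* best effort, possibly short of target */
}

int main(void)
{
        printf("freed %lu of 300 requested\n", model_shrink_all_memory(300));
        return 0;
}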
/* It's optimal to keep kswapds on the same CPUs as their memory, but
   not required for correctness. So if the last cpu in a node goes
   away, we get changed to run anywhere: as the first one comes back,
   restore their cpu bindings. */
static int __devinit cpu_callback(struct notifier_block *nfb,
                                  unsigned long action, void *hcpu)
{
        int nid;

        if (action == CPU_ONLINE || action == CPU_ONLINE_FROZEN) {
                for_each_node_state(nid, N_HIGH_MEMORY) {
                        pg_data_t *pgdat = NODE_DATA(nid);
                        node_to_cpumask_ptr(mask, pgdat->node_id);

                        if (any_online_cpu(*mask) < nr_cpu_ids)
                                /* One of our CPUs online: restore mask */
                                set_cpus_allowed_ptr(pgdat->kswapd, mask);
                }
        }
        return NOTIFY_OK;
}
/*
 * This kswapd start function will be called by init and node-hot-add.
 * On node-hot-add, kswapd will be moved to proper cpus if cpus are hot-added.
 */
int kswapd_run(int nid)
{
        pg_data_t *pgdat = NODE_DATA(nid);
        int ret = 0;

        if (pgdat->kswapd)
                return 0;

        pgdat->kswapd = kthread_run(kswapd, pgdat, "kswapd%d", nid);
        if (IS_ERR(pgdat->kswapd)) {
                /* failure at boot is fatal */
                BUG_ON(system_state == SYSTEM_BOOTING);
                printk("Failed to start kswapd on node %d\n", nid);
                ret = -1;
        }
        return ret;
}

static int __init kswapd_init(void)
{
        int nid;

        swap_setup();
        for_each_node_state(nid, N_HIGH_MEMORY)
                kswapd_run(nid);
        hotcpu_notifier(cpu_callback, 0);
        return 0;
}

module_init(kswapd_init)

#ifdef CONFIG_NUMA
/*
 * Zone reclaim mode
 *
 * If non-zero call zone_reclaim when the number of free pages falls below
 * the watermarks.
 */
int zone_reclaim_mode __read_mostly;

#define RECLAIM_OFF 0
#define RECLAIM_ZONE (1<<0)     /* Run shrink_cache on the zone */
#define RECLAIM_WRITE (1<<1)    /* Writeout pages during reclaim */
#define RECLAIM_SWAP (1<<2)     /* Swap pages out during reclaim */

/*
 * Priority for ZONE_RECLAIM. This determines the fraction of pages
 * of a node considered for each zone_reclaim. 4 scans 1/16th of
 * a zone.
 */
#define ZONE_RECLAIM_PRIORITY 4
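The 1/16th figure in the comment follows from the right shift used in the priority-based scan calculations: pages >> 4 == pages / 16. A tiny standalone check, using a hypothetical zone size:

#include <stdio.h>

int main(void)
{
        unsigned long zone_pages = 262144;      /* hypothetical 1GB zone of 4KB pages */
        int priority = 4;                       /* ZONE_RECLAIM_PRIORITY */

        /* pages >> priority is the slice considered per zone_reclaim pass */
        printf("%lu of %lu pages (1/%lu)\n",
               zone_pages >> priority, zone_pages, 1UL << priority);
        return 0;
}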
/*
 * Percentage of pages in a zone that must be unmapped for zone_reclaim to
 * occur.
 */
int sysctl_min_unmapped_ratio = 1;

/*
 * If the number of slab pages in a zone grows beyond this percentage then
 * slab reclaim needs to occur.
 */
int sysctl_min_slab_ratio = 5;
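Assuming these ratios are applied as percentages of a zone's pages to derive the per-zone min_unmapped_pages/min_slab_pages thresholds used below (an assumption for illustration; the conversion happens elsewhere in the tree), the defaults work out roughly as:

#include <stdio.h>

int main(void)
{
        unsigned long zone_pages = 262144;      /* hypothetical zone size in pages */
        int min_unmapped_ratio = 1;             /* sysctl default */
        int min_slab_ratio = 5;                 /* sysctl default */

        /* Assumption: each threshold is ratio% of the zone's pages. */
        printf("min_unmapped_pages ~ %lu\n", zone_pages * min_unmapped_ratio / 100);
        printf("min_slab_pages     ~ %lu\n", zone_pages * min_slab_ratio / 100);
        return 0;
}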
/*
 * Try to free up some pages from this zone through reclaim.
 */
static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
{
        /* Minimum pages needed in order to stay on node */
        const unsigned long nr_pages = 1 << order;
        struct task_struct *p = current;
        struct reclaim_state reclaim_state;
        int priority;
        unsigned long nr_reclaimed = 0;
        struct scan_control sc = {
                .may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
                .may_swap = !!(zone_reclaim_mode & RECLAIM_SWAP),
                .swap_cluster_max = max_t(unsigned long, nr_pages,
                                        SWAP_CLUSTER_MAX),
                .gfp_mask = gfp_mask,
                .swappiness = vm_swappiness,
                .isolate_pages = isolate_pages_global,
        };
        unsigned long slab_reclaimable;

        disable_swap_token();
        cond_resched();
        /*
         * We need to be able to allocate from the reserves for RECLAIM_SWAP
         * and we also need to be able to write out pages for RECLAIM_WRITE
         * and RECLAIM_SWAP.
         */
        p->flags |= PF_MEMALLOC | PF_SWAPWRITE;
        reclaim_state.reclaimed_slab = 0;
        p->reclaim_state = &reclaim_state;

        if (zone_page_state(zone, NR_FILE_PAGES) -
            zone_page_state(zone, NR_FILE_MAPPED) >
            zone->min_unmapped_pages) {
                /*
                 * Free memory by calling shrink zone with increasing
                 * priorities until we have enough memory freed.
                 */
                priority = ZONE_RECLAIM_PRIORITY;
                do {
                        note_zone_scanning_priority(zone, priority);
                        nr_reclaimed += shrink_zone(priority, zone, &sc);
                        priority--;
                } while (priority >= 0 && nr_reclaimed < nr_pages);
        }

        slab_reclaimable = zone_page_state(zone, NR_SLAB_RECLAIMABLE);
        if (slab_reclaimable > zone->min_slab_pages) {
                /*
                 * shrink_slab() does not currently allow us to determine how
                 * many pages were freed in this zone. So we take the current
                 * number of slab pages and shake the slab until it is reduced
                 * by the same nr_pages that we used for reclaiming unmapped
                 * pages.
                 *
                 * Note that shrink_slab will free memory on all zones and may
                 * take a long time.
                 */
                while (shrink_slab(sc.nr_scanned, gfp_mask, order) &&
                        zone_page_state(zone, NR_SLAB_RECLAIMABLE) >
                                slab_reclaimable - nr_pages)
                        ;

                /*
                 * Update nr_reclaimed by the number of slab pages we
                 * reclaimed from this zone.
                 */
                nr_reclaimed += slab_reclaimable -
                        zone_page_state(zone, NR_SLAB_RECLAIMABLE);
        }

        p->reclaim_state = NULL;
        current->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE);
        return nr_reclaimed >= nr_pages;
}
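The slab loop above keeps shaking until the zone's reclaimable-slab counter has dropped by nr_pages or shrink_slab() reports no further progress, and the amount freed is then inferred from the before/after counter difference. A standalone model of that termination condition, with a fake shrinker and invented numbers:

#include <stdio.h>

static unsigned long slab_pages = 5000;         /* hypothetical NR_SLAB_RECLAIMABLE */

/* Fake shrinker: frees up to 100 pages per call, returns 0 when slab is empty. */
static unsigned long fake_shrink_slab(void)
{
        unsigned long freed = slab_pages < 100 ? slab_pages : 100;

        slab_pages -= freed;
        return freed;
}

int main(void)
{
        unsigned long nr_pages = 512;           /* 1 << order for the allocation */
        unsigned long before = slab_pages;

        /* Loop until the counter has dropped by nr_pages or progress stops. */
        while (fake_shrink_slab() && slab_pages > before - nr_pages)
                ;

        printf("slab pages reclaimed: %lu\n", before - slab_pages);
        return 0;
}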
int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
{
        int node_id;
        int ret;

        /*
         * Zone reclaim reclaims unmapped file backed pages and
         * slab pages if we are over the defined limits.
         *
         * A small portion of unmapped file backed pages is needed for
         * file I/O otherwise pages read by file I/O will be immediately
         * thrown out if the zone is overallocated. So we do not reclaim
         * if less than a specified percentage of the zone is used by
         * unmapped file backed pages.
         */
        if (zone_page_state(zone, NR_FILE_PAGES) -
            zone_page_state(zone, NR_FILE_MAPPED) <= zone->min_unmapped_pages
            && zone_page_state(zone, NR_SLAB_RECLAIMABLE)
                        <= zone->min_slab_pages)
                return 0;

        if (zone_is_all_unreclaimable(zone))
                return 0;

        /*
         * Do not scan if the allocation should not be delayed.
         */
        if (!(gfp_mask & __GFP_WAIT) || (current->flags & PF_MEMALLOC))
                return 0;

        /*
         * Only run zone reclaim on the local zone or on zones that do not
         * have associated processors. This will favor the local processor
         * over remote processors and spread off node memory allocations
         * as wide as possible.
         */
        node_id = zone_to_nid(zone);
        if (node_state(node_id, N_CPU) && node_id != numa_node_id())
                return 0;

        if (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
                return 0;
        ret = __zone_reclaim(zone, gfp_mask, order);
        zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);

        return ret;
}
#endif
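Taken together, zone_reclaim() only does real work when every gate above passes. A condensed, illustrative predicate summarizing those checks (hypothetical struct and field names; the per-zone ZONE_RECLAIM_LOCKED trylock is left out):

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical snapshot of the state zone_reclaim() inspects. */
struct reclaim_gate {
        unsigned long unmapped_file_pages;      /* NR_FILE_PAGES - NR_FILE_MAPPED */
        unsigned long min_unmapped_pages;
        unsigned long reclaimable_slab;
        unsigned long min_slab_pages;
        bool zone_all_unreclaimable;
        bool gfp_wait_allowed;                  /* __GFP_WAIT set */
        bool caller_is_memalloc;                /* PF_MEMALLOC set */
        bool zone_is_local_or_cpuless;
};

static bool should_zone_reclaim(const struct reclaim_gate *g)
{
        if (g->unmapped_file_pages <= g->min_unmapped_pages &&
            g->reclaimable_slab <= g->min_slab_pages)
                return false;           /* under both limits: nothing to do */
        if (g->zone_all_unreclaimable)
                return false;
        if (!g->gfp_wait_allowed || g->caller_is_memalloc)
                return false;           /* allocation must not be delayed */
        return g->zone_is_local_or_cpuless;
}

int main(void)
{
        struct reclaim_gate g = {
                .unmapped_file_pages = 4096, .min_unmapped_pages = 2621,
                .reclaimable_slab = 1000, .min_slab_pages = 13107,
                .zone_all_unreclaimable = false,
                .gfp_wait_allowed = true, .caller_is_memalloc = false,
                .zone_is_local_or_cpuless = true,
        };

        printf("reclaim? %s\n", should_zone_reclaim(&g) ? "yes" : "no");
        return 0;
}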