Commit a41f24ea9fd6169b147c53c2392e2887cc1d9247
Committed by: Linus Torvalds
1 parent: ab857d0938
Exists in: master and 4 other branches
page allocator: smarter retry of costly-order allocations
Because of page order checks in __alloc_pages(), hugepage (and similarly large order) allocations will not retry unless explicitly marked __GFP_REPEAT. However, the current retry logic is nearly an infinite loop (or until reclaim makes no progress whatsoever). For these costly allocations, that seems like overkill and could potentially never terminate. Mel observed that allowing current __GFP_REPEAT semantics for hugepage allocations essentially killed the system. I believe this is because we may continue to reclaim small orders of pages all over, but never have enough to satisfy the hugepage allocation request. This is clearly only a problem for large order allocations, of which hugepages are the most obvious (to me).

Modify try_to_free_pages() to indicate how many pages were reclaimed. Use that information in __alloc_pages() to eventually fail a large __GFP_REPEAT allocation when we've reclaimed an order of pages equal to or greater than the allocation's order. This relies on lumpy reclaim functioning as advertised. Due to fragmentation, lumpy reclaim may not be able to free up the order needed in one invocation, so multiple iterations may be required. In other words, the more fragmented memory is, the more retry attempts __GFP_REPEAT will make (particularly for higher order allocations).

This changes the semantics of __GFP_REPEAT subtly, but *only* for allocations > PAGE_ALLOC_COSTLY_ORDER. With this patch, for those size allocations, we will try up to some point (at least 1<<order reclaimed pages), rather than forever (which is the case for allocations <= PAGE_ALLOC_COSTLY_ORDER).

This change improves the /proc/sys/vm/nr_hugepages interface with a follow-on patch that makes pool allocations use __GFP_REPEAT. Rather than administrators repeatedly echoing a particular value into the sysctl, and forcing reclaim into action manually, this change allows for the sysctl to attempt a reasonable effort itself. Similarly, dynamic pool growth should be more successful under load, as lumpy reclaim can try to free up pages, rather than failing right away.

Choosing to reclaim only up to the order of the requested allocation strikes a balance between not failing hugepage allocations and returning to the caller when it's unlikely to ever succeed. Because of lumpy reclaim, if we have freed the order requested, hopefully it has been in big chunks and those chunks will allow our allocation to succeed. If that isn't the case after freeing up the current order, I don't think it is likely to succeed in the future, although it is possible given a particular fragmentation pattern.

Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Tested-by: Mel Gorman <mel@csn.ul.ie>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
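The hunks that implement this policy (in __alloc_pages() and around the try_to_free_pages() return value) fall outside the context lines reproduced below, so the following is only a minimal standalone sketch of the retry decision described above. The should_retry() helper and the FLAG_* constants are illustrative stand-ins rather than kernel identifiers; the part that reflects this commit is the comparison of the cumulative reclaimed page count against 1 << order for costly-order __GFP_REPEAT allocations.

        #include <stdbool.h>

        /* Orders above this are "costly"; 3 in kernels of this era. */
        #define PAGE_ALLOC_COSTLY_ORDER 3

        /* Illustrative stand-ins for the real __GFP_* bits. */
        #define FLAG_NORETRY (1u << 0)
        #define FLAG_REPEAT  (1u << 1)
        #define FLAG_NOFAIL  (1u << 2)

        /*
         * Modelled as being called from the allocator's retry loop after a
         * direct reclaim pass; pages_reclaimed accumulates the count now
         * returned by try_to_free_pages() across iterations.
         */
        static bool should_retry(unsigned int gfp_mask, unsigned int order,
                                 unsigned long pages_reclaimed)
        {
                if (gfp_mask & FLAG_NOFAIL)
                        return true;            /* caller may never fail */
                if (gfp_mask & FLAG_NORETRY)
                        return false;
                if (order <= PAGE_ALLOC_COSTLY_ORDER)
                        return true;            /* small orders: retry as before */
                /*
                 * Costly orders marked __GFP_REPEAT: keep retrying only until
                 * reclaim has freed at least 1 << order pages in total.
                 */
                return (gfp_mask & FLAG_REPEAT) &&
                        pages_reclaimed < (1UL << order);
        }

Under this policy a fragmented system may still need several reclaim passes to accumulate 1 << order freed pages, matching the note above that more fragmentation means more retries, but the loop is now bounded for costly orders.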
Showing 2 changed files with 22 additions and 7 deletions
mm/page_alloc.c
1 | /* | 1 | /* |
2 | * linux/mm/page_alloc.c | 2 | * linux/mm/page_alloc.c |
3 | * | 3 | * |
4 | * Manages the free list, the system allocates free pages here. | 4 | * Manages the free list, the system allocates free pages here. |
5 | * Note that kmalloc() lives in slab.c | 5 | * Note that kmalloc() lives in slab.c |
6 | * | 6 | * |
7 | * Copyright (C) 1991, 1992, 1993, 1994 Linus Torvalds | 7 | * Copyright (C) 1991, 1992, 1993, 1994 Linus Torvalds |
8 | * Swap reorganised 29.12.95, Stephen Tweedie | 8 | * Swap reorganised 29.12.95, Stephen Tweedie |
9 | * Support of BIGMEM added by Gerhard Wichert, Siemens AG, July 1999 | 9 | * Support of BIGMEM added by Gerhard Wichert, Siemens AG, July 1999 |
10 | * Reshaped it to be a zoned allocator, Ingo Molnar, Red Hat, 1999 | 10 | * Reshaped it to be a zoned allocator, Ingo Molnar, Red Hat, 1999 |
11 | * Discontiguous memory support, Kanoj Sarcar, SGI, Nov 1999 | 11 | * Discontiguous memory support, Kanoj Sarcar, SGI, Nov 1999 |
12 | * Zone balancing, Kanoj Sarcar, SGI, Jan 2000 | 12 | * Zone balancing, Kanoj Sarcar, SGI, Jan 2000 |
13 | * Per cpu hot/cold page lists, bulk allocation, Martin J. Bligh, Sept 2002 | 13 | * Per cpu hot/cold page lists, bulk allocation, Martin J. Bligh, Sept 2002 |
14 | * (lots of bits borrowed from Ingo Molnar & Andrew Morton) | 14 | * (lots of bits borrowed from Ingo Molnar & Andrew Morton) |
15 | */ | 15 | */ |
16 | 16 | ||
17 | #include <linux/stddef.h> | 17 | #include <linux/stddef.h> |
18 | #include <linux/mm.h> | 18 | #include <linux/mm.h> |
19 | #include <linux/swap.h> | 19 | #include <linux/swap.h> |
20 | #include <linux/interrupt.h> | 20 | #include <linux/interrupt.h> |
21 | #include <linux/pagemap.h> | 21 | #include <linux/pagemap.h> |
22 | #include <linux/jiffies.h> | 22 | #include <linux/jiffies.h> |
23 | #include <linux/bootmem.h> | 23 | #include <linux/bootmem.h> |
24 | #include <linux/compiler.h> | 24 | #include <linux/compiler.h> |
25 | #include <linux/kernel.h> | 25 | #include <linux/kernel.h> |
26 | #include <linux/module.h> | 26 | #include <linux/module.h> |
27 | #include <linux/suspend.h> | 27 | #include <linux/suspend.h> |
28 | #include <linux/pagevec.h> | 28 | #include <linux/pagevec.h> |
29 | #include <linux/blkdev.h> | 29 | #include <linux/blkdev.h> |
30 | #include <linux/slab.h> | 30 | #include <linux/slab.h> |
31 | #include <linux/oom.h> | 31 | #include <linux/oom.h> |
32 | #include <linux/notifier.h> | 32 | #include <linux/notifier.h> |
33 | #include <linux/topology.h> | 33 | #include <linux/topology.h> |
34 | #include <linux/sysctl.h> | 34 | #include <linux/sysctl.h> |
35 | #include <linux/cpu.h> | 35 | #include <linux/cpu.h> |
36 | #include <linux/cpuset.h> | 36 | #include <linux/cpuset.h> |
37 | #include <linux/memory_hotplug.h> | 37 | #include <linux/memory_hotplug.h> |
38 | #include <linux/nodemask.h> | 38 | #include <linux/nodemask.h> |
39 | #include <linux/vmalloc.h> | 39 | #include <linux/vmalloc.h> |
40 | #include <linux/mempolicy.h> | 40 | #include <linux/mempolicy.h> |
41 | #include <linux/stop_machine.h> | 41 | #include <linux/stop_machine.h> |
42 | #include <linux/sort.h> | 42 | #include <linux/sort.h> |
43 | #include <linux/pfn.h> | 43 | #include <linux/pfn.h> |
44 | #include <linux/backing-dev.h> | 44 | #include <linux/backing-dev.h> |
45 | #include <linux/fault-inject.h> | 45 | #include <linux/fault-inject.h> |
46 | #include <linux/page-isolation.h> | 46 | #include <linux/page-isolation.h> |
47 | #include <linux/memcontrol.h> | 47 | #include <linux/memcontrol.h> |
48 | 48 | ||
49 | #include <asm/tlbflush.h> | 49 | #include <asm/tlbflush.h> |
50 | #include <asm/div64.h> | 50 | #include <asm/div64.h> |
51 | #include "internal.h" | 51 | #include "internal.h" |
52 | 52 | ||
53 | /* | 53 | /* |
54 | * Array of node states. | 54 | * Array of node states. |
55 | */ | 55 | */ |
56 | nodemask_t node_states[NR_NODE_STATES] __read_mostly = { | 56 | nodemask_t node_states[NR_NODE_STATES] __read_mostly = { |
57 | [N_POSSIBLE] = NODE_MASK_ALL, | 57 | [N_POSSIBLE] = NODE_MASK_ALL, |
58 | [N_ONLINE] = { { [0] = 1UL } }, | 58 | [N_ONLINE] = { { [0] = 1UL } }, |
59 | #ifndef CONFIG_NUMA | 59 | #ifndef CONFIG_NUMA |
60 | [N_NORMAL_MEMORY] = { { [0] = 1UL } }, | 60 | [N_NORMAL_MEMORY] = { { [0] = 1UL } }, |
61 | #ifdef CONFIG_HIGHMEM | 61 | #ifdef CONFIG_HIGHMEM |
62 | [N_HIGH_MEMORY] = { { [0] = 1UL } }, | 62 | [N_HIGH_MEMORY] = { { [0] = 1UL } }, |
63 | #endif | 63 | #endif |
64 | [N_CPU] = { { [0] = 1UL } }, | 64 | [N_CPU] = { { [0] = 1UL } }, |
65 | #endif /* NUMA */ | 65 | #endif /* NUMA */ |
66 | }; | 66 | }; |
67 | EXPORT_SYMBOL(node_states); | 67 | EXPORT_SYMBOL(node_states); |
68 | 68 | ||
69 | unsigned long totalram_pages __read_mostly; | 69 | unsigned long totalram_pages __read_mostly; |
70 | unsigned long totalreserve_pages __read_mostly; | 70 | unsigned long totalreserve_pages __read_mostly; |
71 | long nr_swap_pages; | 71 | long nr_swap_pages; |
72 | int percpu_pagelist_fraction; | 72 | int percpu_pagelist_fraction; |
73 | 73 | ||
74 | #ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE | 74 | #ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE |
75 | int pageblock_order __read_mostly; | 75 | int pageblock_order __read_mostly; |
76 | #endif | 76 | #endif |
77 | 77 | ||
78 | static void __free_pages_ok(struct page *page, unsigned int order); | 78 | static void __free_pages_ok(struct page *page, unsigned int order); |
79 | 79 | ||
80 | /* | 80 | /* |
81 | * results with 256, 32 in the lowmem_reserve sysctl: | 81 | * results with 256, 32 in the lowmem_reserve sysctl: |
82 | * 1G machine -> (16M dma, 800M-16M normal, 1G-800M high) | 82 | * 1G machine -> (16M dma, 800M-16M normal, 1G-800M high) |
83 | * 1G machine -> (16M dma, 784M normal, 224M high) | 83 | * 1G machine -> (16M dma, 784M normal, 224M high) |
84 | * NORMAL allocation will leave 784M/256 of ram reserved in the ZONE_DMA | 84 | * NORMAL allocation will leave 784M/256 of ram reserved in the ZONE_DMA |
85 | * HIGHMEM allocation will leave 224M/32 of ram reserved in ZONE_NORMAL | 85 | * HIGHMEM allocation will leave 224M/32 of ram reserved in ZONE_NORMAL |
86 | * HIGHMEM allocation will (224M+784M)/256 of ram reserved in ZONE_DMA | 86 | * HIGHMEM allocation will (224M+784M)/256 of ram reserved in ZONE_DMA |
87 | * | 87 | * |
88 | * TBD: should special case ZONE_DMA32 machines here - in those we normally | 88 | * TBD: should special case ZONE_DMA32 machines here - in those we normally |
89 | * don't need any ZONE_NORMAL reservation | 89 | * don't need any ZONE_NORMAL reservation |
90 | */ | 90 | */ |
91 | int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1] = { | 91 | int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1] = { |
92 | #ifdef CONFIG_ZONE_DMA | 92 | #ifdef CONFIG_ZONE_DMA |
93 | 256, | 93 | 256, |
94 | #endif | 94 | #endif |
95 | #ifdef CONFIG_ZONE_DMA32 | 95 | #ifdef CONFIG_ZONE_DMA32 |
96 | 256, | 96 | 256, |
97 | #endif | 97 | #endif |
98 | #ifdef CONFIG_HIGHMEM | 98 | #ifdef CONFIG_HIGHMEM |
99 | 32, | 99 | 32, |
100 | #endif | 100 | #endif |
101 | 32, | 101 | 32, |
102 | }; | 102 | }; |
103 | 103 | ||
104 | EXPORT_SYMBOL(totalram_pages); | 104 | EXPORT_SYMBOL(totalram_pages); |
105 | 105 | ||
106 | static char * const zone_names[MAX_NR_ZONES] = { | 106 | static char * const zone_names[MAX_NR_ZONES] = { |
107 | #ifdef CONFIG_ZONE_DMA | 107 | #ifdef CONFIG_ZONE_DMA |
108 | "DMA", | 108 | "DMA", |
109 | #endif | 109 | #endif |
110 | #ifdef CONFIG_ZONE_DMA32 | 110 | #ifdef CONFIG_ZONE_DMA32 |
111 | "DMA32", | 111 | "DMA32", |
112 | #endif | 112 | #endif |
113 | "Normal", | 113 | "Normal", |
114 | #ifdef CONFIG_HIGHMEM | 114 | #ifdef CONFIG_HIGHMEM |
115 | "HighMem", | 115 | "HighMem", |
116 | #endif | 116 | #endif |
117 | "Movable", | 117 | "Movable", |
118 | }; | 118 | }; |
119 | 119 | ||
120 | int min_free_kbytes = 1024; | 120 | int min_free_kbytes = 1024; |
121 | 121 | ||
122 | unsigned long __meminitdata nr_kernel_pages; | 122 | unsigned long __meminitdata nr_kernel_pages; |
123 | unsigned long __meminitdata nr_all_pages; | 123 | unsigned long __meminitdata nr_all_pages; |
124 | static unsigned long __meminitdata dma_reserve; | 124 | static unsigned long __meminitdata dma_reserve; |
125 | 125 | ||
126 | #ifdef CONFIG_ARCH_POPULATES_NODE_MAP | 126 | #ifdef CONFIG_ARCH_POPULATES_NODE_MAP |
127 | /* | 127 | /* |
128 | * MAX_ACTIVE_REGIONS determines the maximum number of distinct | 128 | * MAX_ACTIVE_REGIONS determines the maximum number of distinct |
129 | * ranges of memory (RAM) that may be registered with add_active_range(). | 129 | * ranges of memory (RAM) that may be registered with add_active_range(). |
130 | * Ranges passed to add_active_range() will be merged if possible | 130 | * Ranges passed to add_active_range() will be merged if possible |
131 | * so the number of times add_active_range() can be called is | 131 | * so the number of times add_active_range() can be called is |
132 | * related to the number of nodes and the number of holes | 132 | * related to the number of nodes and the number of holes |
133 | */ | 133 | */ |
134 | #ifdef CONFIG_MAX_ACTIVE_REGIONS | 134 | #ifdef CONFIG_MAX_ACTIVE_REGIONS |
135 | /* Allow an architecture to set MAX_ACTIVE_REGIONS to save memory */ | 135 | /* Allow an architecture to set MAX_ACTIVE_REGIONS to save memory */ |
136 | #define MAX_ACTIVE_REGIONS CONFIG_MAX_ACTIVE_REGIONS | 136 | #define MAX_ACTIVE_REGIONS CONFIG_MAX_ACTIVE_REGIONS |
137 | #else | 137 | #else |
138 | #if MAX_NUMNODES >= 32 | 138 | #if MAX_NUMNODES >= 32 |
139 | /* If there can be many nodes, allow up to 50 holes per node */ | 139 | /* If there can be many nodes, allow up to 50 holes per node */ |
140 | #define MAX_ACTIVE_REGIONS (MAX_NUMNODES*50) | 140 | #define MAX_ACTIVE_REGIONS (MAX_NUMNODES*50) |
141 | #else | 141 | #else |
142 | /* By default, allow up to 256 distinct regions */ | 142 | /* By default, allow up to 256 distinct regions */ |
143 | #define MAX_ACTIVE_REGIONS 256 | 143 | #define MAX_ACTIVE_REGIONS 256 |
144 | #endif | 144 | #endif |
145 | #endif | 145 | #endif |
146 | 146 | ||
147 | static struct node_active_region __meminitdata early_node_map[MAX_ACTIVE_REGIONS]; | 147 | static struct node_active_region __meminitdata early_node_map[MAX_ACTIVE_REGIONS]; |
148 | static int __meminitdata nr_nodemap_entries; | 148 | static int __meminitdata nr_nodemap_entries; |
149 | static unsigned long __meminitdata arch_zone_lowest_possible_pfn[MAX_NR_ZONES]; | 149 | static unsigned long __meminitdata arch_zone_lowest_possible_pfn[MAX_NR_ZONES]; |
150 | static unsigned long __meminitdata arch_zone_highest_possible_pfn[MAX_NR_ZONES]; | 150 | static unsigned long __meminitdata arch_zone_highest_possible_pfn[MAX_NR_ZONES]; |
151 | #ifdef CONFIG_MEMORY_HOTPLUG_RESERVE | 151 | #ifdef CONFIG_MEMORY_HOTPLUG_RESERVE |
152 | static unsigned long __meminitdata node_boundary_start_pfn[MAX_NUMNODES]; | 152 | static unsigned long __meminitdata node_boundary_start_pfn[MAX_NUMNODES]; |
153 | static unsigned long __meminitdata node_boundary_end_pfn[MAX_NUMNODES]; | 153 | static unsigned long __meminitdata node_boundary_end_pfn[MAX_NUMNODES]; |
154 | #endif /* CONFIG_MEMORY_HOTPLUG_RESERVE */ | 154 | #endif /* CONFIG_MEMORY_HOTPLUG_RESERVE */ |
155 | unsigned long __initdata required_kernelcore; | 155 | unsigned long __initdata required_kernelcore; |
156 | static unsigned long __initdata required_movablecore; | 156 | static unsigned long __initdata required_movablecore; |
157 | unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES]; | 157 | unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES]; |
158 | 158 | ||
159 | /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */ | 159 | /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */ |
160 | int movable_zone; | 160 | int movable_zone; |
161 | EXPORT_SYMBOL(movable_zone); | 161 | EXPORT_SYMBOL(movable_zone); |
162 | #endif /* CONFIG_ARCH_POPULATES_NODE_MAP */ | 162 | #endif /* CONFIG_ARCH_POPULATES_NODE_MAP */ |
163 | 163 | ||
164 | #if MAX_NUMNODES > 1 | 164 | #if MAX_NUMNODES > 1 |
165 | int nr_node_ids __read_mostly = MAX_NUMNODES; | 165 | int nr_node_ids __read_mostly = MAX_NUMNODES; |
166 | EXPORT_SYMBOL(nr_node_ids); | 166 | EXPORT_SYMBOL(nr_node_ids); |
167 | #endif | 167 | #endif |
168 | 168 | ||
169 | int page_group_by_mobility_disabled __read_mostly; | 169 | int page_group_by_mobility_disabled __read_mostly; |
170 | 170 | ||
171 | static void set_pageblock_migratetype(struct page *page, int migratetype) | 171 | static void set_pageblock_migratetype(struct page *page, int migratetype) |
172 | { | 172 | { |
173 | set_pageblock_flags_group(page, (unsigned long)migratetype, | 173 | set_pageblock_flags_group(page, (unsigned long)migratetype, |
174 | PB_migrate, PB_migrate_end); | 174 | PB_migrate, PB_migrate_end); |
175 | } | 175 | } |
176 | 176 | ||
177 | #ifdef CONFIG_DEBUG_VM | 177 | #ifdef CONFIG_DEBUG_VM |
178 | static int page_outside_zone_boundaries(struct zone *zone, struct page *page) | 178 | static int page_outside_zone_boundaries(struct zone *zone, struct page *page) |
179 | { | 179 | { |
180 | int ret = 0; | 180 | int ret = 0; |
181 | unsigned seq; | 181 | unsigned seq; |
182 | unsigned long pfn = page_to_pfn(page); | 182 | unsigned long pfn = page_to_pfn(page); |
183 | 183 | ||
184 | do { | 184 | do { |
185 | seq = zone_span_seqbegin(zone); | 185 | seq = zone_span_seqbegin(zone); |
186 | if (pfn >= zone->zone_start_pfn + zone->spanned_pages) | 186 | if (pfn >= zone->zone_start_pfn + zone->spanned_pages) |
187 | ret = 1; | 187 | ret = 1; |
188 | else if (pfn < zone->zone_start_pfn) | 188 | else if (pfn < zone->zone_start_pfn) |
189 | ret = 1; | 189 | ret = 1; |
190 | } while (zone_span_seqretry(zone, seq)); | 190 | } while (zone_span_seqretry(zone, seq)); |
191 | 191 | ||
192 | return ret; | 192 | return ret; |
193 | } | 193 | } |
194 | 194 | ||
195 | static int page_is_consistent(struct zone *zone, struct page *page) | 195 | static int page_is_consistent(struct zone *zone, struct page *page) |
196 | { | 196 | { |
197 | if (!pfn_valid_within(page_to_pfn(page))) | 197 | if (!pfn_valid_within(page_to_pfn(page))) |
198 | return 0; | 198 | return 0; |
199 | if (zone != page_zone(page)) | 199 | if (zone != page_zone(page)) |
200 | return 0; | 200 | return 0; |
201 | 201 | ||
202 | return 1; | 202 | return 1; |
203 | } | 203 | } |
204 | /* | 204 | /* |
205 | * Temporary debugging check for pages not lying within a given zone. | 205 | * Temporary debugging check for pages not lying within a given zone. |
206 | */ | 206 | */ |
207 | static int bad_range(struct zone *zone, struct page *page) | 207 | static int bad_range(struct zone *zone, struct page *page) |
208 | { | 208 | { |
209 | if (page_outside_zone_boundaries(zone, page)) | 209 | if (page_outside_zone_boundaries(zone, page)) |
210 | return 1; | 210 | return 1; |
211 | if (!page_is_consistent(zone, page)) | 211 | if (!page_is_consistent(zone, page)) |
212 | return 1; | 212 | return 1; |
213 | 213 | ||
214 | return 0; | 214 | return 0; |
215 | } | 215 | } |
216 | #else | 216 | #else |
217 | static inline int bad_range(struct zone *zone, struct page *page) | 217 | static inline int bad_range(struct zone *zone, struct page *page) |
218 | { | 218 | { |
219 | return 0; | 219 | return 0; |
220 | } | 220 | } |
221 | #endif | 221 | #endif |
222 | 222 | ||
223 | static void bad_page(struct page *page) | 223 | static void bad_page(struct page *page) |
224 | { | 224 | { |
225 | void *pc = page_get_page_cgroup(page); | 225 | void *pc = page_get_page_cgroup(page); |
226 | 226 | ||
227 | printk(KERN_EMERG "Bad page state in process '%s'\n" KERN_EMERG | 227 | printk(KERN_EMERG "Bad page state in process '%s'\n" KERN_EMERG |
228 | "page:%p flags:0x%0*lx mapping:%p mapcount:%d count:%d\n", | 228 | "page:%p flags:0x%0*lx mapping:%p mapcount:%d count:%d\n", |
229 | current->comm, page, (int)(2*sizeof(unsigned long)), | 229 | current->comm, page, (int)(2*sizeof(unsigned long)), |
230 | (unsigned long)page->flags, page->mapping, | 230 | (unsigned long)page->flags, page->mapping, |
231 | page_mapcount(page), page_count(page)); | 231 | page_mapcount(page), page_count(page)); |
232 | if (pc) { | 232 | if (pc) { |
233 | printk(KERN_EMERG "cgroup:%p\n", pc); | 233 | printk(KERN_EMERG "cgroup:%p\n", pc); |
234 | page_reset_bad_cgroup(page); | 234 | page_reset_bad_cgroup(page); |
235 | } | 235 | } |
236 | printk(KERN_EMERG "Trying to fix it up, but a reboot is needed\n" | 236 | printk(KERN_EMERG "Trying to fix it up, but a reboot is needed\n" |
237 | KERN_EMERG "Backtrace:\n"); | 237 | KERN_EMERG "Backtrace:\n"); |
238 | dump_stack(); | 238 | dump_stack(); |
239 | page->flags &= ~(1 << PG_lru | | 239 | page->flags &= ~(1 << PG_lru | |
240 | 1 << PG_private | | 240 | 1 << PG_private | |
241 | 1 << PG_locked | | 241 | 1 << PG_locked | |
242 | 1 << PG_active | | 242 | 1 << PG_active | |
243 | 1 << PG_dirty | | 243 | 1 << PG_dirty | |
244 | 1 << PG_reclaim | | 244 | 1 << PG_reclaim | |
245 | 1 << PG_slab | | 245 | 1 << PG_slab | |
246 | 1 << PG_swapcache | | 246 | 1 << PG_swapcache | |
247 | 1 << PG_writeback | | 247 | 1 << PG_writeback | |
248 | 1 << PG_buddy ); | 248 | 1 << PG_buddy ); |
249 | set_page_count(page, 0); | 249 | set_page_count(page, 0); |
250 | reset_page_mapcount(page); | 250 | reset_page_mapcount(page); |
251 | page->mapping = NULL; | 251 | page->mapping = NULL; |
252 | add_taint(TAINT_BAD_PAGE); | 252 | add_taint(TAINT_BAD_PAGE); |
253 | } | 253 | } |
254 | 254 | ||
255 | /* | 255 | /* |
256 | * Higher-order pages are called "compound pages". They are structured thusly: | 256 | * Higher-order pages are called "compound pages". They are structured thusly: |
257 | * | 257 | * |
258 | * The first PAGE_SIZE page is called the "head page". | 258 | * The first PAGE_SIZE page is called the "head page". |
259 | * | 259 | * |
260 | * The remaining PAGE_SIZE pages are called "tail pages". | 260 | * The remaining PAGE_SIZE pages are called "tail pages". |
261 | * | 261 | * |
262 | * All pages have PG_compound set. All pages have their ->private pointing at | 262 | * All pages have PG_compound set. All pages have their ->private pointing at |
263 | * the head page (even the head page has this). | 263 | * the head page (even the head page has this). |
264 | * | 264 | * |
265 | * The first tail page's ->lru.next holds the address of the compound page's | 265 | * The first tail page's ->lru.next holds the address of the compound page's |
266 | * put_page() function. Its ->lru.prev holds the order of allocation. | 266 | * put_page() function. Its ->lru.prev holds the order of allocation. |
267 | * This usage means that zero-order pages may not be compound. | 267 | * This usage means that zero-order pages may not be compound. |
268 | */ | 268 | */ |
269 | 269 | ||
270 | static void free_compound_page(struct page *page) | 270 | static void free_compound_page(struct page *page) |
271 | { | 271 | { |
272 | __free_pages_ok(page, compound_order(page)); | 272 | __free_pages_ok(page, compound_order(page)); |
273 | } | 273 | } |
274 | 274 | ||
275 | static void prep_compound_page(struct page *page, unsigned long order) | 275 | static void prep_compound_page(struct page *page, unsigned long order) |
276 | { | 276 | { |
277 | int i; | 277 | int i; |
278 | int nr_pages = 1 << order; | 278 | int nr_pages = 1 << order; |
279 | 279 | ||
280 | set_compound_page_dtor(page, free_compound_page); | 280 | set_compound_page_dtor(page, free_compound_page); |
281 | set_compound_order(page, order); | 281 | set_compound_order(page, order); |
282 | __SetPageHead(page); | 282 | __SetPageHead(page); |
283 | for (i = 1; i < nr_pages; i++) { | 283 | for (i = 1; i < nr_pages; i++) { |
284 | struct page *p = page + i; | 284 | struct page *p = page + i; |
285 | 285 | ||
286 | __SetPageTail(p); | 286 | __SetPageTail(p); |
287 | p->first_page = page; | 287 | p->first_page = page; |
288 | } | 288 | } |
289 | } | 289 | } |
290 | 290 | ||
291 | static void destroy_compound_page(struct page *page, unsigned long order) | 291 | static void destroy_compound_page(struct page *page, unsigned long order) |
292 | { | 292 | { |
293 | int i; | 293 | int i; |
294 | int nr_pages = 1 << order; | 294 | int nr_pages = 1 << order; |
295 | 295 | ||
296 | if (unlikely(compound_order(page) != order)) | 296 | if (unlikely(compound_order(page) != order)) |
297 | bad_page(page); | 297 | bad_page(page); |
298 | 298 | ||
299 | if (unlikely(!PageHead(page))) | 299 | if (unlikely(!PageHead(page))) |
300 | bad_page(page); | 300 | bad_page(page); |
301 | __ClearPageHead(page); | 301 | __ClearPageHead(page); |
302 | for (i = 1; i < nr_pages; i++) { | 302 | for (i = 1; i < nr_pages; i++) { |
303 | struct page *p = page + i; | 303 | struct page *p = page + i; |
304 | 304 | ||
305 | if (unlikely(!PageTail(p) | | 305 | if (unlikely(!PageTail(p) | |
306 | (p->first_page != page))) | 306 | (p->first_page != page))) |
307 | bad_page(page); | 307 | bad_page(page); |
308 | __ClearPageTail(p); | 308 | __ClearPageTail(p); |
309 | } | 309 | } |
310 | } | 310 | } |
311 | 311 | ||
312 | static inline void prep_zero_page(struct page *page, int order, gfp_t gfp_flags) | 312 | static inline void prep_zero_page(struct page *page, int order, gfp_t gfp_flags) |
313 | { | 313 | { |
314 | int i; | 314 | int i; |
315 | 315 | ||
316 | /* | 316 | /* |
317 | * clear_highpage() will use KM_USER0, so it's a bug to use __GFP_ZERO | 317 | * clear_highpage() will use KM_USER0, so it's a bug to use __GFP_ZERO |
318 | * and __GFP_HIGHMEM from hard or soft interrupt context. | 318 | * and __GFP_HIGHMEM from hard or soft interrupt context. |
319 | */ | 319 | */ |
320 | VM_BUG_ON((gfp_flags & __GFP_HIGHMEM) && in_interrupt()); | 320 | VM_BUG_ON((gfp_flags & __GFP_HIGHMEM) && in_interrupt()); |
321 | for (i = 0; i < (1 << order); i++) | 321 | for (i = 0; i < (1 << order); i++) |
322 | clear_highpage(page + i); | 322 | clear_highpage(page + i); |
323 | } | 323 | } |
324 | 324 | ||
325 | static inline void set_page_order(struct page *page, int order) | 325 | static inline void set_page_order(struct page *page, int order) |
326 | { | 326 | { |
327 | set_page_private(page, order); | 327 | set_page_private(page, order); |
328 | __SetPageBuddy(page); | 328 | __SetPageBuddy(page); |
329 | } | 329 | } |
330 | 330 | ||
331 | static inline void rmv_page_order(struct page *page) | 331 | static inline void rmv_page_order(struct page *page) |
332 | { | 332 | { |
333 | __ClearPageBuddy(page); | 333 | __ClearPageBuddy(page); |
334 | set_page_private(page, 0); | 334 | set_page_private(page, 0); |
335 | } | 335 | } |
336 | 336 | ||
337 | /* | 337 | /* |
338 | * Locate the struct page for both the matching buddy in our | 338 | * Locate the struct page for both the matching buddy in our |
339 | * pair (buddy1) and the combined O(n+1) page they form (page). | 339 | * pair (buddy1) and the combined O(n+1) page they form (page). |
340 | * | 340 | * |
341 | * 1) Any buddy B1 will have an order O twin B2 which satisfies | 341 | * 1) Any buddy B1 will have an order O twin B2 which satisfies |
342 | * the following equation: | 342 | * the following equation: |
343 | * B2 = B1 ^ (1 << O) | 343 | * B2 = B1 ^ (1 << O) |
344 | * For example, if the starting buddy (buddy2) is #8 its order | 344 | * For example, if the starting buddy (buddy2) is #8 its order |
345 | * 1 buddy is #10: | 345 | * 1 buddy is #10: |
346 | * B2 = 8 ^ (1 << 1) = 8 ^ 2 = 10 | 346 | * B2 = 8 ^ (1 << 1) = 8 ^ 2 = 10 |
347 | * | 347 | * |
348 | * 2) Any buddy B will have an order O+1 parent P which | 348 | * 2) Any buddy B will have an order O+1 parent P which |
349 | * satisfies the following equation: | 349 | * satisfies the following equation: |
350 | * P = B & ~(1 << O) | 350 | * P = B & ~(1 << O) |
351 | * | 351 | * |
352 | * Assumption: *_mem_map is contiguous at least up to MAX_ORDER | 352 | * Assumption: *_mem_map is contiguous at least up to MAX_ORDER |
353 | */ | 353 | */ |
354 | static inline struct page * | 354 | static inline struct page * |
355 | __page_find_buddy(struct page *page, unsigned long page_idx, unsigned int order) | 355 | __page_find_buddy(struct page *page, unsigned long page_idx, unsigned int order) |
356 | { | 356 | { |
357 | unsigned long buddy_idx = page_idx ^ (1 << order); | 357 | unsigned long buddy_idx = page_idx ^ (1 << order); |
358 | 358 | ||
359 | return page + (buddy_idx - page_idx); | 359 | return page + (buddy_idx - page_idx); |
360 | } | 360 | } |
361 | 361 | ||
362 | static inline unsigned long | 362 | static inline unsigned long |
363 | __find_combined_index(unsigned long page_idx, unsigned int order) | 363 | __find_combined_index(unsigned long page_idx, unsigned int order) |
364 | { | 364 | { |
365 | return (page_idx & ~(1 << order)); | 365 | return (page_idx & ~(1 << order)); |
366 | } | 366 | } |
367 | 367 | ||
368 | /* | 368 | /* |
369 | * This function checks whether a page is free && is the buddy | 369 | * This function checks whether a page is free && is the buddy |
370 | * we can do coalesce a page and its buddy if | 370 | * we can do coalesce a page and its buddy if |
371 | * (a) the buddy is not in a hole && | 371 | * (a) the buddy is not in a hole && |
372 | * (b) the buddy is in the buddy system && | 372 | * (b) the buddy is in the buddy system && |
373 | * (c) a page and its buddy have the same order && | 373 | * (c) a page and its buddy have the same order && |
374 | * (d) a page and its buddy are in the same zone. | 374 | * (d) a page and its buddy are in the same zone. |
375 | * | 375 | * |
376 | * For recording whether a page is in the buddy system, we use PG_buddy. | 376 | * For recording whether a page is in the buddy system, we use PG_buddy. |
377 | * Setting, clearing, and testing PG_buddy is serialized by zone->lock. | 377 | * Setting, clearing, and testing PG_buddy is serialized by zone->lock. |
378 | * | 378 | * |
379 | * For recording page's order, we use page_private(page). | 379 | * For recording page's order, we use page_private(page). |
380 | */ | 380 | */ |
381 | static inline int page_is_buddy(struct page *page, struct page *buddy, | 381 | static inline int page_is_buddy(struct page *page, struct page *buddy, |
382 | int order) | 382 | int order) |
383 | { | 383 | { |
384 | if (!pfn_valid_within(page_to_pfn(buddy))) | 384 | if (!pfn_valid_within(page_to_pfn(buddy))) |
385 | return 0; | 385 | return 0; |
386 | 386 | ||
387 | if (page_zone_id(page) != page_zone_id(buddy)) | 387 | if (page_zone_id(page) != page_zone_id(buddy)) |
388 | return 0; | 388 | return 0; |
389 | 389 | ||
390 | if (PageBuddy(buddy) && page_order(buddy) == order) { | 390 | if (PageBuddy(buddy) && page_order(buddy) == order) { |
391 | BUG_ON(page_count(buddy) != 0); | 391 | BUG_ON(page_count(buddy) != 0); |
392 | return 1; | 392 | return 1; |
393 | } | 393 | } |
394 | return 0; | 394 | return 0; |
395 | } | 395 | } |
396 | 396 | ||
397 | /* | 397 | /* |
398 | * Freeing function for a buddy system allocator. | 398 | * Freeing function for a buddy system allocator. |
399 | * | 399 | * |
400 | * The concept of a buddy system is to maintain direct-mapped table | 400 | * The concept of a buddy system is to maintain direct-mapped table |
401 | * (containing bit values) for memory blocks of various "orders". | 401 | * (containing bit values) for memory blocks of various "orders". |
402 | * The bottom level table contains the map for the smallest allocatable | 402 | * The bottom level table contains the map for the smallest allocatable |
403 | * units of memory (here, pages), and each level above it describes | 403 | * units of memory (here, pages), and each level above it describes |
404 | * pairs of units from the levels below, hence, "buddies". | 404 | * pairs of units from the levels below, hence, "buddies". |
405 | * At a high level, all that happens here is marking the table entry | 405 | * At a high level, all that happens here is marking the table entry |
406 | * at the bottom level available, and propagating the changes upward | 406 | * at the bottom level available, and propagating the changes upward |
407 | * as necessary, plus some accounting needed to play nicely with other | 407 | * as necessary, plus some accounting needed to play nicely with other |
408 | * parts of the VM system. | 408 | * parts of the VM system. |
409 | * At each level, we keep a list of pages, which are heads of continuous | 409 | * At each level, we keep a list of pages, which are heads of continuous |
410 | * free pages of length of (1 << order) and marked with PG_buddy. Page's | 410 | * free pages of length of (1 << order) and marked with PG_buddy. Page's |
411 | * order is recorded in page_private(page) field. | 411 | * order is recorded in page_private(page) field. |
412 | * So when we are allocating or freeing one, we can derive the state of the | 412 | * So when we are allocating or freeing one, we can derive the state of the |
413 | * other. That is, if we allocate a small block, and both were | 413 | * other. That is, if we allocate a small block, and both were |
414 | * free, the remainder of the region must be split into blocks. | 414 | * free, the remainder of the region must be split into blocks. |
415 | * If a block is freed, and its buddy is also free, then this | 415 | * If a block is freed, and its buddy is also free, then this |
416 | * triggers coalescing into a block of larger size. | 416 | * triggers coalescing into a block of larger size. |
417 | * | 417 | * |
418 | * -- wli | 418 | * -- wli |
419 | */ | 419 | */ |
420 | 420 | ||
421 | static inline void __free_one_page(struct page *page, | 421 | static inline void __free_one_page(struct page *page, |
422 | struct zone *zone, unsigned int order) | 422 | struct zone *zone, unsigned int order) |
423 | { | 423 | { |
424 | unsigned long page_idx; | 424 | unsigned long page_idx; |
425 | int order_size = 1 << order; | 425 | int order_size = 1 << order; |
426 | int migratetype = get_pageblock_migratetype(page); | 426 | int migratetype = get_pageblock_migratetype(page); |
427 | 427 | ||
428 | if (unlikely(PageCompound(page))) | 428 | if (unlikely(PageCompound(page))) |
429 | destroy_compound_page(page, order); | 429 | destroy_compound_page(page, order); |
430 | 430 | ||
431 | page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1); | 431 | page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1); |
432 | 432 | ||
433 | VM_BUG_ON(page_idx & (order_size - 1)); | 433 | VM_BUG_ON(page_idx & (order_size - 1)); |
434 | VM_BUG_ON(bad_range(zone, page)); | 434 | VM_BUG_ON(bad_range(zone, page)); |
435 | 435 | ||
436 | __mod_zone_page_state(zone, NR_FREE_PAGES, order_size); | 436 | __mod_zone_page_state(zone, NR_FREE_PAGES, order_size); |
437 | while (order < MAX_ORDER-1) { | 437 | while (order < MAX_ORDER-1) { |
438 | unsigned long combined_idx; | 438 | unsigned long combined_idx; |
439 | struct page *buddy; | 439 | struct page *buddy; |
440 | 440 | ||
441 | buddy = __page_find_buddy(page, page_idx, order); | 441 | buddy = __page_find_buddy(page, page_idx, order); |
442 | if (!page_is_buddy(page, buddy, order)) | 442 | if (!page_is_buddy(page, buddy, order)) |
443 | break; /* Move the buddy up one level. */ | 443 | break; /* Move the buddy up one level. */ |
444 | 444 | ||
445 | list_del(&buddy->lru); | 445 | list_del(&buddy->lru); |
446 | zone->free_area[order].nr_free--; | 446 | zone->free_area[order].nr_free--; |
447 | rmv_page_order(buddy); | 447 | rmv_page_order(buddy); |
448 | combined_idx = __find_combined_index(page_idx, order); | 448 | combined_idx = __find_combined_index(page_idx, order); |
449 | page = page + (combined_idx - page_idx); | 449 | page = page + (combined_idx - page_idx); |
450 | page_idx = combined_idx; | 450 | page_idx = combined_idx; |
451 | order++; | 451 | order++; |
452 | } | 452 | } |
453 | set_page_order(page, order); | 453 | set_page_order(page, order); |
454 | list_add(&page->lru, | 454 | list_add(&page->lru, |
455 | &zone->free_area[order].free_list[migratetype]); | 455 | &zone->free_area[order].free_list[migratetype]); |
456 | zone->free_area[order].nr_free++; | 456 | zone->free_area[order].nr_free++; |
457 | } | 457 | } |
458 | 458 | ||
459 | static inline int free_pages_check(struct page *page) | 459 | static inline int free_pages_check(struct page *page) |
460 | { | 460 | { |
461 | if (unlikely(page_mapcount(page) | | 461 | if (unlikely(page_mapcount(page) | |
462 | (page->mapping != NULL) | | 462 | (page->mapping != NULL) | |
463 | (page_get_page_cgroup(page) != NULL) | | 463 | (page_get_page_cgroup(page) != NULL) | |
464 | (page_count(page) != 0) | | 464 | (page_count(page) != 0) | |
465 | (page->flags & ( | 465 | (page->flags & ( |
466 | 1 << PG_lru | | 466 | 1 << PG_lru | |
467 | 1 << PG_private | | 467 | 1 << PG_private | |
468 | 1 << PG_locked | | 468 | 1 << PG_locked | |
469 | 1 << PG_active | | 469 | 1 << PG_active | |
470 | 1 << PG_slab | | 470 | 1 << PG_slab | |
471 | 1 << PG_swapcache | | 471 | 1 << PG_swapcache | |
472 | 1 << PG_writeback | | 472 | 1 << PG_writeback | |
473 | 1 << PG_reserved | | 473 | 1 << PG_reserved | |
474 | 1 << PG_buddy )))) | 474 | 1 << PG_buddy )))) |
475 | bad_page(page); | 475 | bad_page(page); |
476 | if (PageDirty(page)) | 476 | if (PageDirty(page)) |
477 | __ClearPageDirty(page); | 477 | __ClearPageDirty(page); |
478 | /* | 478 | /* |
479 | * For now, we report if PG_reserved was found set, but do not | 479 | * For now, we report if PG_reserved was found set, but do not |
480 | * clear it, and do not free the page. But we shall soon need | 480 | * clear it, and do not free the page. But we shall soon need |
481 | * to do more, for when the ZERO_PAGE count wraps negative. | 481 | * to do more, for when the ZERO_PAGE count wraps negative. |
482 | */ | 482 | */ |
483 | return PageReserved(page); | 483 | return PageReserved(page); |
484 | } | 484 | } |
485 | 485 | ||
486 | /* | 486 | /* |
487 | * Frees a list of pages. | 487 | * Frees a list of pages. |
488 | * Assumes all pages on list are in same zone, and of same order. | 488 | * Assumes all pages on list are in same zone, and of same order. |
489 | * count is the number of pages to free. | 489 | * count is the number of pages to free. |
490 | * | 490 | * |
491 | * If the zone was previously in an "all pages pinned" state then look to | 491 | * If the zone was previously in an "all pages pinned" state then look to |
492 | * see if this freeing clears that state. | 492 | * see if this freeing clears that state. |
493 | * | 493 | * |
494 | * And clear the zone's pages_scanned counter, to hold off the "all pages are | 494 | * And clear the zone's pages_scanned counter, to hold off the "all pages are |
495 | * pinned" detection logic. | 495 | * pinned" detection logic. |
496 | */ | 496 | */ |
497 | static void free_pages_bulk(struct zone *zone, int count, | 497 | static void free_pages_bulk(struct zone *zone, int count, |
498 | struct list_head *list, int order) | 498 | struct list_head *list, int order) |
499 | { | 499 | { |
500 | spin_lock(&zone->lock); | 500 | spin_lock(&zone->lock); |
501 | zone_clear_flag(zone, ZONE_ALL_UNRECLAIMABLE); | 501 | zone_clear_flag(zone, ZONE_ALL_UNRECLAIMABLE); |
502 | zone->pages_scanned = 0; | 502 | zone->pages_scanned = 0; |
503 | while (count--) { | 503 | while (count--) { |
504 | struct page *page; | 504 | struct page *page; |
505 | 505 | ||
506 | VM_BUG_ON(list_empty(list)); | 506 | VM_BUG_ON(list_empty(list)); |
507 | page = list_entry(list->prev, struct page, lru); | 507 | page = list_entry(list->prev, struct page, lru); |
508 | /* have to delete it as __free_one_page list manipulates */ | 508 | /* have to delete it as __free_one_page list manipulates */ |
509 | list_del(&page->lru); | 509 | list_del(&page->lru); |
510 | __free_one_page(page, zone, order); | 510 | __free_one_page(page, zone, order); |
511 | } | 511 | } |
512 | spin_unlock(&zone->lock); | 512 | spin_unlock(&zone->lock); |
513 | } | 513 | } |
514 | 514 | ||
515 | static void free_one_page(struct zone *zone, struct page *page, int order) | 515 | static void free_one_page(struct zone *zone, struct page *page, int order) |
516 | { | 516 | { |
517 | spin_lock(&zone->lock); | 517 | spin_lock(&zone->lock); |
518 | zone_clear_flag(zone, ZONE_ALL_UNRECLAIMABLE); | 518 | zone_clear_flag(zone, ZONE_ALL_UNRECLAIMABLE); |
519 | zone->pages_scanned = 0; | 519 | zone->pages_scanned = 0; |
520 | __free_one_page(page, zone, order); | 520 | __free_one_page(page, zone, order); |
521 | spin_unlock(&zone->lock); | 521 | spin_unlock(&zone->lock); |
522 | } | 522 | } |
523 | 523 | ||
524 | static void __free_pages_ok(struct page *page, unsigned int order) | 524 | static void __free_pages_ok(struct page *page, unsigned int order) |
525 | { | 525 | { |
526 | unsigned long flags; | 526 | unsigned long flags; |
527 | int i; | 527 | int i; |
528 | int reserved = 0; | 528 | int reserved = 0; |
529 | 529 | ||
530 | for (i = 0 ; i < (1 << order) ; ++i) | 530 | for (i = 0 ; i < (1 << order) ; ++i) |
531 | reserved += free_pages_check(page + i); | 531 | reserved += free_pages_check(page + i); |
532 | if (reserved) | 532 | if (reserved) |
533 | return; | 533 | return; |
534 | 534 | ||
535 | if (!PageHighMem(page)) | 535 | if (!PageHighMem(page)) |
536 | debug_check_no_locks_freed(page_address(page),PAGE_SIZE<<order); | 536 | debug_check_no_locks_freed(page_address(page),PAGE_SIZE<<order); |
537 | arch_free_page(page, order); | 537 | arch_free_page(page, order); |
538 | kernel_map_pages(page, 1 << order, 0); | 538 | kernel_map_pages(page, 1 << order, 0); |
539 | 539 | ||
540 | local_irq_save(flags); | 540 | local_irq_save(flags); |
541 | __count_vm_events(PGFREE, 1 << order); | 541 | __count_vm_events(PGFREE, 1 << order); |
542 | free_one_page(page_zone(page), page, order); | 542 | free_one_page(page_zone(page), page, order); |
543 | local_irq_restore(flags); | 543 | local_irq_restore(flags); |
544 | } | 544 | } |
545 | 545 | ||
546 | /* | 546 | /* |
547 | * permit the bootmem allocator to evade page validation on high-order frees | 547 | * permit the bootmem allocator to evade page validation on high-order frees |
548 | */ | 548 | */ |
549 | void __free_pages_bootmem(struct page *page, unsigned int order) | 549 | void __free_pages_bootmem(struct page *page, unsigned int order) |
550 | { | 550 | { |
551 | if (order == 0) { | 551 | if (order == 0) { |
552 | __ClearPageReserved(page); | 552 | __ClearPageReserved(page); |
553 | set_page_count(page, 0); | 553 | set_page_count(page, 0); |
554 | set_page_refcounted(page); | 554 | set_page_refcounted(page); |
555 | __free_page(page); | 555 | __free_page(page); |
556 | } else { | 556 | } else { |
557 | int loop; | 557 | int loop; |
558 | 558 | ||
559 | prefetchw(page); | 559 | prefetchw(page); |
560 | for (loop = 0; loop < BITS_PER_LONG; loop++) { | 560 | for (loop = 0; loop < BITS_PER_LONG; loop++) { |
561 | struct page *p = &page[loop]; | 561 | struct page *p = &page[loop]; |
562 | 562 | ||
563 | if (loop + 1 < BITS_PER_LONG) | 563 | if (loop + 1 < BITS_PER_LONG) |
564 | prefetchw(p + 1); | 564 | prefetchw(p + 1); |
565 | __ClearPageReserved(p); | 565 | __ClearPageReserved(p); |
566 | set_page_count(p, 0); | 566 | set_page_count(p, 0); |
567 | } | 567 | } |
568 | 568 | ||
569 | set_page_refcounted(page); | 569 | set_page_refcounted(page); |
570 | __free_pages(page, order); | 570 | __free_pages(page, order); |
571 | } | 571 | } |
572 | } | 572 | } |
573 | 573 | ||
574 | 574 | ||
575 | /* | 575 | /* |
576 | * The order of subdivision here is critical for the IO subsystem. | 576 | * The order of subdivision here is critical for the IO subsystem. |
577 | * Please do not alter this order without good reasons and regression | 577 | * Please do not alter this order without good reasons and regression |
578 | * testing. Specifically, as large blocks of memory are subdivided, | 578 | * testing. Specifically, as large blocks of memory are subdivided, |
579 | * the order in which smaller blocks are delivered depends on the order | 579 | * the order in which smaller blocks are delivered depends on the order |
580 | * they're subdivided in this function. This is the primary factor | 580 | * they're subdivided in this function. This is the primary factor |
581 | * influencing the order in which pages are delivered to the IO | 581 | * influencing the order in which pages are delivered to the IO |
582 | * subsystem according to empirical testing, and this is also justified | 582 | * subsystem according to empirical testing, and this is also justified |
583 | * by considering the behavior of a buddy system containing a single | 583 | * by considering the behavior of a buddy system containing a single |
584 | * large block of memory acted on by a series of small allocations. | 584 | * large block of memory acted on by a series of small allocations. |
585 | * This behavior is a critical factor in sglist merging's success. | 585 | * This behavior is a critical factor in sglist merging's success. |
586 | * | 586 | * |
587 | * -- wli | 587 | * -- wli |
588 | */ | 588 | */ |
589 | static inline void expand(struct zone *zone, struct page *page, | 589 | static inline void expand(struct zone *zone, struct page *page, |
590 | int low, int high, struct free_area *area, | 590 | int low, int high, struct free_area *area, |
591 | int migratetype) | 591 | int migratetype) |
592 | { | 592 | { |
593 | unsigned long size = 1 << high; | 593 | unsigned long size = 1 << high; |
594 | 594 | ||
595 | while (high > low) { | 595 | while (high > low) { |
596 | area--; | 596 | area--; |
597 | high--; | 597 | high--; |
598 | size >>= 1; | 598 | size >>= 1; |
599 | VM_BUG_ON(bad_range(zone, &page[size])); | 599 | VM_BUG_ON(bad_range(zone, &page[size])); |
600 | list_add(&page[size].lru, &area->free_list[migratetype]); | 600 | list_add(&page[size].lru, &area->free_list[migratetype]); |
601 | area->nr_free++; | 601 | area->nr_free++; |
602 | set_page_order(&page[size], high); | 602 | set_page_order(&page[size], high); |
603 | } | 603 | } |
604 | } | 604 | } |
605 | 605 | ||
606 | /* | 606 | /* |
607 | * This page is about to be returned from the page allocator | 607 | * This page is about to be returned from the page allocator |
608 | */ | 608 | */ |
609 | static int prep_new_page(struct page *page, int order, gfp_t gfp_flags) | 609 | static int prep_new_page(struct page *page, int order, gfp_t gfp_flags) |
610 | { | 610 | { |
611 | if (unlikely(page_mapcount(page) | | 611 | if (unlikely(page_mapcount(page) | |
612 | (page->mapping != NULL) | | 612 | (page->mapping != NULL) | |
613 | (page_get_page_cgroup(page) != NULL) | | 613 | (page_get_page_cgroup(page) != NULL) | |
614 | (page_count(page) != 0) | | 614 | (page_count(page) != 0) | |
615 | (page->flags & ( | 615 | (page->flags & ( |
616 | 1 << PG_lru | | 616 | 1 << PG_lru | |
617 | 1 << PG_private | | 617 | 1 << PG_private | |
618 | 1 << PG_locked | | 618 | 1 << PG_locked | |
619 | 1 << PG_active | | 619 | 1 << PG_active | |
620 | 1 << PG_dirty | | 620 | 1 << PG_dirty | |
621 | 1 << PG_slab | | 621 | 1 << PG_slab | |
622 | 1 << PG_swapcache | | 622 | 1 << PG_swapcache | |
623 | 1 << PG_writeback | | 623 | 1 << PG_writeback | |
624 | 1 << PG_reserved | | 624 | 1 << PG_reserved | |
625 | 1 << PG_buddy )))) | 625 | 1 << PG_buddy )))) |
626 | bad_page(page); | 626 | bad_page(page); |
627 | 627 | ||
628 | /* | 628 | /* |
629 | * For now, we report if PG_reserved was found set, but do not | 629 | * For now, we report if PG_reserved was found set, but do not |
630 | * clear it, and do not allocate the page: as a safety net. | 630 | * clear it, and do not allocate the page: as a safety net. |
631 | */ | 631 | */ |
632 | if (PageReserved(page)) | 632 | if (PageReserved(page)) |
633 | return 1; | 633 | return 1; |
634 | 634 | ||
635 | page->flags &= ~(1 << PG_uptodate | 1 << PG_error | 1 << PG_reclaim | | 635 | page->flags &= ~(1 << PG_uptodate | 1 << PG_error | 1 << PG_reclaim | |
636 | 1 << PG_referenced | 1 << PG_arch_1 | | 636 | 1 << PG_referenced | 1 << PG_arch_1 | |
637 | 1 << PG_owner_priv_1 | 1 << PG_mappedtodisk); | 637 | 1 << PG_owner_priv_1 | 1 << PG_mappedtodisk); |
638 | set_page_private(page, 0); | 638 | set_page_private(page, 0); |
639 | set_page_refcounted(page); | 639 | set_page_refcounted(page); |
640 | 640 | ||
641 | arch_alloc_page(page, order); | 641 | arch_alloc_page(page, order); |
642 | kernel_map_pages(page, 1 << order, 1); | 642 | kernel_map_pages(page, 1 << order, 1); |
643 | 643 | ||
644 | if (gfp_flags & __GFP_ZERO) | 644 | if (gfp_flags & __GFP_ZERO) |
645 | prep_zero_page(page, order, gfp_flags); | 645 | prep_zero_page(page, order, gfp_flags); |
646 | 646 | ||
647 | if (order && (gfp_flags & __GFP_COMP)) | 647 | if (order && (gfp_flags & __GFP_COMP)) |
648 | prep_compound_page(page, order); | 648 | prep_compound_page(page, order); |
649 | 649 | ||
650 | return 0; | 650 | return 0; |
651 | } | 651 | } |
652 | 652 | ||
653 | /* | 653 | /* |
654 | * Go through the free lists for the given migratetype and remove | 654 | * Go through the free lists for the given migratetype and remove |
655 | * the smallest available page from the freelists | 655 | * the smallest available page from the freelists |
656 | */ | 656 | */ |
657 | static struct page *__rmqueue_smallest(struct zone *zone, unsigned int order, | 657 | static struct page *__rmqueue_smallest(struct zone *zone, unsigned int order, |
658 | int migratetype) | 658 | int migratetype) |
659 | { | 659 | { |
660 | unsigned int current_order; | 660 | unsigned int current_order; |
661 | struct free_area * area; | 661 | struct free_area * area; |
662 | struct page *page; | 662 | struct page *page; |
663 | 663 | ||
664 | /* Find a page of the appropriate size in the preferred list */ | 664 | /* Find a page of the appropriate size in the preferred list */ |
665 | for (current_order = order; current_order < MAX_ORDER; ++current_order) { | 665 | for (current_order = order; current_order < MAX_ORDER; ++current_order) { |
666 | area = &(zone->free_area[current_order]); | 666 | area = &(zone->free_area[current_order]); |
667 | if (list_empty(&area->free_list[migratetype])) | 667 | if (list_empty(&area->free_list[migratetype])) |
668 | continue; | 668 | continue; |
669 | 669 | ||
670 | page = list_entry(area->free_list[migratetype].next, | 670 | page = list_entry(area->free_list[migratetype].next, |
671 | struct page, lru); | 671 | struct page, lru); |
672 | list_del(&page->lru); | 672 | list_del(&page->lru); |
673 | rmv_page_order(page); | 673 | rmv_page_order(page); |
674 | area->nr_free--; | 674 | area->nr_free--; |
675 | __mod_zone_page_state(zone, NR_FREE_PAGES, - (1UL << order)); | 675 | __mod_zone_page_state(zone, NR_FREE_PAGES, - (1UL << order)); |
676 | expand(zone, page, order, current_order, area, migratetype); | 676 | expand(zone, page, order, current_order, area, migratetype); |
677 | return page; | 677 | return page; |
678 | } | 678 | } |
679 | 679 | ||
680 | return NULL; | 680 | return NULL; |
681 | } | 681 | } |
682 | 682 | ||
683 | 683 | ||
684 | /* | 684 | /* |
685 | * This array describes the order lists are fallen back to when | 685 | * This array describes the order lists are fallen back to when |
686 | * the free lists for the desirable migrate type are depleted | 686 | * the free lists for the desirable migrate type are depleted |
687 | */ | 687 | */ |
688 | static int fallbacks[MIGRATE_TYPES][MIGRATE_TYPES-1] = { | 688 | static int fallbacks[MIGRATE_TYPES][MIGRATE_TYPES-1] = { |
689 | [MIGRATE_UNMOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE, MIGRATE_RESERVE }, | 689 | [MIGRATE_UNMOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE, MIGRATE_RESERVE }, |
690 | [MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, MIGRATE_RESERVE }, | 690 | [MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, MIGRATE_RESERVE }, |
691 | [MIGRATE_MOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_RESERVE }, | 691 | [MIGRATE_MOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_RESERVE }, |
692 | [MIGRATE_RESERVE] = { MIGRATE_RESERVE, MIGRATE_RESERVE, MIGRATE_RESERVE }, /* Never used */ | 692 | [MIGRATE_RESERVE] = { MIGRATE_RESERVE, MIGRATE_RESERVE, MIGRATE_RESERVE }, /* Never used */ |
693 | }; | 693 | }; |
694 | 694 | ||
695 | /* | 695 | /* |
696 | * Move the free pages in a range to the free lists of the requested type. | 696 | * Move the free pages in a range to the free lists of the requested type. |
697 | * Note that start_page and end_pages are not aligned on a pageblock | 697 | * Note that start_page and end_pages are not aligned on a pageblock |
698 | * boundary. If alignment is required, use move_freepages_block() | 698 | * boundary. If alignment is required, use move_freepages_block() |
699 | */ | 699 | */ |
700 | int move_freepages(struct zone *zone, | 700 | int move_freepages(struct zone *zone, |
701 | struct page *start_page, struct page *end_page, | 701 | struct page *start_page, struct page *end_page, |
702 | int migratetype) | 702 | int migratetype) |
703 | { | 703 | { |
704 | struct page *page; | 704 | struct page *page; |
705 | unsigned long order; | 705 | unsigned long order; |
706 | int pages_moved = 0; | 706 | int pages_moved = 0; |
707 | 707 | ||
708 | #ifndef CONFIG_HOLES_IN_ZONE | 708 | #ifndef CONFIG_HOLES_IN_ZONE |
709 | /* | 709 | /* |
710 | * page_zone is not safe to call in this context when | 710 | * page_zone is not safe to call in this context when |
711 | * CONFIG_HOLES_IN_ZONE is set. This bug check is probably redundant | 711 | * CONFIG_HOLES_IN_ZONE is set. This bug check is probably redundant |
712 | * anyway as we check zone boundaries in move_freepages_block(). | 712 | * anyway as we check zone boundaries in move_freepages_block(). |
713 | * Remove at a later date when no bug reports exist related to | 713 | * Remove at a later date when no bug reports exist related to |
714 | * grouping pages by mobility | 714 | * grouping pages by mobility |
715 | */ | 715 | */ |
716 | BUG_ON(page_zone(start_page) != page_zone(end_page)); | 716 | BUG_ON(page_zone(start_page) != page_zone(end_page)); |
717 | #endif | 717 | #endif |
718 | 718 | ||
719 | for (page = start_page; page <= end_page;) { | 719 | for (page = start_page; page <= end_page;) { |
720 | if (!pfn_valid_within(page_to_pfn(page))) { | 720 | if (!pfn_valid_within(page_to_pfn(page))) { |
721 | page++; | 721 | page++; |
722 | continue; | 722 | continue; |
723 | } | 723 | } |
724 | 724 | ||
725 | if (!PageBuddy(page)) { | 725 | if (!PageBuddy(page)) { |
726 | page++; | 726 | page++; |
727 | continue; | 727 | continue; |
728 | } | 728 | } |
729 | 729 | ||
730 | order = page_order(page); | 730 | order = page_order(page); |
731 | list_del(&page->lru); | 731 | list_del(&page->lru); |
732 | list_add(&page->lru, | 732 | list_add(&page->lru, |
733 | &zone->free_area[order].free_list[migratetype]); | 733 | &zone->free_area[order].free_list[migratetype]); |
734 | page += 1 << order; | 734 | page += 1 << order; |
735 | pages_moved += 1 << order; | 735 | pages_moved += 1 << order; |
736 | } | 736 | } |
737 | 737 | ||
738 | return pages_moved; | 738 | return pages_moved; |
739 | } | 739 | } |
740 | 740 | ||
741 | int move_freepages_block(struct zone *zone, struct page *page, int migratetype) | 741 | int move_freepages_block(struct zone *zone, struct page *page, int migratetype) |
742 | { | 742 | { |
743 | unsigned long start_pfn, end_pfn; | 743 | unsigned long start_pfn, end_pfn; |
744 | struct page *start_page, *end_page; | 744 | struct page *start_page, *end_page; |
745 | 745 | ||
746 | start_pfn = page_to_pfn(page); | 746 | start_pfn = page_to_pfn(page); |
747 | start_pfn = start_pfn & ~(pageblock_nr_pages-1); | 747 | start_pfn = start_pfn & ~(pageblock_nr_pages-1); |
748 | start_page = pfn_to_page(start_pfn); | 748 | start_page = pfn_to_page(start_pfn); |
749 | end_page = start_page + pageblock_nr_pages - 1; | 749 | end_page = start_page + pageblock_nr_pages - 1; |
750 | end_pfn = start_pfn + pageblock_nr_pages - 1; | 750 | end_pfn = start_pfn + pageblock_nr_pages - 1; |
751 | 751 | ||
752 | /* Do not cross zone boundaries */ | 752 | /* Do not cross zone boundaries */ |
753 | if (start_pfn < zone->zone_start_pfn) | 753 | if (start_pfn < zone->zone_start_pfn) |
754 | start_page = page; | 754 | start_page = page; |
755 | if (end_pfn >= zone->zone_start_pfn + zone->spanned_pages) | 755 | if (end_pfn >= zone->zone_start_pfn + zone->spanned_pages) |
756 | return 0; | 756 | return 0; |
757 | 757 | ||
758 | return move_freepages(zone, start_page, end_page, migratetype); | 758 | return move_freepages(zone, start_page, end_page, migratetype); |
759 | } | 759 | } |
760 | 760 | ||
761 | /* Remove an element from the buddy allocator from the fallback list */ | 761 | /* Remove an element from the buddy allocator from the fallback list */ |
762 | static struct page *__rmqueue_fallback(struct zone *zone, int order, | 762 | static struct page *__rmqueue_fallback(struct zone *zone, int order, |
763 | int start_migratetype) | 763 | int start_migratetype) |
764 | { | 764 | { |
765 | struct free_area * area; | 765 | struct free_area * area; |
766 | int current_order; | 766 | int current_order; |
767 | struct page *page; | 767 | struct page *page; |
768 | int migratetype, i; | 768 | int migratetype, i; |
769 | 769 | ||
770 | /* Find the largest possible block of pages in the other list */ | 770 | /* Find the largest possible block of pages in the other list */ |
771 | for (current_order = MAX_ORDER-1; current_order >= order; | 771 | for (current_order = MAX_ORDER-1; current_order >= order; |
772 | --current_order) { | 772 | --current_order) { |
773 | for (i = 0; i < MIGRATE_TYPES - 1; i++) { | 773 | for (i = 0; i < MIGRATE_TYPES - 1; i++) { |
774 | migratetype = fallbacks[start_migratetype][i]; | 774 | migratetype = fallbacks[start_migratetype][i]; |
775 | 775 | ||
776 | /* MIGRATE_RESERVE handled later if necessary */ | 776 | /* MIGRATE_RESERVE handled later if necessary */ |
777 | if (migratetype == MIGRATE_RESERVE) | 777 | if (migratetype == MIGRATE_RESERVE) |
778 | continue; | 778 | continue; |
779 | 779 | ||
780 | area = &(zone->free_area[current_order]); | 780 | area = &(zone->free_area[current_order]); |
781 | if (list_empty(&area->free_list[migratetype])) | 781 | if (list_empty(&area->free_list[migratetype])) |
782 | continue; | 782 | continue; |
783 | 783 | ||
784 | page = list_entry(area->free_list[migratetype].next, | 784 | page = list_entry(area->free_list[migratetype].next, |
785 | struct page, lru); | 785 | struct page, lru); |
786 | area->nr_free--; | 786 | area->nr_free--; |
787 | 787 | ||
788 | /* | 788 | /* |
789 | * If breaking a large block of pages, move all free | 789 | * If breaking a large block of pages, move all free |
790 | * pages to the preferred allocation list. If falling | 790 | * pages to the preferred allocation list. If falling |
791 | * back for a reclaimable kernel allocation, be more | 791 | * back for a reclaimable kernel allocation, be more |
792 | * aggressive about taking ownership of free pages | 792 | * aggressive about taking ownership of free pages |
793 | */ | 793 | */ |
794 | if (unlikely(current_order >= (pageblock_order >> 1)) || | 794 | if (unlikely(current_order >= (pageblock_order >> 1)) || |
795 | start_migratetype == MIGRATE_RECLAIMABLE) { | 795 | start_migratetype == MIGRATE_RECLAIMABLE) { |
796 | unsigned long pages; | 796 | unsigned long pages; |
797 | pages = move_freepages_block(zone, page, | 797 | pages = move_freepages_block(zone, page, |
798 | start_migratetype); | 798 | start_migratetype); |
799 | 799 | ||
800 | /* Claim the whole block if over half of it is free */ | 800 | /* Claim the whole block if over half of it is free */ |
801 | if (pages >= (1 << (pageblock_order-1))) | 801 | if (pages >= (1 << (pageblock_order-1))) |
802 | set_pageblock_migratetype(page, | 802 | set_pageblock_migratetype(page, |
803 | start_migratetype); | 803 | start_migratetype); |
804 | 804 | ||
805 | migratetype = start_migratetype; | 805 | migratetype = start_migratetype; |
806 | } | 806 | } |
807 | 807 | ||
808 | /* Remove the page from the freelists */ | 808 | /* Remove the page from the freelists */ |
809 | list_del(&page->lru); | 809 | list_del(&page->lru); |
810 | rmv_page_order(page); | 810 | rmv_page_order(page); |
811 | __mod_zone_page_state(zone, NR_FREE_PAGES, | 811 | __mod_zone_page_state(zone, NR_FREE_PAGES, |
812 | -(1UL << order)); | 812 | -(1UL << order)); |
813 | 813 | ||
814 | if (current_order == pageblock_order) | 814 | if (current_order == pageblock_order) |
815 | set_pageblock_migratetype(page, | 815 | set_pageblock_migratetype(page, |
816 | start_migratetype); | 816 | start_migratetype); |
817 | 817 | ||
818 | expand(zone, page, order, current_order, area, migratetype); | 818 | expand(zone, page, order, current_order, area, migratetype); |
819 | return page; | 819 | return page; |
820 | } | 820 | } |
821 | } | 821 | } |
822 | 822 | ||
823 | /* Use MIGRATE_RESERVE rather than fail an allocation */ | 823 | /* Use MIGRATE_RESERVE rather than fail an allocation */ |
824 | return __rmqueue_smallest(zone, order, MIGRATE_RESERVE); | 824 | return __rmqueue_smallest(zone, order, MIGRATE_RESERVE); |
825 | } | 825 | } |
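The "claim the whole block if over half of it is free" test above is a plain half-block threshold. A small illustration, not part of the diff; PAGEBLOCK_ORDER stands in for the kernel's pageblock_order and is assumed to be 9 (a 512-page block).

#include <stdio.h>

#define PAGEBLOCK_ORDER 9UL     /* stand-in for pageblock_order: 512-page block */

int main(void)
{
        unsigned long pages_moved = 300;        /* as returned by move_freepages_block() */
        int claim = pages_moved >= (1UL << (PAGEBLOCK_ORDER - 1));

        printf("moved %lu of %lu pages -> claim whole block: %s\n",
               pages_moved, 1UL << PAGEBLOCK_ORDER, claim ? "yes" : "no");
        return 0;
}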
826 | 826 | ||
827 | /* | 827 | /* |
828 | * Do the hard work of removing an element from the buddy allocator. | 828 | * Do the hard work of removing an element from the buddy allocator. |
829 | * Call me with the zone->lock already held. | 829 | * Call me with the zone->lock already held. |
830 | */ | 830 | */ |
831 | static struct page *__rmqueue(struct zone *zone, unsigned int order, | 831 | static struct page *__rmqueue(struct zone *zone, unsigned int order, |
832 | int migratetype) | 832 | int migratetype) |
833 | { | 833 | { |
834 | struct page *page; | 834 | struct page *page; |
835 | 835 | ||
836 | page = __rmqueue_smallest(zone, order, migratetype); | 836 | page = __rmqueue_smallest(zone, order, migratetype); |
837 | 837 | ||
838 | if (unlikely(!page)) | 838 | if (unlikely(!page)) |
839 | page = __rmqueue_fallback(zone, order, migratetype); | 839 | page = __rmqueue_fallback(zone, order, migratetype); |
840 | 840 | ||
841 | return page; | 841 | return page; |
842 | } | 842 | } |
843 | 843 | ||
844 | /* | 844 | /* |
845 | * Obtain a specified number of elements from the buddy allocator, all under | 845 | * Obtain a specified number of elements from the buddy allocator, all under |
846 | * a single hold of the lock, for efficiency. Add them to the supplied list. | 846 | * a single hold of the lock, for efficiency. Add them to the supplied list. |
847 | * Returns the number of new pages which were placed at *list. | 847 | * Returns the number of new pages which were placed at *list. |
848 | */ | 848 | */ |
849 | static int rmqueue_bulk(struct zone *zone, unsigned int order, | 849 | static int rmqueue_bulk(struct zone *zone, unsigned int order, |
850 | unsigned long count, struct list_head *list, | 850 | unsigned long count, struct list_head *list, |
851 | int migratetype) | 851 | int migratetype) |
852 | { | 852 | { |
853 | int i; | 853 | int i; |
854 | 854 | ||
855 | spin_lock(&zone->lock); | 855 | spin_lock(&zone->lock); |
856 | for (i = 0; i < count; ++i) { | 856 | for (i = 0; i < count; ++i) { |
857 | struct page *page = __rmqueue(zone, order, migratetype); | 857 | struct page *page = __rmqueue(zone, order, migratetype); |
858 | if (unlikely(page == NULL)) | 858 | if (unlikely(page == NULL)) |
859 | break; | 859 | break; |
860 | 860 | ||
861 | /* | 861 | /* |
862 | * Split buddy pages returned by expand() are received here | 862 | * Split buddy pages returned by expand() are received here |
863 | * in physical page order. The page is added to the caller's | 863 | * in physical page order. The page is added to the caller's |
864 | * list and the list head then moves forward. From the caller's | 864 | * list and the list head then moves forward. From the caller's |
865 | * perspective, the linked list is ordered by page number in | 865 | * perspective, the linked list is ordered by page number in |
866 | * some conditions. This is useful for IO devices that can | 866 | * some conditions. This is useful for IO devices that can |
867 | * merge IO requests if the physical pages are ordered | 867 | * merge IO requests if the physical pages are ordered |
868 | * properly. | 868 | * properly. |
869 | */ | 869 | */ |
870 | list_add(&page->lru, list); | 870 | list_add(&page->lru, list); |
871 | set_page_private(page, migratetype); | 871 | set_page_private(page, migratetype); |
872 | list = &page->lru; | 872 | list = &page->lru; |
873 | } | 873 | } |
874 | spin_unlock(&zone->lock); | 874 | spin_unlock(&zone->lock); |
875 | return i; | 875 | return i; |
876 | } | 876 | } |
877 | 877 | ||
878 | #ifdef CONFIG_NUMA | 878 | #ifdef CONFIG_NUMA |
879 | /* | 879 | /* |
880 | * Called from the vmstat counter updater to drain pagesets of this | 880 | * Called from the vmstat counter updater to drain pagesets of this |
881 | * currently executing processor on remote nodes after they have | 881 | * currently executing processor on remote nodes after they have |
882 | * expired. | 882 | * expired. |
883 | * | 883 | * |
884 | * Note that this function must be called with the thread pinned to | 884 | * Note that this function must be called with the thread pinned to |
885 | * a single processor. | 885 | * a single processor. |
886 | */ | 886 | */ |
887 | void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp) | 887 | void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp) |
888 | { | 888 | { |
889 | unsigned long flags; | 889 | unsigned long flags; |
890 | int to_drain; | 890 | int to_drain; |
891 | 891 | ||
892 | local_irq_save(flags); | 892 | local_irq_save(flags); |
893 | if (pcp->count >= pcp->batch) | 893 | if (pcp->count >= pcp->batch) |
894 | to_drain = pcp->batch; | 894 | to_drain = pcp->batch; |
895 | else | 895 | else |
896 | to_drain = pcp->count; | 896 | to_drain = pcp->count; |
897 | free_pages_bulk(zone, to_drain, &pcp->list, 0); | 897 | free_pages_bulk(zone, to_drain, &pcp->list, 0); |
898 | pcp->count -= to_drain; | 898 | pcp->count -= to_drain; |
899 | local_irq_restore(flags); | 899 | local_irq_restore(flags); |
900 | } | 900 | } |
901 | #endif | 901 | #endif |
902 | 902 | ||
903 | /* | 903 | /* |
904 | * Drain pages of the indicated processor. | 904 | * Drain pages of the indicated processor. |
905 | * | 905 | * |
906 | * The processor must either be the current processor and the | 906 | * The processor must either be the current processor and the |
907 | * thread pinned to the current processor or a processor that | 907 | * thread pinned to the current processor or a processor that |
908 | * is not online. | 908 | * is not online. |
909 | */ | 909 | */ |
910 | static void drain_pages(unsigned int cpu) | 910 | static void drain_pages(unsigned int cpu) |
911 | { | 911 | { |
912 | unsigned long flags; | 912 | unsigned long flags; |
913 | struct zone *zone; | 913 | struct zone *zone; |
914 | 914 | ||
915 | for_each_zone(zone) { | 915 | for_each_zone(zone) { |
916 | struct per_cpu_pageset *pset; | 916 | struct per_cpu_pageset *pset; |
917 | struct per_cpu_pages *pcp; | 917 | struct per_cpu_pages *pcp; |
918 | 918 | ||
919 | if (!populated_zone(zone)) | 919 | if (!populated_zone(zone)) |
920 | continue; | 920 | continue; |
921 | 921 | ||
922 | pset = zone_pcp(zone, cpu); | 922 | pset = zone_pcp(zone, cpu); |
923 | 923 | ||
924 | pcp = &pset->pcp; | 924 | pcp = &pset->pcp; |
925 | local_irq_save(flags); | 925 | local_irq_save(flags); |
926 | free_pages_bulk(zone, pcp->count, &pcp->list, 0); | 926 | free_pages_bulk(zone, pcp->count, &pcp->list, 0); |
927 | pcp->count = 0; | 927 | pcp->count = 0; |
928 | local_irq_restore(flags); | 928 | local_irq_restore(flags); |
929 | } | 929 | } |
930 | } | 930 | } |
931 | 931 | ||
932 | /* | 932 | /* |
933 | * Spill all of this CPU's per-cpu pages back into the buddy allocator. | 933 | * Spill all of this CPU's per-cpu pages back into the buddy allocator. |
934 | */ | 934 | */ |
935 | void drain_local_pages(void *arg) | 935 | void drain_local_pages(void *arg) |
936 | { | 936 | { |
937 | drain_pages(smp_processor_id()); | 937 | drain_pages(smp_processor_id()); |
938 | } | 938 | } |
939 | 939 | ||
940 | /* | 940 | /* |
941 | * Spill all the per-cpu pages from all CPUs back into the buddy allocator | 941 | * Spill all the per-cpu pages from all CPUs back into the buddy allocator |
942 | */ | 942 | */ |
943 | void drain_all_pages(void) | 943 | void drain_all_pages(void) |
944 | { | 944 | { |
945 | on_each_cpu(drain_local_pages, NULL, 0, 1); | 945 | on_each_cpu(drain_local_pages, NULL, 0, 1); |
946 | } | 946 | } |
947 | 947 | ||
948 | #ifdef CONFIG_HIBERNATION | 948 | #ifdef CONFIG_HIBERNATION |
949 | 949 | ||
950 | void mark_free_pages(struct zone *zone) | 950 | void mark_free_pages(struct zone *zone) |
951 | { | 951 | { |
952 | unsigned long pfn, max_zone_pfn; | 952 | unsigned long pfn, max_zone_pfn; |
953 | unsigned long flags; | 953 | unsigned long flags; |
954 | int order, t; | 954 | int order, t; |
955 | struct list_head *curr; | 955 | struct list_head *curr; |
956 | 956 | ||
957 | if (!zone->spanned_pages) | 957 | if (!zone->spanned_pages) |
958 | return; | 958 | return; |
959 | 959 | ||
960 | spin_lock_irqsave(&zone->lock, flags); | 960 | spin_lock_irqsave(&zone->lock, flags); |
961 | 961 | ||
962 | max_zone_pfn = zone->zone_start_pfn + zone->spanned_pages; | 962 | max_zone_pfn = zone->zone_start_pfn + zone->spanned_pages; |
963 | for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) | 963 | for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) |
964 | if (pfn_valid(pfn)) { | 964 | if (pfn_valid(pfn)) { |
965 | struct page *page = pfn_to_page(pfn); | 965 | struct page *page = pfn_to_page(pfn); |
966 | 966 | ||
967 | if (!swsusp_page_is_forbidden(page)) | 967 | if (!swsusp_page_is_forbidden(page)) |
968 | swsusp_unset_page_free(page); | 968 | swsusp_unset_page_free(page); |
969 | } | 969 | } |
970 | 970 | ||
971 | for_each_migratetype_order(order, t) { | 971 | for_each_migratetype_order(order, t) { |
972 | list_for_each(curr, &zone->free_area[order].free_list[t]) { | 972 | list_for_each(curr, &zone->free_area[order].free_list[t]) { |
973 | unsigned long i; | 973 | unsigned long i; |
974 | 974 | ||
975 | pfn = page_to_pfn(list_entry(curr, struct page, lru)); | 975 | pfn = page_to_pfn(list_entry(curr, struct page, lru)); |
976 | for (i = 0; i < (1UL << order); i++) | 976 | for (i = 0; i < (1UL << order); i++) |
977 | swsusp_set_page_free(pfn_to_page(pfn + i)); | 977 | swsusp_set_page_free(pfn_to_page(pfn + i)); |
978 | } | 978 | } |
979 | } | 979 | } |
980 | spin_unlock_irqrestore(&zone->lock, flags); | 980 | spin_unlock_irqrestore(&zone->lock, flags); |
981 | } | 981 | } |
982 | #endif /* CONFIG_PM */ | 982 | #endif /* CONFIG_PM */ |
983 | 983 | ||
984 | /* | 984 | /* |
985 | * Free a 0-order page | 985 | * Free a 0-order page |
986 | */ | 986 | */ |
987 | static void free_hot_cold_page(struct page *page, int cold) | 987 | static void free_hot_cold_page(struct page *page, int cold) |
988 | { | 988 | { |
989 | struct zone *zone = page_zone(page); | 989 | struct zone *zone = page_zone(page); |
990 | struct per_cpu_pages *pcp; | 990 | struct per_cpu_pages *pcp; |
991 | unsigned long flags; | 991 | unsigned long flags; |
992 | 992 | ||
993 | if (PageAnon(page)) | 993 | if (PageAnon(page)) |
994 | page->mapping = NULL; | 994 | page->mapping = NULL; |
995 | if (free_pages_check(page)) | 995 | if (free_pages_check(page)) |
996 | return; | 996 | return; |
997 | 997 | ||
998 | if (!PageHighMem(page)) | 998 | if (!PageHighMem(page)) |
999 | debug_check_no_locks_freed(page_address(page), PAGE_SIZE); | 999 | debug_check_no_locks_freed(page_address(page), PAGE_SIZE); |
1000 | arch_free_page(page, 0); | 1000 | arch_free_page(page, 0); |
1001 | kernel_map_pages(page, 1, 0); | 1001 | kernel_map_pages(page, 1, 0); |
1002 | 1002 | ||
1003 | pcp = &zone_pcp(zone, get_cpu())->pcp; | 1003 | pcp = &zone_pcp(zone, get_cpu())->pcp; |
1004 | local_irq_save(flags); | 1004 | local_irq_save(flags); |
1005 | __count_vm_event(PGFREE); | 1005 | __count_vm_event(PGFREE); |
1006 | if (cold) | 1006 | if (cold) |
1007 | list_add_tail(&page->lru, &pcp->list); | 1007 | list_add_tail(&page->lru, &pcp->list); |
1008 | else | 1008 | else |
1009 | list_add(&page->lru, &pcp->list); | 1009 | list_add(&page->lru, &pcp->list); |
1010 | set_page_private(page, get_pageblock_migratetype(page)); | 1010 | set_page_private(page, get_pageblock_migratetype(page)); |
1011 | pcp->count++; | 1011 | pcp->count++; |
1012 | if (pcp->count >= pcp->high) { | 1012 | if (pcp->count >= pcp->high) { |
1013 | free_pages_bulk(zone, pcp->batch, &pcp->list, 0); | 1013 | free_pages_bulk(zone, pcp->batch, &pcp->list, 0); |
1014 | pcp->count -= pcp->batch; | 1014 | pcp->count -= pcp->batch; |
1015 | } | 1015 | } |
1016 | local_irq_restore(flags); | 1016 | local_irq_restore(flags); |
1017 | put_cpu(); | 1017 | put_cpu(); |
1018 | } | 1018 | } |
1019 | 1019 | ||
1020 | void free_hot_page(struct page *page) | 1020 | void free_hot_page(struct page *page) |
1021 | { | 1021 | { |
1022 | free_hot_cold_page(page, 0); | 1022 | free_hot_cold_page(page, 0); |
1023 | } | 1023 | } |
1024 | 1024 | ||
1025 | void free_cold_page(struct page *page) | 1025 | void free_cold_page(struct page *page) |
1026 | { | 1026 | { |
1027 | free_hot_cold_page(page, 1); | 1027 | free_hot_cold_page(page, 1); |
1028 | } | 1028 | } |
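free_hot_cold_page() above keeps the per-cpu list ordered by cache temperature: cold pages are queued at the tail, hot pages at the head, so cache-hot pages are handed out again first. A self-contained sketch of that ordering, not part of the diff; a hand-rolled circular doubly linked list stands in for the kernel's list_head.

#include <stdio.h>

struct node {
        struct node *prev, *next;
        int pfn;
};

/* add at the head: the "hot" end, taken first for ordinary allocations */
static void list_add_head(struct node *head, struct node *n)
{
        n->next = head->next;
        n->prev = head;
        head->next->prev = n;
        head->next = n;
}

/* add at the tail: the "cold" end, taken for __GFP_COLD allocations */
static void list_add_tail(struct node *head, struct node *n)
{
        n->prev = head->prev;
        n->next = head;
        head->prev->next = n;
        head->prev = n;
}

int main(void)
{
        struct node head = { &head, &head, -1 };
        struct node hot = { 0 }, cold = { 0 };

        hot.pfn = 100;
        cold.pfn = 200;
        list_add_tail(&head, &cold);    /* freed cold: reused last */
        list_add_head(&head, &hot);     /* freed hot: reused first */

        printf("next hot allocation gets pfn %d, next cold one gets pfn %d\n",
               head.next->pfn, head.prev->pfn);
        return 0;
}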
1029 | 1029 | ||
1030 | /* | 1030 | /* |
1031 | * split_page takes a non-compound higher-order page, and splits it into | 1031 | * split_page takes a non-compound higher-order page, and splits it into |
1032 | * n (1<<order) sub-pages: page[0..n] | 1032 | * n (1<<order) sub-pages: page[0..n] |
1033 | * Each sub-page must be freed individually. | 1033 | * Each sub-page must be freed individually. |
1034 | * | 1034 | * |
1035 | * Note: this is probably too low level an operation for use in drivers. | 1035 | * Note: this is probably too low level an operation for use in drivers. |
1036 | * Please consult with lkml before using this in your driver. | 1036 | * Please consult with lkml before using this in your driver. |
1037 | */ | 1037 | */ |
1038 | void split_page(struct page *page, unsigned int order) | 1038 | void split_page(struct page *page, unsigned int order) |
1039 | { | 1039 | { |
1040 | int i; | 1040 | int i; |
1041 | 1041 | ||
1042 | VM_BUG_ON(PageCompound(page)); | 1042 | VM_BUG_ON(PageCompound(page)); |
1043 | VM_BUG_ON(!page_count(page)); | 1043 | VM_BUG_ON(!page_count(page)); |
1044 | for (i = 1; i < (1 << order); i++) | 1044 | for (i = 1; i < (1 << order); i++) |
1045 | set_page_refcounted(page + i); | 1045 | set_page_refcounted(page + i); |
1046 | } | 1046 | } |
1047 | 1047 | ||
1048 | /* | 1048 | /* |
1049 | * Really, prep_compound_page() should be called from __rmqueue_bulk(). But | 1049 | * Really, prep_compound_page() should be called from __rmqueue_bulk(). But |
1050 | * we cheat by calling it from here, in the order > 0 path. Saves a branch | 1050 | * we cheat by calling it from here, in the order > 0 path. Saves a branch |
1051 | * or two. | 1051 | * or two. |
1052 | */ | 1052 | */ |
1053 | static struct page *buffered_rmqueue(struct zone *preferred_zone, | 1053 | static struct page *buffered_rmqueue(struct zone *preferred_zone, |
1054 | struct zone *zone, int order, gfp_t gfp_flags) | 1054 | struct zone *zone, int order, gfp_t gfp_flags) |
1055 | { | 1055 | { |
1056 | unsigned long flags; | 1056 | unsigned long flags; |
1057 | struct page *page; | 1057 | struct page *page; |
1058 | int cold = !!(gfp_flags & __GFP_COLD); | 1058 | int cold = !!(gfp_flags & __GFP_COLD); |
1059 | int cpu; | 1059 | int cpu; |
1060 | int migratetype = allocflags_to_migratetype(gfp_flags); | 1060 | int migratetype = allocflags_to_migratetype(gfp_flags); |
1061 | 1061 | ||
1062 | again: | 1062 | again: |
1063 | cpu = get_cpu(); | 1063 | cpu = get_cpu(); |
1064 | if (likely(order == 0)) { | 1064 | if (likely(order == 0)) { |
1065 | struct per_cpu_pages *pcp; | 1065 | struct per_cpu_pages *pcp; |
1066 | 1066 | ||
1067 | pcp = &zone_pcp(zone, cpu)->pcp; | 1067 | pcp = &zone_pcp(zone, cpu)->pcp; |
1068 | local_irq_save(flags); | 1068 | local_irq_save(flags); |
1069 | if (!pcp->count) { | 1069 | if (!pcp->count) { |
1070 | pcp->count = rmqueue_bulk(zone, 0, | 1070 | pcp->count = rmqueue_bulk(zone, 0, |
1071 | pcp->batch, &pcp->list, migratetype); | 1071 | pcp->batch, &pcp->list, migratetype); |
1072 | if (unlikely(!pcp->count)) | 1072 | if (unlikely(!pcp->count)) |
1073 | goto failed; | 1073 | goto failed; |
1074 | } | 1074 | } |
1075 | 1075 | ||
1076 | /* Find a page of the appropriate migrate type */ | 1076 | /* Find a page of the appropriate migrate type */ |
1077 | if (cold) { | 1077 | if (cold) { |
1078 | list_for_each_entry_reverse(page, &pcp->list, lru) | 1078 | list_for_each_entry_reverse(page, &pcp->list, lru) |
1079 | if (page_private(page) == migratetype) | 1079 | if (page_private(page) == migratetype) |
1080 | break; | 1080 | break; |
1081 | } else { | 1081 | } else { |
1082 | list_for_each_entry(page, &pcp->list, lru) | 1082 | list_for_each_entry(page, &pcp->list, lru) |
1083 | if (page_private(page) == migratetype) | 1083 | if (page_private(page) == migratetype) |
1084 | break; | 1084 | break; |
1085 | } | 1085 | } |
1086 | 1086 | ||
1087 | /* Allocate more to the pcp list if necessary */ | 1087 | /* Allocate more to the pcp list if necessary */ |
1088 | if (unlikely(&page->lru == &pcp->list)) { | 1088 | if (unlikely(&page->lru == &pcp->list)) { |
1089 | pcp->count += rmqueue_bulk(zone, 0, | 1089 | pcp->count += rmqueue_bulk(zone, 0, |
1090 | pcp->batch, &pcp->list, migratetype); | 1090 | pcp->batch, &pcp->list, migratetype); |
1091 | page = list_entry(pcp->list.next, struct page, lru); | 1091 | page = list_entry(pcp->list.next, struct page, lru); |
1092 | } | 1092 | } |
1093 | 1093 | ||
1094 | list_del(&page->lru); | 1094 | list_del(&page->lru); |
1095 | pcp->count--; | 1095 | pcp->count--; |
1096 | } else { | 1096 | } else { |
1097 | spin_lock_irqsave(&zone->lock, flags); | 1097 | spin_lock_irqsave(&zone->lock, flags); |
1098 | page = __rmqueue(zone, order, migratetype); | 1098 | page = __rmqueue(zone, order, migratetype); |
1099 | spin_unlock(&zone->lock); | 1099 | spin_unlock(&zone->lock); |
1100 | if (!page) | 1100 | if (!page) |
1101 | goto failed; | 1101 | goto failed; |
1102 | } | 1102 | } |
1103 | 1103 | ||
1104 | __count_zone_vm_events(PGALLOC, zone, 1 << order); | 1104 | __count_zone_vm_events(PGALLOC, zone, 1 << order); |
1105 | zone_statistics(preferred_zone, zone); | 1105 | zone_statistics(preferred_zone, zone); |
1106 | local_irq_restore(flags); | 1106 | local_irq_restore(flags); |
1107 | put_cpu(); | 1107 | put_cpu(); |
1108 | 1108 | ||
1109 | VM_BUG_ON(bad_range(zone, page)); | 1109 | VM_BUG_ON(bad_range(zone, page)); |
1110 | if (prep_new_page(page, order, gfp_flags)) | 1110 | if (prep_new_page(page, order, gfp_flags)) |
1111 | goto again; | 1111 | goto again; |
1112 | return page; | 1112 | return page; |
1113 | 1113 | ||
1114 | failed: | 1114 | failed: |
1115 | local_irq_restore(flags); | 1115 | local_irq_restore(flags); |
1116 | put_cpu(); | 1116 | put_cpu(); |
1117 | return NULL; | 1117 | return NULL; |
1118 | } | 1118 | } |
1119 | 1119 | ||
1120 | #define ALLOC_NO_WATERMARKS 0x01 /* don't check watermarks at all */ | 1120 | #define ALLOC_NO_WATERMARKS 0x01 /* don't check watermarks at all */ |
1121 | #define ALLOC_WMARK_MIN 0x02 /* use pages_min watermark */ | 1121 | #define ALLOC_WMARK_MIN 0x02 /* use pages_min watermark */ |
1122 | #define ALLOC_WMARK_LOW 0x04 /* use pages_low watermark */ | 1122 | #define ALLOC_WMARK_LOW 0x04 /* use pages_low watermark */ |
1123 | #define ALLOC_WMARK_HIGH 0x08 /* use pages_high watermark */ | 1123 | #define ALLOC_WMARK_HIGH 0x08 /* use pages_high watermark */ |
1124 | #define ALLOC_HARDER 0x10 /* try to alloc harder */ | 1124 | #define ALLOC_HARDER 0x10 /* try to alloc harder */ |
1125 | #define ALLOC_HIGH 0x20 /* __GFP_HIGH set */ | 1125 | #define ALLOC_HIGH 0x20 /* __GFP_HIGH set */ |
1126 | #define ALLOC_CPUSET 0x40 /* check for correct cpuset */ | 1126 | #define ALLOC_CPUSET 0x40 /* check for correct cpuset */ |
1127 | 1127 | ||
1128 | #ifdef CONFIG_FAIL_PAGE_ALLOC | 1128 | #ifdef CONFIG_FAIL_PAGE_ALLOC |
1129 | 1129 | ||
1130 | static struct fail_page_alloc_attr { | 1130 | static struct fail_page_alloc_attr { |
1131 | struct fault_attr attr; | 1131 | struct fault_attr attr; |
1132 | 1132 | ||
1133 | u32 ignore_gfp_highmem; | 1133 | u32 ignore_gfp_highmem; |
1134 | u32 ignore_gfp_wait; | 1134 | u32 ignore_gfp_wait; |
1135 | u32 min_order; | 1135 | u32 min_order; |
1136 | 1136 | ||
1137 | #ifdef CONFIG_FAULT_INJECTION_DEBUG_FS | 1137 | #ifdef CONFIG_FAULT_INJECTION_DEBUG_FS |
1138 | 1138 | ||
1139 | struct dentry *ignore_gfp_highmem_file; | 1139 | struct dentry *ignore_gfp_highmem_file; |
1140 | struct dentry *ignore_gfp_wait_file; | 1140 | struct dentry *ignore_gfp_wait_file; |
1141 | struct dentry *min_order_file; | 1141 | struct dentry *min_order_file; |
1142 | 1142 | ||
1143 | #endif /* CONFIG_FAULT_INJECTION_DEBUG_FS */ | 1143 | #endif /* CONFIG_FAULT_INJECTION_DEBUG_FS */ |
1144 | 1144 | ||
1145 | } fail_page_alloc = { | 1145 | } fail_page_alloc = { |
1146 | .attr = FAULT_ATTR_INITIALIZER, | 1146 | .attr = FAULT_ATTR_INITIALIZER, |
1147 | .ignore_gfp_wait = 1, | 1147 | .ignore_gfp_wait = 1, |
1148 | .ignore_gfp_highmem = 1, | 1148 | .ignore_gfp_highmem = 1, |
1149 | .min_order = 1, | 1149 | .min_order = 1, |
1150 | }; | 1150 | }; |
1151 | 1151 | ||
1152 | static int __init setup_fail_page_alloc(char *str) | 1152 | static int __init setup_fail_page_alloc(char *str) |
1153 | { | 1153 | { |
1154 | return setup_fault_attr(&fail_page_alloc.attr, str); | 1154 | return setup_fault_attr(&fail_page_alloc.attr, str); |
1155 | } | 1155 | } |
1156 | __setup("fail_page_alloc=", setup_fail_page_alloc); | 1156 | __setup("fail_page_alloc=", setup_fail_page_alloc); |
1157 | 1157 | ||
1158 | static int should_fail_alloc_page(gfp_t gfp_mask, unsigned int order) | 1158 | static int should_fail_alloc_page(gfp_t gfp_mask, unsigned int order) |
1159 | { | 1159 | { |
1160 | if (order < fail_page_alloc.min_order) | 1160 | if (order < fail_page_alloc.min_order) |
1161 | return 0; | 1161 | return 0; |
1162 | if (gfp_mask & __GFP_NOFAIL) | 1162 | if (gfp_mask & __GFP_NOFAIL) |
1163 | return 0; | 1163 | return 0; |
1164 | if (fail_page_alloc.ignore_gfp_highmem && (gfp_mask & __GFP_HIGHMEM)) | 1164 | if (fail_page_alloc.ignore_gfp_highmem && (gfp_mask & __GFP_HIGHMEM)) |
1165 | return 0; | 1165 | return 0; |
1166 | if (fail_page_alloc.ignore_gfp_wait && (gfp_mask & __GFP_WAIT)) | 1166 | if (fail_page_alloc.ignore_gfp_wait && (gfp_mask & __GFP_WAIT)) |
1167 | return 0; | 1167 | return 0; |
1168 | 1168 | ||
1169 | return should_fail(&fail_page_alloc.attr, 1 << order); | 1169 | return should_fail(&fail_page_alloc.attr, 1 << order); |
1170 | } | 1170 | } |
1171 | 1171 | ||
1172 | #ifdef CONFIG_FAULT_INJECTION_DEBUG_FS | 1172 | #ifdef CONFIG_FAULT_INJECTION_DEBUG_FS |
1173 | 1173 | ||
1174 | static int __init fail_page_alloc_debugfs(void) | 1174 | static int __init fail_page_alloc_debugfs(void) |
1175 | { | 1175 | { |
1176 | mode_t mode = S_IFREG | S_IRUSR | S_IWUSR; | 1176 | mode_t mode = S_IFREG | S_IRUSR | S_IWUSR; |
1177 | struct dentry *dir; | 1177 | struct dentry *dir; |
1178 | int err; | 1178 | int err; |
1179 | 1179 | ||
1180 | err = init_fault_attr_dentries(&fail_page_alloc.attr, | 1180 | err = init_fault_attr_dentries(&fail_page_alloc.attr, |
1181 | "fail_page_alloc"); | 1181 | "fail_page_alloc"); |
1182 | if (err) | 1182 | if (err) |
1183 | return err; | 1183 | return err; |
1184 | dir = fail_page_alloc.attr.dentries.dir; | 1184 | dir = fail_page_alloc.attr.dentries.dir; |
1185 | 1185 | ||
1186 | fail_page_alloc.ignore_gfp_wait_file = | 1186 | fail_page_alloc.ignore_gfp_wait_file = |
1187 | debugfs_create_bool("ignore-gfp-wait", mode, dir, | 1187 | debugfs_create_bool("ignore-gfp-wait", mode, dir, |
1188 | &fail_page_alloc.ignore_gfp_wait); | 1188 | &fail_page_alloc.ignore_gfp_wait); |
1189 | 1189 | ||
1190 | fail_page_alloc.ignore_gfp_highmem_file = | 1190 | fail_page_alloc.ignore_gfp_highmem_file = |
1191 | debugfs_create_bool("ignore-gfp-highmem", mode, dir, | 1191 | debugfs_create_bool("ignore-gfp-highmem", mode, dir, |
1192 | &fail_page_alloc.ignore_gfp_highmem); | 1192 | &fail_page_alloc.ignore_gfp_highmem); |
1193 | fail_page_alloc.min_order_file = | 1193 | fail_page_alloc.min_order_file = |
1194 | debugfs_create_u32("min-order", mode, dir, | 1194 | debugfs_create_u32("min-order", mode, dir, |
1195 | &fail_page_alloc.min_order); | 1195 | &fail_page_alloc.min_order); |
1196 | 1196 | ||
1197 | if (!fail_page_alloc.ignore_gfp_wait_file || | 1197 | if (!fail_page_alloc.ignore_gfp_wait_file || |
1198 | !fail_page_alloc.ignore_gfp_highmem_file || | 1198 | !fail_page_alloc.ignore_gfp_highmem_file || |
1199 | !fail_page_alloc.min_order_file) { | 1199 | !fail_page_alloc.min_order_file) { |
1200 | err = -ENOMEM; | 1200 | err = -ENOMEM; |
1201 | debugfs_remove(fail_page_alloc.ignore_gfp_wait_file); | 1201 | debugfs_remove(fail_page_alloc.ignore_gfp_wait_file); |
1202 | debugfs_remove(fail_page_alloc.ignore_gfp_highmem_file); | 1202 | debugfs_remove(fail_page_alloc.ignore_gfp_highmem_file); |
1203 | debugfs_remove(fail_page_alloc.min_order_file); | 1203 | debugfs_remove(fail_page_alloc.min_order_file); |
1204 | cleanup_fault_attr_dentries(&fail_page_alloc.attr); | 1204 | cleanup_fault_attr_dentries(&fail_page_alloc.attr); |
1205 | } | 1205 | } |
1206 | 1206 | ||
1207 | return err; | 1207 | return err; |
1208 | } | 1208 | } |
1209 | 1209 | ||
1210 | late_initcall(fail_page_alloc_debugfs); | 1210 | late_initcall(fail_page_alloc_debugfs); |
1211 | 1211 | ||
1212 | #endif /* CONFIG_FAULT_INJECTION_DEBUG_FS */ | 1212 | #endif /* CONFIG_FAULT_INJECTION_DEBUG_FS */ |
1213 | 1213 | ||
1214 | #else /* CONFIG_FAIL_PAGE_ALLOC */ | 1214 | #else /* CONFIG_FAIL_PAGE_ALLOC */ |
1215 | 1215 | ||
1216 | static inline int should_fail_alloc_page(gfp_t gfp_mask, unsigned int order) | 1216 | static inline int should_fail_alloc_page(gfp_t gfp_mask, unsigned int order) |
1217 | { | 1217 | { |
1218 | return 0; | 1218 | return 0; |
1219 | } | 1219 | } |
1220 | 1220 | ||
1221 | #endif /* CONFIG_FAIL_PAGE_ALLOC */ | 1221 | #endif /* CONFIG_FAIL_PAGE_ALLOC */ |
1222 | 1222 | ||
1223 | /* | 1223 | /* |
1224 | * Return 1 if free pages are above 'mark'. This takes into account the order | 1224 | * Return 1 if free pages are above 'mark'. This takes into account the order |
1225 | * of the allocation. | 1225 | * of the allocation. |
1226 | */ | 1226 | */ |
1227 | int zone_watermark_ok(struct zone *z, int order, unsigned long mark, | 1227 | int zone_watermark_ok(struct zone *z, int order, unsigned long mark, |
1228 | int classzone_idx, int alloc_flags) | 1228 | int classzone_idx, int alloc_flags) |
1229 | { | 1229 | { |
1230 | /* free_pages may go negative - that's OK */ | 1230 | /* free_pages may go negative - that's OK */ |
1231 | long min = mark; | 1231 | long min = mark; |
1232 | long free_pages = zone_page_state(z, NR_FREE_PAGES) - (1 << order) + 1; | 1232 | long free_pages = zone_page_state(z, NR_FREE_PAGES) - (1 << order) + 1; |
1233 | int o; | 1233 | int o; |
1234 | 1234 | ||
1235 | if (alloc_flags & ALLOC_HIGH) | 1235 | if (alloc_flags & ALLOC_HIGH) |
1236 | min -= min / 2; | 1236 | min -= min / 2; |
1237 | if (alloc_flags & ALLOC_HARDER) | 1237 | if (alloc_flags & ALLOC_HARDER) |
1238 | min -= min / 4; | 1238 | min -= min / 4; |
1239 | 1239 | ||
1240 | if (free_pages <= min + z->lowmem_reserve[classzone_idx]) | 1240 | if (free_pages <= min + z->lowmem_reserve[classzone_idx]) |
1241 | return 0; | 1241 | return 0; |
1242 | for (o = 0; o < order; o++) { | 1242 | for (o = 0; o < order; o++) { |
1243 | /* At the next order, this order's pages become unavailable */ | 1243 | /* At the next order, this order's pages become unavailable */ |
1244 | free_pages -= z->free_area[o].nr_free << o; | 1244 | free_pages -= z->free_area[o].nr_free << o; |
1245 | 1245 | ||
1246 | /* Require fewer higher order pages to be free */ | 1246 | /* Require fewer higher order pages to be free */ |
1247 | min >>= 1; | 1247 | min >>= 1; |
1248 | 1248 | ||
1249 | if (free_pages <= min) | 1249 | if (free_pages <= min) |
1250 | return 0; | 1250 | return 0; |
1251 | } | 1251 | } |
1252 | return 1; | 1252 | return 1; |
1253 | } | 1253 | } |
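For each order below the request, zone_watermark_ok() above discounts the free pages that are too small to help and halves the reserve it still requires. A restatement over a plain array of per-order free counts, with invented numbers; this is an editorial sketch, not part of the diff, and the lowmem_reserve term and the "- (1 << order) + 1" adjustment are omitted for brevity.

#include <stdio.h>

#define MAX_ORDER 11

/*
 * Simplified form of the check above: nr_free[o] is the number of free
 * blocks of order o in the zone.
 */
static int watermark_ok(long free_pages, long mark, int order,
                        const long nr_free[MAX_ORDER])
{
        long min = mark;
        int o;

        if (free_pages <= min)
                return 0;
        for (o = 0; o < order; o++) {
                /* blocks of this order cannot satisfy a higher-order request */
                free_pages -= nr_free[o] << o;
                /* require fewer, but higher-order, pages to remain free */
                min >>= 1;
                if (free_pages <= min)
                        return 0;
        }
        return 1;
}

int main(void)
{
        long nr_free[MAX_ORDER] = { 800, 100, 20, 4, 1 };
        long free_pages = 800 + 100*2 + 20*4 + 4*8 + 1*16;      /* 1128 */

        printf("order-3 request ok: %d\n", watermark_ok(free_pages, 128, 3, nr_free));
        return 0;
}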
1254 | 1254 | ||
1255 | #ifdef CONFIG_NUMA | 1255 | #ifdef CONFIG_NUMA |
1256 | /* | 1256 | /* |
1257 | * zlc_setup - Setup for "zonelist cache". Uses cached zone data to | 1257 | * zlc_setup - Setup for "zonelist cache". Uses cached zone data to |
1258 | * skip over zones that are not allowed by the cpuset, or that have | 1258 | * skip over zones that are not allowed by the cpuset, or that have |
1259 | * been recently (in last second) found to be nearly full. See further | 1259 | * been recently (in last second) found to be nearly full. See further |
1260 | * comments in mmzone.h. Reduces cache footprint of zonelist scans | 1260 | * comments in mmzone.h. Reduces cache footprint of zonelist scans |
1261 | * that have to skip over a lot of full or unallowed zones. | 1261 | * that have to skip over a lot of full or unallowed zones. |
1262 | * | 1262 | * |
1263 | * If the zonelist cache is present in the passed in zonelist, then | 1263 | * If the zonelist cache is present in the passed in zonelist, then |
1264 | * returns a pointer to the allowed node mask (either the current | 1264 | * returns a pointer to the allowed node mask (either the current |
1265 | * task's mems_allowed, or node_states[N_HIGH_MEMORY].) | 1265 | * task's mems_allowed, or node_states[N_HIGH_MEMORY].) |
1266 | * | 1266 | * |
1267 | * If the zonelist cache is not available for this zonelist, does | 1267 | * If the zonelist cache is not available for this zonelist, does |
1268 | * nothing and returns NULL. | 1268 | * nothing and returns NULL. |
1269 | * | 1269 | * |
1270 | * If the fullzones BITMAP in the zonelist cache is stale (more than | 1270 | * If the fullzones BITMAP in the zonelist cache is stale (more than |
1271 | * a second since last zap'd) then we zap it out (clear its bits.) | 1271 | * a second since last zap'd) then we zap it out (clear its bits.) |
1272 | * | 1272 | * |
1273 | * We hold off even calling zlc_setup, until after we've checked the | 1273 | * We hold off even calling zlc_setup, until after we've checked the |
1274 | * first zone in the zonelist, on the theory that most allocations will | 1274 | * first zone in the zonelist, on the theory that most allocations will |
1275 | * be satisfied from that first zone, so best to examine that zone as | 1275 | * be satisfied from that first zone, so best to examine that zone as |
1276 | * quickly as we can. | 1276 | * quickly as we can. |
1277 | */ | 1277 | */ |
1278 | static nodemask_t *zlc_setup(struct zonelist *zonelist, int alloc_flags) | 1278 | static nodemask_t *zlc_setup(struct zonelist *zonelist, int alloc_flags) |
1279 | { | 1279 | { |
1280 | struct zonelist_cache *zlc; /* cached zonelist speedup info */ | 1280 | struct zonelist_cache *zlc; /* cached zonelist speedup info */ |
1281 | nodemask_t *allowednodes; /* zonelist_cache approximation */ | 1281 | nodemask_t *allowednodes; /* zonelist_cache approximation */ |
1282 | 1282 | ||
1283 | zlc = zonelist->zlcache_ptr; | 1283 | zlc = zonelist->zlcache_ptr; |
1284 | if (!zlc) | 1284 | if (!zlc) |
1285 | return NULL; | 1285 | return NULL; |
1286 | 1286 | ||
1287 | if (time_after(jiffies, zlc->last_full_zap + HZ)) { | 1287 | if (time_after(jiffies, zlc->last_full_zap + HZ)) { |
1288 | bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST); | 1288 | bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST); |
1289 | zlc->last_full_zap = jiffies; | 1289 | zlc->last_full_zap = jiffies; |
1290 | } | 1290 | } |
1291 | 1291 | ||
1292 | allowednodes = !in_interrupt() && (alloc_flags & ALLOC_CPUSET) ? | 1292 | allowednodes = !in_interrupt() && (alloc_flags & ALLOC_CPUSET) ? |
1293 | &cpuset_current_mems_allowed : | 1293 | &cpuset_current_mems_allowed : |
1294 | &node_states[N_HIGH_MEMORY]; | 1294 | &node_states[N_HIGH_MEMORY]; |
1295 | return allowednodes; | 1295 | return allowednodes; |
1296 | } | 1296 | } |
1297 | 1297 | ||
1298 | /* | 1298 | /* |
1299 | * Given 'z' scanning a zonelist, run a couple of quick checks to see | 1299 | * Given 'z' scanning a zonelist, run a couple of quick checks to see |
1300 | * if it is worth looking at further for free memory: | 1300 | * if it is worth looking at further for free memory: |
1301 | * 1) Check that the zone isn't thought to be full (doesn't have its | 1301 | * 1) Check that the zone isn't thought to be full (doesn't have its |
1302 | * bit set in the zonelist_cache fullzones BITMAP). | 1302 | * bit set in the zonelist_cache fullzones BITMAP). |
1303 | * 2) Check that the zone's node (obtained from the zonelist_cache | 1303 | * 2) Check that the zone's node (obtained from the zonelist_cache |
1304 | * z_to_n[] mapping) is allowed in the passed in allowednodes mask. | 1304 | * z_to_n[] mapping) is allowed in the passed in allowednodes mask. |
1305 | * Return true (non-zero) if zone is worth looking at further, or | 1305 | * Return true (non-zero) if zone is worth looking at further, or |
1306 | * else return false (zero) if it is not. | 1306 | * else return false (zero) if it is not. |
1307 | * | 1307 | * |
1308 | * This check -ignores- the distinction between various watermarks, | 1308 | * This check -ignores- the distinction between various watermarks, |
1309 | * such as GFP_HIGH, GFP_ATOMIC, PF_MEMALLOC, ... If a zone is | 1309 | * such as GFP_HIGH, GFP_ATOMIC, PF_MEMALLOC, ... If a zone is |
1310 | * found to be full for any variation of these watermarks, it will | 1310 | * found to be full for any variation of these watermarks, it will |
1311 | * be considered full for up to one second by all requests, unless | 1311 | * be considered full for up to one second by all requests, unless |
1312 | * we are so low on memory on all allowed nodes that we are forced | 1312 | * we are so low on memory on all allowed nodes that we are forced |
1313 | * into the second scan of the zonelist. | 1313 | * into the second scan of the zonelist. |
1314 | * | 1314 | * |
1315 | * In the second scan we ignore this zonelist cache and exactly | 1315 | * In the second scan we ignore this zonelist cache and exactly |
1316 | * apply the watermarks to all zones, even if it is slower to do so. | 1316 | * apply the watermarks to all zones, even if it is slower to do so. |
1317 | * We are low on memory in the second scan, and should leave no stone | 1317 | * We are low on memory in the second scan, and should leave no stone |
1318 | * unturned looking for a free page. | 1318 | * unturned looking for a free page. |
1319 | */ | 1319 | */ |
1320 | static int zlc_zone_worth_trying(struct zonelist *zonelist, struct zoneref *z, | 1320 | static int zlc_zone_worth_trying(struct zonelist *zonelist, struct zoneref *z, |
1321 | nodemask_t *allowednodes) | 1321 | nodemask_t *allowednodes) |
1322 | { | 1322 | { |
1323 | struct zonelist_cache *zlc; /* cached zonelist speedup info */ | 1323 | struct zonelist_cache *zlc; /* cached zonelist speedup info */ |
1324 | int i; /* index of *z in zonelist zones */ | 1324 | int i; /* index of *z in zonelist zones */ |
1325 | int n; /* node that zone *z is on */ | 1325 | int n; /* node that zone *z is on */ |
1326 | 1326 | ||
1327 | zlc = zonelist->zlcache_ptr; | 1327 | zlc = zonelist->zlcache_ptr; |
1328 | if (!zlc) | 1328 | if (!zlc) |
1329 | return 1; | 1329 | return 1; |
1330 | 1330 | ||
1331 | i = z - zonelist->_zonerefs; | 1331 | i = z - zonelist->_zonerefs; |
1332 | n = zlc->z_to_n[i]; | 1332 | n = zlc->z_to_n[i]; |
1333 | 1333 | ||
1334 | /* This zone is worth trying if it is allowed but not full */ | 1334 | /* This zone is worth trying if it is allowed but not full */ |
1335 | return node_isset(n, *allowednodes) && !test_bit(i, zlc->fullzones); | 1335 | return node_isset(n, *allowednodes) && !test_bit(i, zlc->fullzones); |
1336 | } | 1336 | } |
1337 | 1337 | ||
1338 | /* | 1338 | /* |
1339 | * Given 'z' scanning a zonelist, set the corresponding bit in | 1339 | * Given 'z' scanning a zonelist, set the corresponding bit in |
1340 | * zlc->fullzones, so that subsequent attempts to allocate a page | 1340 | * zlc->fullzones, so that subsequent attempts to allocate a page |
1341 | * from that zone don't waste time re-examining it. | 1341 | * from that zone don't waste time re-examining it. |
1342 | */ | 1342 | */ |
1343 | static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z) | 1343 | static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z) |
1344 | { | 1344 | { |
1345 | struct zonelist_cache *zlc; /* cached zonelist speedup info */ | 1345 | struct zonelist_cache *zlc; /* cached zonelist speedup info */ |
1346 | int i; /* index of *z in zonelist zones */ | 1346 | int i; /* index of *z in zonelist zones */ |
1347 | 1347 | ||
1348 | zlc = zonelist->zlcache_ptr; | 1348 | zlc = zonelist->zlcache_ptr; |
1349 | if (!zlc) | 1349 | if (!zlc) |
1350 | return; | 1350 | return; |
1351 | 1351 | ||
1352 | i = z - zonelist->_zonerefs; | 1352 | i = z - zonelist->_zonerefs; |
1353 | 1353 | ||
1354 | set_bit(i, zlc->fullzones); | 1354 | set_bit(i, zlc->fullzones); |
1355 | } | 1355 | } |
1356 | 1356 | ||
1357 | #else /* CONFIG_NUMA */ | 1357 | #else /* CONFIG_NUMA */ |
1358 | 1358 | ||
1359 | static nodemask_t *zlc_setup(struct zonelist *zonelist, int alloc_flags) | 1359 | static nodemask_t *zlc_setup(struct zonelist *zonelist, int alloc_flags) |
1360 | { | 1360 | { |
1361 | return NULL; | 1361 | return NULL; |
1362 | } | 1362 | } |
1363 | 1363 | ||
1364 | static int zlc_zone_worth_trying(struct zonelist *zonelist, struct zoneref *z, | 1364 | static int zlc_zone_worth_trying(struct zonelist *zonelist, struct zoneref *z, |
1365 | nodemask_t *allowednodes) | 1365 | nodemask_t *allowednodes) |
1366 | { | 1366 | { |
1367 | return 1; | 1367 | return 1; |
1368 | } | 1368 | } |
1369 | 1369 | ||
1370 | static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z) | 1370 | static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z) |
1371 | { | 1371 | { |
1372 | } | 1372 | } |
1373 | #endif /* CONFIG_NUMA */ | 1373 | #endif /* CONFIG_NUMA */ |
1374 | 1374 | ||
1375 | /* | 1375 | /* |
1376 | * get_page_from_freelist goes through the zonelist trying to allocate | 1376 | * get_page_from_freelist goes through the zonelist trying to allocate |
1377 | * a page. | 1377 | * a page. |
1378 | */ | 1378 | */ |
1379 | static struct page * | 1379 | static struct page * |
1380 | get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order, | 1380 | get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order, |
1381 | struct zonelist *zonelist, int high_zoneidx, int alloc_flags) | 1381 | struct zonelist *zonelist, int high_zoneidx, int alloc_flags) |
1382 | { | 1382 | { |
1383 | struct zoneref *z; | 1383 | struct zoneref *z; |
1384 | struct page *page = NULL; | 1384 | struct page *page = NULL; |
1385 | int classzone_idx; | 1385 | int classzone_idx; |
1386 | struct zone *zone, *preferred_zone; | 1386 | struct zone *zone, *preferred_zone; |
1387 | nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */ | 1387 | nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */ |
1388 | int zlc_active = 0; /* set if using zonelist_cache */ | 1388 | int zlc_active = 0; /* set if using zonelist_cache */ |
1389 | int did_zlc_setup = 0; /* just call zlc_setup() one time */ | 1389 | int did_zlc_setup = 0; /* just call zlc_setup() one time */ |
1390 | 1390 | ||
1391 | (void)first_zones_zonelist(zonelist, high_zoneidx, nodemask, | 1391 | (void)first_zones_zonelist(zonelist, high_zoneidx, nodemask, |
1392 | &preferred_zone); | 1392 | &preferred_zone); |
1393 | classzone_idx = zone_idx(preferred_zone); | 1393 | classzone_idx = zone_idx(preferred_zone); |
1394 | 1394 | ||
1395 | zonelist_scan: | 1395 | zonelist_scan: |
1396 | /* | 1396 | /* |
1397 | * Scan zonelist, looking for a zone with enough free. | 1397 | * Scan zonelist, looking for a zone with enough free. |
1398 | * See also cpuset_zone_allowed() comment in kernel/cpuset.c. | 1398 | * See also cpuset_zone_allowed() comment in kernel/cpuset.c. |
1399 | */ | 1399 | */ |
1400 | for_each_zone_zonelist_nodemask(zone, z, zonelist, | 1400 | for_each_zone_zonelist_nodemask(zone, z, zonelist, |
1401 | high_zoneidx, nodemask) { | 1401 | high_zoneidx, nodemask) { |
1402 | if (NUMA_BUILD && zlc_active && | 1402 | if (NUMA_BUILD && zlc_active && |
1403 | !zlc_zone_worth_trying(zonelist, z, allowednodes)) | 1403 | !zlc_zone_worth_trying(zonelist, z, allowednodes)) |
1404 | continue; | 1404 | continue; |
1405 | if ((alloc_flags & ALLOC_CPUSET) && | 1405 | if ((alloc_flags & ALLOC_CPUSET) && |
1406 | !cpuset_zone_allowed_softwall(zone, gfp_mask)) | 1406 | !cpuset_zone_allowed_softwall(zone, gfp_mask)) |
1407 | goto try_next_zone; | 1407 | goto try_next_zone; |
1408 | 1408 | ||
1409 | if (!(alloc_flags & ALLOC_NO_WATERMARKS)) { | 1409 | if (!(alloc_flags & ALLOC_NO_WATERMARKS)) { |
1410 | unsigned long mark; | 1410 | unsigned long mark; |
1411 | if (alloc_flags & ALLOC_WMARK_MIN) | 1411 | if (alloc_flags & ALLOC_WMARK_MIN) |
1412 | mark = zone->pages_min; | 1412 | mark = zone->pages_min; |
1413 | else if (alloc_flags & ALLOC_WMARK_LOW) | 1413 | else if (alloc_flags & ALLOC_WMARK_LOW) |
1414 | mark = zone->pages_low; | 1414 | mark = zone->pages_low; |
1415 | else | 1415 | else |
1416 | mark = zone->pages_high; | 1416 | mark = zone->pages_high; |
1417 | if (!zone_watermark_ok(zone, order, mark, | 1417 | if (!zone_watermark_ok(zone, order, mark, |
1418 | classzone_idx, alloc_flags)) { | 1418 | classzone_idx, alloc_flags)) { |
1419 | if (!zone_reclaim_mode || | 1419 | if (!zone_reclaim_mode || |
1420 | !zone_reclaim(zone, gfp_mask, order)) | 1420 | !zone_reclaim(zone, gfp_mask, order)) |
1421 | goto this_zone_full; | 1421 | goto this_zone_full; |
1422 | } | 1422 | } |
1423 | } | 1423 | } |
1424 | 1424 | ||
1425 | page = buffered_rmqueue(preferred_zone, zone, order, gfp_mask); | 1425 | page = buffered_rmqueue(preferred_zone, zone, order, gfp_mask); |
1426 | if (page) | 1426 | if (page) |
1427 | break; | 1427 | break; |
1428 | this_zone_full: | 1428 | this_zone_full: |
1429 | if (NUMA_BUILD) | 1429 | if (NUMA_BUILD) |
1430 | zlc_mark_zone_full(zonelist, z); | 1430 | zlc_mark_zone_full(zonelist, z); |
1431 | try_next_zone: | 1431 | try_next_zone: |
1432 | if (NUMA_BUILD && !did_zlc_setup) { | 1432 | if (NUMA_BUILD && !did_zlc_setup) { |
1433 | /* we do zlc_setup after the first zone is tried */ | 1433 | /* we do zlc_setup after the first zone is tried */ |
1434 | allowednodes = zlc_setup(zonelist, alloc_flags); | 1434 | allowednodes = zlc_setup(zonelist, alloc_flags); |
1435 | zlc_active = 1; | 1435 | zlc_active = 1; |
1436 | did_zlc_setup = 1; | 1436 | did_zlc_setup = 1; |
1437 | } | 1437 | } |
1438 | } | 1438 | } |
1439 | 1439 | ||
1440 | if (unlikely(NUMA_BUILD && page == NULL && zlc_active)) { | 1440 | if (unlikely(NUMA_BUILD && page == NULL && zlc_active)) { |
1441 | /* Disable zlc cache for second zonelist scan */ | 1441 | /* Disable zlc cache for second zonelist scan */ |
1442 | zlc_active = 0; | 1442 | zlc_active = 0; |
1443 | goto zonelist_scan; | 1443 | goto zonelist_scan; |
1444 | } | 1444 | } |
1445 | return page; | 1445 | return page; |
1446 | } | 1446 | } |
1447 | 1447 | ||
1448 | /* | 1448 | /* |
1449 | * This is the 'heart' of the zoned buddy allocator. | 1449 | * This is the 'heart' of the zoned buddy allocator. |
1450 | */ | 1450 | */ |
1451 | static struct page * | 1451 | static struct page * |
1452 | __alloc_pages_internal(gfp_t gfp_mask, unsigned int order, | 1452 | __alloc_pages_internal(gfp_t gfp_mask, unsigned int order, |
1453 | struct zonelist *zonelist, nodemask_t *nodemask) | 1453 | struct zonelist *zonelist, nodemask_t *nodemask) |
1454 | { | 1454 | { |
1455 | const gfp_t wait = gfp_mask & __GFP_WAIT; | 1455 | const gfp_t wait = gfp_mask & __GFP_WAIT; |
1456 | enum zone_type high_zoneidx = gfp_zone(gfp_mask); | 1456 | enum zone_type high_zoneidx = gfp_zone(gfp_mask); |
1457 | struct zoneref *z; | 1457 | struct zoneref *z; |
1458 | struct zone *zone; | 1458 | struct zone *zone; |
1459 | struct page *page; | 1459 | struct page *page; |
1460 | struct reclaim_state reclaim_state; | 1460 | struct reclaim_state reclaim_state; |
1461 | struct task_struct *p = current; | 1461 | struct task_struct *p = current; |
1462 | int do_retry; | 1462 | int do_retry; |
1463 | int alloc_flags; | 1463 | int alloc_flags; |
1464 | int did_some_progress; | 1464 | unsigned long did_some_progress; |
1465 | unsigned long pages_reclaimed = 0; | ||
1465 | 1466 | ||
1466 | might_sleep_if(wait); | 1467 | might_sleep_if(wait); |
1467 | 1468 | ||
1468 | if (should_fail_alloc_page(gfp_mask, order)) | 1469 | if (should_fail_alloc_page(gfp_mask, order)) |
1469 | return NULL; | 1470 | return NULL; |
1470 | 1471 | ||
1471 | restart: | 1472 | restart: |
1472 | z = zonelist->_zonerefs; /* the list of zones suitable for gfp_mask */ | 1473 | z = zonelist->_zonerefs; /* the list of zones suitable for gfp_mask */ |
1473 | 1474 | ||
1474 | if (unlikely(!z->zone)) { | 1475 | if (unlikely(!z->zone)) { |
1475 | /* | 1476 | /* |
1476 | * Happens if we have an empty zonelist as a result of | 1477 | * Happens if we have an empty zonelist as a result of |
1477 | * GFP_THISNODE being used on a memoryless node | 1478 | * GFP_THISNODE being used on a memoryless node |
1478 | */ | 1479 | */ |
1479 | return NULL; | 1480 | return NULL; |
1480 | } | 1481 | } |
1481 | 1482 | ||
1482 | page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order, | 1483 | page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order, |
1483 | zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET); | 1484 | zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET); |
1484 | if (page) | 1485 | if (page) |
1485 | goto got_pg; | 1486 | goto got_pg; |
1486 | 1487 | ||
1487 | /* | 1488 | /* |
1488 | * GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and | 1489 | * GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and |
1489 | * __GFP_NOWARN set) should not cause reclaim since the subsystem | 1490 | * __GFP_NOWARN set) should not cause reclaim since the subsystem |
1490 | * (f.e. slab) using GFP_THISNODE may choose to trigger reclaim | 1491 | * (f.e. slab) using GFP_THISNODE may choose to trigger reclaim |
1491 | * using a larger set of nodes after it has established that the | 1492 | * using a larger set of nodes after it has established that the |
1492 | * allowed per node queues are empty and that nodes are | 1493 | * allowed per node queues are empty and that nodes are |
1493 | * over allocated. | 1494 | * over allocated. |
1494 | */ | 1495 | */ |
1495 | if (NUMA_BUILD && (gfp_mask & GFP_THISNODE) == GFP_THISNODE) | 1496 | if (NUMA_BUILD && (gfp_mask & GFP_THISNODE) == GFP_THISNODE) |
1496 | goto nopage; | 1497 | goto nopage; |
1497 | 1498 | ||
1498 | for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) | 1499 | for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) |
1499 | wakeup_kswapd(zone, order); | 1500 | wakeup_kswapd(zone, order); |
1500 | 1501 | ||
1501 | /* | 1502 | /* |
1502 | * OK, we're below the kswapd watermark and have kicked background | 1503 | * OK, we're below the kswapd watermark and have kicked background |
1503 | * reclaim. Now things get more complex, so set up alloc_flags according | 1504 | * reclaim. Now things get more complex, so set up alloc_flags according |
1504 | * to how we want to proceed. | 1505 | * to how we want to proceed. |
1505 | * | 1506 | * |
1506 | * The caller may dip into page reserves a bit more if the caller | 1507 | * The caller may dip into page reserves a bit more if the caller |
1507 | * cannot run direct reclaim, or if the caller has realtime scheduling | 1508 | * cannot run direct reclaim, or if the caller has realtime scheduling |
1508 | * policy or is asking for __GFP_HIGH memory. GFP_ATOMIC requests will | 1509 | * policy or is asking for __GFP_HIGH memory. GFP_ATOMIC requests will |
1509 | * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH). | 1510 | * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH). |
1510 | */ | 1511 | */ |
1511 | alloc_flags = ALLOC_WMARK_MIN; | 1512 | alloc_flags = ALLOC_WMARK_MIN; |
1512 | if ((unlikely(rt_task(p)) && !in_interrupt()) || !wait) | 1513 | if ((unlikely(rt_task(p)) && !in_interrupt()) || !wait) |
1513 | alloc_flags |= ALLOC_HARDER; | 1514 | alloc_flags |= ALLOC_HARDER; |
1514 | if (gfp_mask & __GFP_HIGH) | 1515 | if (gfp_mask & __GFP_HIGH) |
1515 | alloc_flags |= ALLOC_HIGH; | 1516 | alloc_flags |= ALLOC_HIGH; |
1516 | if (wait) | 1517 | if (wait) |
1517 | alloc_flags |= ALLOC_CPUSET; | 1518 | alloc_flags |= ALLOC_CPUSET; |
1518 | 1519 | ||
1519 | /* | 1520 | /* |
1520 | * Go through the zonelist again. Let __GFP_HIGH and allocations | 1521 | * Go through the zonelist again. Let __GFP_HIGH and allocations |
1521 | * coming from realtime tasks go deeper into reserves. | 1522 | * coming from realtime tasks go deeper into reserves. |
1522 | * | 1523 | * |
1523 | * This is the last chance, in general, before the goto nopage. | 1524 | * This is the last chance, in general, before the goto nopage. |
1524 | * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc. | 1525 | * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc. |
1525 | * See also cpuset_zone_allowed() comment in kernel/cpuset.c. | 1526 | * See also cpuset_zone_allowed() comment in kernel/cpuset.c. |
1526 | */ | 1527 | */ |
1527 | page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist, | 1528 | page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist, |
1528 | high_zoneidx, alloc_flags); | 1529 | high_zoneidx, alloc_flags); |
1529 | if (page) | 1530 | if (page) |
1530 | goto got_pg; | 1531 | goto got_pg; |
1531 | 1532 | ||
1532 | /* This allocation should allow future memory freeing. */ | 1533 | /* This allocation should allow future memory freeing. */ |
1533 | 1534 | ||
1534 | rebalance: | 1535 | rebalance: |
1535 | if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE))) | 1536 | if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE))) |
1536 | && !in_interrupt()) { | 1537 | && !in_interrupt()) { |
1537 | if (!(gfp_mask & __GFP_NOMEMALLOC)) { | 1538 | if (!(gfp_mask & __GFP_NOMEMALLOC)) { |
1538 | nofail_alloc: | 1539 | nofail_alloc: |
1539 | /* go through the zonelist yet again, ignoring mins */ | 1540 | /* go through the zonelist yet again, ignoring mins */ |
1540 | page = get_page_from_freelist(gfp_mask, nodemask, order, | 1541 | page = get_page_from_freelist(gfp_mask, nodemask, order, |
1541 | zonelist, high_zoneidx, ALLOC_NO_WATERMARKS); | 1542 | zonelist, high_zoneidx, ALLOC_NO_WATERMARKS); |
1542 | if (page) | 1543 | if (page) |
1543 | goto got_pg; | 1544 | goto got_pg; |
1544 | if (gfp_mask & __GFP_NOFAIL) { | 1545 | if (gfp_mask & __GFP_NOFAIL) { |
1545 | congestion_wait(WRITE, HZ/50); | 1546 | congestion_wait(WRITE, HZ/50); |
1546 | goto nofail_alloc; | 1547 | goto nofail_alloc; |
1547 | } | 1548 | } |
1548 | } | 1549 | } |
1549 | goto nopage; | 1550 | goto nopage; |
1550 | } | 1551 | } |
1551 | 1552 | ||
1552 | /* Atomic allocations - we can't balance anything */ | 1553 | /* Atomic allocations - we can't balance anything */ |
1553 | if (!wait) | 1554 | if (!wait) |
1554 | goto nopage; | 1555 | goto nopage; |
1555 | 1556 | ||
1556 | cond_resched(); | 1557 | cond_resched(); |
1557 | 1558 | ||
1558 | /* We now go into synchronous reclaim */ | 1559 | /* We now go into synchronous reclaim */ |
1559 | cpuset_memory_pressure_bump(); | 1560 | cpuset_memory_pressure_bump(); |
1560 | p->flags |= PF_MEMALLOC; | 1561 | p->flags |= PF_MEMALLOC; |
1561 | reclaim_state.reclaimed_slab = 0; | 1562 | reclaim_state.reclaimed_slab = 0; |
1562 | p->reclaim_state = &reclaim_state; | 1563 | p->reclaim_state = &reclaim_state; |
1563 | 1564 | ||
1564 | did_some_progress = try_to_free_pages(zonelist, order, gfp_mask); | 1565 | did_some_progress = try_to_free_pages(zonelist, order, gfp_mask); |
1565 | 1566 | ||
1566 | p->reclaim_state = NULL; | 1567 | p->reclaim_state = NULL; |
1567 | p->flags &= ~PF_MEMALLOC; | 1568 | p->flags &= ~PF_MEMALLOC; |
1568 | 1569 | ||
1569 | cond_resched(); | 1570 | cond_resched(); |
1570 | 1571 | ||
1571 | if (order != 0) | 1572 | if (order != 0) |
1572 | drain_all_pages(); | 1573 | drain_all_pages(); |
1573 | 1574 | ||
1574 | if (likely(did_some_progress)) { | 1575 | if (likely(did_some_progress)) { |
1575 | page = get_page_from_freelist(gfp_mask, nodemask, order, | 1576 | page = get_page_from_freelist(gfp_mask, nodemask, order, |
1576 | zonelist, high_zoneidx, alloc_flags); | 1577 | zonelist, high_zoneidx, alloc_flags); |
1577 | if (page) | 1578 | if (page) |
1578 | goto got_pg; | 1579 | goto got_pg; |
1579 | } else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) { | 1580 | } else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) { |
1580 | if (!try_set_zone_oom(zonelist, gfp_mask)) { | 1581 | if (!try_set_zone_oom(zonelist, gfp_mask)) { |
1581 | schedule_timeout_uninterruptible(1); | 1582 | schedule_timeout_uninterruptible(1); |
1582 | goto restart; | 1583 | goto restart; |
1583 | } | 1584 | } |
1584 | 1585 | ||
1585 | /* | 1586 | /* |
1586 | * Go through the zonelist yet one more time, keep | 1587 | * Go through the zonelist yet one more time, keep |
1587 | * very high watermark here, this is only to catch | 1588 | * very high watermark here, this is only to catch |
1588 | * a parallel oom killing, we must fail if we're still | 1589 | * a parallel oom killing, we must fail if we're still |
1589 | * under heavy pressure. | 1590 | * under heavy pressure. |
1590 | */ | 1591 | */ |
1591 | page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, | 1592 | page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, |
1592 | order, zonelist, high_zoneidx, | 1593 | order, zonelist, high_zoneidx, |
1593 | ALLOC_WMARK_HIGH|ALLOC_CPUSET); | 1594 | ALLOC_WMARK_HIGH|ALLOC_CPUSET); |
1594 | if (page) { | 1595 | if (page) { |
1595 | clear_zonelist_oom(zonelist, gfp_mask); | 1596 | clear_zonelist_oom(zonelist, gfp_mask); |
1596 | goto got_pg; | 1597 | goto got_pg; |
1597 | } | 1598 | } |
1598 | 1599 | ||
1599 | /* The OOM killer will not help higher order allocs so fail */ | 1600 | /* The OOM killer will not help higher order allocs so fail */ |
1600 | if (order > PAGE_ALLOC_COSTLY_ORDER) { | 1601 | if (order > PAGE_ALLOC_COSTLY_ORDER) { |
1601 | clear_zonelist_oom(zonelist, gfp_mask); | 1602 | clear_zonelist_oom(zonelist, gfp_mask); |
1602 | goto nopage; | 1603 | goto nopage; |
1603 | } | 1604 | } |
1604 | 1605 | ||
1605 | out_of_memory(zonelist, gfp_mask, order); | 1606 | out_of_memory(zonelist, gfp_mask, order); |
1606 | clear_zonelist_oom(zonelist, gfp_mask); | 1607 | clear_zonelist_oom(zonelist, gfp_mask); |
1607 | goto restart; | 1608 | goto restart; |
1608 | } | 1609 | } |
1609 | 1610 | ||
1610 | /* | 1611 | /* |
1611 | * Don't let big-order allocations loop unless the caller explicitly | 1612 | * Don't let big-order allocations loop unless the caller explicitly |
1612 | * requests that. Wait for some write requests to complete then retry. | 1613 | * requests that. Wait for some write requests to complete then retry. |
1613 | * | 1614 | * |
1614 | * In this implementation, either order <= PAGE_ALLOC_COSTLY_ORDER or | 1615 | * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER |
1615 | * __GFP_REPEAT mean __GFP_NOFAIL, but that may not be true in other | 1616 | * means __GFP_NOFAIL, but that may not be true in other |
1616 | * implementations. | 1617 | * implementations. |
1618 | * | ||
1619 | * For order > PAGE_ALLOC_COSTLY_ORDER, if __GFP_REPEAT is | ||
1620 | * specified, then we retry until we no longer reclaim any pages | ||
1621 | * (above), or we've reclaimed an order of pages at least as | ||
1622 | * large as the allocation's order. In both cases, if the | ||
1623 | * allocation still fails, we stop retrying. | ||
1617 | */ | 1624 | */ |
1625 | pages_reclaimed += did_some_progress; | ||
1618 | do_retry = 0; | 1626 | do_retry = 0; |
1619 | if (!(gfp_mask & __GFP_NORETRY)) { | 1627 | if (!(gfp_mask & __GFP_NORETRY)) { |
1620 | if ((order <= PAGE_ALLOC_COSTLY_ORDER) || | 1628 | if (order <= PAGE_ALLOC_COSTLY_ORDER) { |
1621 | (gfp_mask & __GFP_REPEAT)) | ||
1622 | do_retry = 1; | 1629 | do_retry = 1; |
1630 | } else { | ||
1631 | if (gfp_mask & __GFP_REPEAT && | ||
1632 | pages_reclaimed < (1 << order)) | ||
1633 | do_retry = 1; | ||
1634 | } | ||
1623 | if (gfp_mask & __GFP_NOFAIL) | 1635 | if (gfp_mask & __GFP_NOFAIL) |
1624 | do_retry = 1; | 1636 | do_retry = 1; |
1625 | } | 1637 | } |
1626 | if (do_retry) { | 1638 | if (do_retry) { |
1627 | congestion_wait(WRITE, HZ/50); | 1639 | congestion_wait(WRITE, HZ/50); |
1628 | goto rebalance; | 1640 | goto rebalance; |
1629 | } | 1641 | } |
1630 | 1642 | ||
1631 | nopage: | 1643 | nopage: |
1632 | if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit()) { | 1644 | if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit()) { |
1633 | printk(KERN_WARNING "%s: page allocation failure." | 1645 | printk(KERN_WARNING "%s: page allocation failure." |
1634 | " order:%d, mode:0x%x\n", | 1646 | " order:%d, mode:0x%x\n", |
1635 | p->comm, order, gfp_mask); | 1647 | p->comm, order, gfp_mask); |
1636 | dump_stack(); | 1648 | dump_stack(); |
1637 | show_mem(); | 1649 | show_mem(); |
1638 | } | 1650 | } |
1639 | got_pg: | 1651 | got_pg: |
1640 | return page; | 1652 | return page; |
1641 | } | 1653 | } |
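
The comment block above spells out the new retry policy in prose. As a minimal standalone sketch (not part of the patch; should_retry() and its boolean parameters are illustrative stand-ins for the gfp_mask bit tests), the same decision can be written as:

/*
 * Illustrative sketch only: mirrors the do_retry logic above using plain
 * booleans instead of gfp_mask bit tests.
 */
#include <stdbool.h>

#define PAGE_ALLOC_COSTLY_ORDER 3

static bool should_retry(bool noretry, bool repeat, bool nofail,
                         unsigned int order, unsigned long pages_reclaimed)
{
        if (noretry)                            /* __GFP_NORETRY: never loop */
                return false;
        if (nofail)                             /* __GFP_NOFAIL: always loop */
                return true;
        if (order <= PAGE_ALLOC_COSTLY_ORDER)   /* small orders loop as before */
                return true;
        /*
         * Costly orders: keep retrying only while __GFP_REPEAT is set and
         * the running pages_reclaimed total is still below 1 << order.
         */
        return repeat && pages_reclaimed < (1UL << order);
}

Because pages_reclaimed accumulates did_some_progress on every pass, a heavily fragmented system gets more retry attempts before a large __GFP_REPEAT allocation is finally failed.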
1642 | 1654 | ||
1643 | struct page * | 1655 | struct page * |
1644 | __alloc_pages(gfp_t gfp_mask, unsigned int order, | 1656 | __alloc_pages(gfp_t gfp_mask, unsigned int order, |
1645 | struct zonelist *zonelist) | 1657 | struct zonelist *zonelist) |
1646 | { | 1658 | { |
1647 | return __alloc_pages_internal(gfp_mask, order, zonelist, NULL); | 1659 | return __alloc_pages_internal(gfp_mask, order, zonelist, NULL); |
1648 | } | 1660 | } |
1649 | 1661 | ||
1650 | struct page * | 1662 | struct page * |
1651 | __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, | 1663 | __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, |
1652 | struct zonelist *zonelist, nodemask_t *nodemask) | 1664 | struct zonelist *zonelist, nodemask_t *nodemask) |
1653 | { | 1665 | { |
1654 | return __alloc_pages_internal(gfp_mask, order, zonelist, nodemask); | 1666 | return __alloc_pages_internal(gfp_mask, order, zonelist, nodemask); |
1655 | } | 1667 | } |
1656 | 1668 | ||
1657 | EXPORT_SYMBOL(__alloc_pages); | 1669 | EXPORT_SYMBOL(__alloc_pages); |
1658 | 1670 | ||
1659 | /* | 1671 | /* |
1660 | * Common helper functions. | 1672 | * Common helper functions. |
1661 | */ | 1673 | */ |
1662 | unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order) | 1674 | unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order) |
1663 | { | 1675 | { |
1664 | struct page * page; | 1676 | struct page * page; |
1665 | page = alloc_pages(gfp_mask, order); | 1677 | page = alloc_pages(gfp_mask, order); |
1666 | if (!page) | 1678 | if (!page) |
1667 | return 0; | 1679 | return 0; |
1668 | return (unsigned long) page_address(page); | 1680 | return (unsigned long) page_address(page); |
1669 | } | 1681 | } |
1670 | 1682 | ||
1671 | EXPORT_SYMBOL(__get_free_pages); | 1683 | EXPORT_SYMBOL(__get_free_pages); |
1672 | 1684 | ||
1673 | unsigned long get_zeroed_page(gfp_t gfp_mask) | 1685 | unsigned long get_zeroed_page(gfp_t gfp_mask) |
1674 | { | 1686 | { |
1675 | struct page * page; | 1687 | struct page * page; |
1676 | 1688 | ||
1677 | /* | 1689 | /* |
1678 | * get_zeroed_page() returns a 32-bit address, which cannot represent | 1690 | * get_zeroed_page() returns a 32-bit address, which cannot represent |
1679 | * a highmem page | 1691 | * a highmem page |
1680 | */ | 1692 | */ |
1681 | VM_BUG_ON((gfp_mask & __GFP_HIGHMEM) != 0); | 1693 | VM_BUG_ON((gfp_mask & __GFP_HIGHMEM) != 0); |
1682 | 1694 | ||
1683 | page = alloc_pages(gfp_mask | __GFP_ZERO, 0); | 1695 | page = alloc_pages(gfp_mask | __GFP_ZERO, 0); |
1684 | if (page) | 1696 | if (page) |
1685 | return (unsigned long) page_address(page); | 1697 | return (unsigned long) page_address(page); |
1686 | return 0; | 1698 | return 0; |
1687 | } | 1699 | } |
1688 | 1700 | ||
1689 | EXPORT_SYMBOL(get_zeroed_page); | 1701 | EXPORT_SYMBOL(get_zeroed_page); |
1690 | 1702 | ||
1691 | void __pagevec_free(struct pagevec *pvec) | 1703 | void __pagevec_free(struct pagevec *pvec) |
1692 | { | 1704 | { |
1693 | int i = pagevec_count(pvec); | 1705 | int i = pagevec_count(pvec); |
1694 | 1706 | ||
1695 | while (--i >= 0) | 1707 | while (--i >= 0) |
1696 | free_hot_cold_page(pvec->pages[i], pvec->cold); | 1708 | free_hot_cold_page(pvec->pages[i], pvec->cold); |
1697 | } | 1709 | } |
1698 | 1710 | ||
1699 | void __free_pages(struct page *page, unsigned int order) | 1711 | void __free_pages(struct page *page, unsigned int order) |
1700 | { | 1712 | { |
1701 | if (put_page_testzero(page)) { | 1713 | if (put_page_testzero(page)) { |
1702 | if (order == 0) | 1714 | if (order == 0) |
1703 | free_hot_page(page); | 1715 | free_hot_page(page); |
1704 | else | 1716 | else |
1705 | __free_pages_ok(page, order); | 1717 | __free_pages_ok(page, order); |
1706 | } | 1718 | } |
1707 | } | 1719 | } |
1708 | 1720 | ||
1709 | EXPORT_SYMBOL(__free_pages); | 1721 | EXPORT_SYMBOL(__free_pages); |
1710 | 1722 | ||
1711 | void free_pages(unsigned long addr, unsigned int order) | 1723 | void free_pages(unsigned long addr, unsigned int order) |
1712 | { | 1724 | { |
1713 | if (addr != 0) { | 1725 | if (addr != 0) { |
1714 | VM_BUG_ON(!virt_addr_valid((void *)addr)); | 1726 | VM_BUG_ON(!virt_addr_valid((void *)addr)); |
1715 | __free_pages(virt_to_page((void *)addr), order); | 1727 | __free_pages(virt_to_page((void *)addr), order); |
1716 | } | 1728 | } |
1717 | } | 1729 | } |
1718 | 1730 | ||
1719 | EXPORT_SYMBOL(free_pages); | 1731 | EXPORT_SYMBOL(free_pages); |
1720 | 1732 | ||
1721 | static unsigned int nr_free_zone_pages(int offset) | 1733 | static unsigned int nr_free_zone_pages(int offset) |
1722 | { | 1734 | { |
1723 | struct zoneref *z; | 1735 | struct zoneref *z; |
1724 | struct zone *zone; | 1736 | struct zone *zone; |
1725 | 1737 | ||
1726 | /* Just pick one node, since fallback list is circular */ | 1738 | /* Just pick one node, since fallback list is circular */ |
1727 | unsigned int sum = 0; | 1739 | unsigned int sum = 0; |
1728 | 1740 | ||
1729 | struct zonelist *zonelist = node_zonelist(numa_node_id(), GFP_KERNEL); | 1741 | struct zonelist *zonelist = node_zonelist(numa_node_id(), GFP_KERNEL); |
1730 | 1742 | ||
1731 | for_each_zone_zonelist(zone, z, zonelist, offset) { | 1743 | for_each_zone_zonelist(zone, z, zonelist, offset) { |
1732 | unsigned long size = zone->present_pages; | 1744 | unsigned long size = zone->present_pages; |
1733 | unsigned long high = zone->pages_high; | 1745 | unsigned long high = zone->pages_high; |
1734 | if (size > high) | 1746 | if (size > high) |
1735 | sum += size - high; | 1747 | sum += size - high; |
1736 | } | 1748 | } |
1737 | 1749 | ||
1738 | return sum; | 1750 | return sum; |
1739 | } | 1751 | } |
1740 | 1752 | ||
1741 | /* | 1753 | /* |
1742 | * Amount of free RAM allocatable within ZONE_DMA and ZONE_NORMAL | 1754 | * Amount of free RAM allocatable within ZONE_DMA and ZONE_NORMAL |
1743 | */ | 1755 | */ |
1744 | unsigned int nr_free_buffer_pages(void) | 1756 | unsigned int nr_free_buffer_pages(void) |
1745 | { | 1757 | { |
1746 | return nr_free_zone_pages(gfp_zone(GFP_USER)); | 1758 | return nr_free_zone_pages(gfp_zone(GFP_USER)); |
1747 | } | 1759 | } |
1748 | EXPORT_SYMBOL_GPL(nr_free_buffer_pages); | 1760 | EXPORT_SYMBOL_GPL(nr_free_buffer_pages); |
1749 | 1761 | ||
1750 | /* | 1762 | /* |
1751 | * Amount of free RAM allocatable within all zones | 1763 | * Amount of free RAM allocatable within all zones |
1752 | */ | 1764 | */ |
1753 | unsigned int nr_free_pagecache_pages(void) | 1765 | unsigned int nr_free_pagecache_pages(void) |
1754 | { | 1766 | { |
1755 | return nr_free_zone_pages(gfp_zone(GFP_HIGHUSER_MOVABLE)); | 1767 | return nr_free_zone_pages(gfp_zone(GFP_HIGHUSER_MOVABLE)); |
1756 | } | 1768 | } |
1757 | 1769 | ||
1758 | static inline void show_node(struct zone *zone) | 1770 | static inline void show_node(struct zone *zone) |
1759 | { | 1771 | { |
1760 | if (NUMA_BUILD) | 1772 | if (NUMA_BUILD) |
1761 | printk("Node %d ", zone_to_nid(zone)); | 1773 | printk("Node %d ", zone_to_nid(zone)); |
1762 | } | 1774 | } |
1763 | 1775 | ||
1764 | void si_meminfo(struct sysinfo *val) | 1776 | void si_meminfo(struct sysinfo *val) |
1765 | { | 1777 | { |
1766 | val->totalram = totalram_pages; | 1778 | val->totalram = totalram_pages; |
1767 | val->sharedram = 0; | 1779 | val->sharedram = 0; |
1768 | val->freeram = global_page_state(NR_FREE_PAGES); | 1780 | val->freeram = global_page_state(NR_FREE_PAGES); |
1769 | val->bufferram = nr_blockdev_pages(); | 1781 | val->bufferram = nr_blockdev_pages(); |
1770 | val->totalhigh = totalhigh_pages; | 1782 | val->totalhigh = totalhigh_pages; |
1771 | val->freehigh = nr_free_highpages(); | 1783 | val->freehigh = nr_free_highpages(); |
1772 | val->mem_unit = PAGE_SIZE; | 1784 | val->mem_unit = PAGE_SIZE; |
1773 | } | 1785 | } |
1774 | 1786 | ||
1775 | EXPORT_SYMBOL(si_meminfo); | 1787 | EXPORT_SYMBOL(si_meminfo); |
1776 | 1788 | ||
1777 | #ifdef CONFIG_NUMA | 1789 | #ifdef CONFIG_NUMA |
1778 | void si_meminfo_node(struct sysinfo *val, int nid) | 1790 | void si_meminfo_node(struct sysinfo *val, int nid) |
1779 | { | 1791 | { |
1780 | pg_data_t *pgdat = NODE_DATA(nid); | 1792 | pg_data_t *pgdat = NODE_DATA(nid); |
1781 | 1793 | ||
1782 | val->totalram = pgdat->node_present_pages; | 1794 | val->totalram = pgdat->node_present_pages; |
1783 | val->freeram = node_page_state(nid, NR_FREE_PAGES); | 1795 | val->freeram = node_page_state(nid, NR_FREE_PAGES); |
1784 | #ifdef CONFIG_HIGHMEM | 1796 | #ifdef CONFIG_HIGHMEM |
1785 | val->totalhigh = pgdat->node_zones[ZONE_HIGHMEM].present_pages; | 1797 | val->totalhigh = pgdat->node_zones[ZONE_HIGHMEM].present_pages; |
1786 | val->freehigh = zone_page_state(&pgdat->node_zones[ZONE_HIGHMEM], | 1798 | val->freehigh = zone_page_state(&pgdat->node_zones[ZONE_HIGHMEM], |
1787 | NR_FREE_PAGES); | 1799 | NR_FREE_PAGES); |
1788 | #else | 1800 | #else |
1789 | val->totalhigh = 0; | 1801 | val->totalhigh = 0; |
1790 | val->freehigh = 0; | 1802 | val->freehigh = 0; |
1791 | #endif | 1803 | #endif |
1792 | val->mem_unit = PAGE_SIZE; | 1804 | val->mem_unit = PAGE_SIZE; |
1793 | } | 1805 | } |
1794 | #endif | 1806 | #endif |
1795 | 1807 | ||
1796 | #define K(x) ((x) << (PAGE_SHIFT-10)) | 1808 | #define K(x) ((x) << (PAGE_SHIFT-10)) |
1797 | 1809 | ||
1798 | /* | 1810 | /* |
1799 | * Show free area list (used inside shift_scroll-lock stuff) | 1811 | * Show free area list (used inside shift_scroll-lock stuff) |
1800 | * We also calculate the percentage fragmentation. We do this by counting the | 1812 | * We also calculate the percentage fragmentation. We do this by counting the |
1801 | * memory on each free list with the exception of the first item on the list. | 1813 | * memory on each free list with the exception of the first item on the list. |
1802 | */ | 1814 | */ |
1803 | void show_free_areas(void) | 1815 | void show_free_areas(void) |
1804 | { | 1816 | { |
1805 | int cpu; | 1817 | int cpu; |
1806 | struct zone *zone; | 1818 | struct zone *zone; |
1807 | 1819 | ||
1808 | for_each_zone(zone) { | 1820 | for_each_zone(zone) { |
1809 | if (!populated_zone(zone)) | 1821 | if (!populated_zone(zone)) |
1810 | continue; | 1822 | continue; |
1811 | 1823 | ||
1812 | show_node(zone); | 1824 | show_node(zone); |
1813 | printk("%s per-cpu:\n", zone->name); | 1825 | printk("%s per-cpu:\n", zone->name); |
1814 | 1826 | ||
1815 | for_each_online_cpu(cpu) { | 1827 | for_each_online_cpu(cpu) { |
1816 | struct per_cpu_pageset *pageset; | 1828 | struct per_cpu_pageset *pageset; |
1817 | 1829 | ||
1818 | pageset = zone_pcp(zone, cpu); | 1830 | pageset = zone_pcp(zone, cpu); |
1819 | 1831 | ||
1820 | printk("CPU %4d: hi:%5d, btch:%4d usd:%4d\n", | 1832 | printk("CPU %4d: hi:%5d, btch:%4d usd:%4d\n", |
1821 | cpu, pageset->pcp.high, | 1833 | cpu, pageset->pcp.high, |
1822 | pageset->pcp.batch, pageset->pcp.count); | 1834 | pageset->pcp.batch, pageset->pcp.count); |
1823 | } | 1835 | } |
1824 | } | 1836 | } |
1825 | 1837 | ||
1826 | printk("Active:%lu inactive:%lu dirty:%lu writeback:%lu unstable:%lu\n" | 1838 | printk("Active:%lu inactive:%lu dirty:%lu writeback:%lu unstable:%lu\n" |
1827 | " free:%lu slab:%lu mapped:%lu pagetables:%lu bounce:%lu\n", | 1839 | " free:%lu slab:%lu mapped:%lu pagetables:%lu bounce:%lu\n", |
1828 | global_page_state(NR_ACTIVE), | 1840 | global_page_state(NR_ACTIVE), |
1829 | global_page_state(NR_INACTIVE), | 1841 | global_page_state(NR_INACTIVE), |
1830 | global_page_state(NR_FILE_DIRTY), | 1842 | global_page_state(NR_FILE_DIRTY), |
1831 | global_page_state(NR_WRITEBACK), | 1843 | global_page_state(NR_WRITEBACK), |
1832 | global_page_state(NR_UNSTABLE_NFS), | 1844 | global_page_state(NR_UNSTABLE_NFS), |
1833 | global_page_state(NR_FREE_PAGES), | 1845 | global_page_state(NR_FREE_PAGES), |
1834 | global_page_state(NR_SLAB_RECLAIMABLE) + | 1846 | global_page_state(NR_SLAB_RECLAIMABLE) + |
1835 | global_page_state(NR_SLAB_UNRECLAIMABLE), | 1847 | global_page_state(NR_SLAB_UNRECLAIMABLE), |
1836 | global_page_state(NR_FILE_MAPPED), | 1848 | global_page_state(NR_FILE_MAPPED), |
1837 | global_page_state(NR_PAGETABLE), | 1849 | global_page_state(NR_PAGETABLE), |
1838 | global_page_state(NR_BOUNCE)); | 1850 | global_page_state(NR_BOUNCE)); |
1839 | 1851 | ||
1840 | for_each_zone(zone) { | 1852 | for_each_zone(zone) { |
1841 | int i; | 1853 | int i; |
1842 | 1854 | ||
1843 | if (!populated_zone(zone)) | 1855 | if (!populated_zone(zone)) |
1844 | continue; | 1856 | continue; |
1845 | 1857 | ||
1846 | show_node(zone); | 1858 | show_node(zone); |
1847 | printk("%s" | 1859 | printk("%s" |
1848 | " free:%lukB" | 1860 | " free:%lukB" |
1849 | " min:%lukB" | 1861 | " min:%lukB" |
1850 | " low:%lukB" | 1862 | " low:%lukB" |
1851 | " high:%lukB" | 1863 | " high:%lukB" |
1852 | " active:%lukB" | 1864 | " active:%lukB" |
1853 | " inactive:%lukB" | 1865 | " inactive:%lukB" |
1854 | " present:%lukB" | 1866 | " present:%lukB" |
1855 | " pages_scanned:%lu" | 1867 | " pages_scanned:%lu" |
1856 | " all_unreclaimable? %s" | 1868 | " all_unreclaimable? %s" |
1857 | "\n", | 1869 | "\n", |
1858 | zone->name, | 1870 | zone->name, |
1859 | K(zone_page_state(zone, NR_FREE_PAGES)), | 1871 | K(zone_page_state(zone, NR_FREE_PAGES)), |
1860 | K(zone->pages_min), | 1872 | K(zone->pages_min), |
1861 | K(zone->pages_low), | 1873 | K(zone->pages_low), |
1862 | K(zone->pages_high), | 1874 | K(zone->pages_high), |
1863 | K(zone_page_state(zone, NR_ACTIVE)), | 1875 | K(zone_page_state(zone, NR_ACTIVE)), |
1864 | K(zone_page_state(zone, NR_INACTIVE)), | 1876 | K(zone_page_state(zone, NR_INACTIVE)), |
1865 | K(zone->present_pages), | 1877 | K(zone->present_pages), |
1866 | zone->pages_scanned, | 1878 | zone->pages_scanned, |
1867 | (zone_is_all_unreclaimable(zone) ? "yes" : "no") | 1879 | (zone_is_all_unreclaimable(zone) ? "yes" : "no") |
1868 | ); | 1880 | ); |
1869 | printk("lowmem_reserve[]:"); | 1881 | printk("lowmem_reserve[]:"); |
1870 | for (i = 0; i < MAX_NR_ZONES; i++) | 1882 | for (i = 0; i < MAX_NR_ZONES; i++) |
1871 | printk(" %lu", zone->lowmem_reserve[i]); | 1883 | printk(" %lu", zone->lowmem_reserve[i]); |
1872 | printk("\n"); | 1884 | printk("\n"); |
1873 | } | 1885 | } |
1874 | 1886 | ||
1875 | for_each_zone(zone) { | 1887 | for_each_zone(zone) { |
1876 | unsigned long nr[MAX_ORDER], flags, order, total = 0; | 1888 | unsigned long nr[MAX_ORDER], flags, order, total = 0; |
1877 | 1889 | ||
1878 | if (!populated_zone(zone)) | 1890 | if (!populated_zone(zone)) |
1879 | continue; | 1891 | continue; |
1880 | 1892 | ||
1881 | show_node(zone); | 1893 | show_node(zone); |
1882 | printk("%s: ", zone->name); | 1894 | printk("%s: ", zone->name); |
1883 | 1895 | ||
1884 | spin_lock_irqsave(&zone->lock, flags); | 1896 | spin_lock_irqsave(&zone->lock, flags); |
1885 | for (order = 0; order < MAX_ORDER; order++) { | 1897 | for (order = 0; order < MAX_ORDER; order++) { |
1886 | nr[order] = zone->free_area[order].nr_free; | 1898 | nr[order] = zone->free_area[order].nr_free; |
1887 | total += nr[order] << order; | 1899 | total += nr[order] << order; |
1888 | } | 1900 | } |
1889 | spin_unlock_irqrestore(&zone->lock, flags); | 1901 | spin_unlock_irqrestore(&zone->lock, flags); |
1890 | for (order = 0; order < MAX_ORDER; order++) | 1902 | for (order = 0; order < MAX_ORDER; order++) |
1891 | printk("%lu*%lukB ", nr[order], K(1UL) << order); | 1903 | printk("%lu*%lukB ", nr[order], K(1UL) << order); |
1892 | printk("= %lukB\n", K(total)); | 1904 | printk("= %lukB\n", K(total)); |
1893 | } | 1905 | } |
1894 | 1906 | ||
1895 | printk("%ld total pagecache pages\n", global_page_state(NR_FILE_PAGES)); | 1907 | printk("%ld total pagecache pages\n", global_page_state(NR_FILE_PAGES)); |
1896 | 1908 | ||
1897 | show_swap_cache_info(); | 1909 | show_swap_cache_info(); |
1898 | } | 1910 | } |
1899 | 1911 | ||
1900 | static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref) | 1912 | static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref) |
1901 | { | 1913 | { |
1902 | zoneref->zone = zone; | 1914 | zoneref->zone = zone; |
1903 | zoneref->zone_idx = zone_idx(zone); | 1915 | zoneref->zone_idx = zone_idx(zone); |
1904 | } | 1916 | } |
1905 | 1917 | ||
1906 | /* | 1918 | /* |
1907 | * Builds allocation fallback zone lists. | 1919 | * Builds allocation fallback zone lists. |
1908 | * | 1920 | * |
1909 | * Add all populated zones of a node to the zonelist. | 1921 | * Add all populated zones of a node to the zonelist. |
1910 | */ | 1922 | */ |
1911 | static int build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist, | 1923 | static int build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist, |
1912 | int nr_zones, enum zone_type zone_type) | 1924 | int nr_zones, enum zone_type zone_type) |
1913 | { | 1925 | { |
1914 | struct zone *zone; | 1926 | struct zone *zone; |
1915 | 1927 | ||
1916 | BUG_ON(zone_type >= MAX_NR_ZONES); | 1928 | BUG_ON(zone_type >= MAX_NR_ZONES); |
1917 | zone_type++; | 1929 | zone_type++; |
1918 | 1930 | ||
1919 | do { | 1931 | do { |
1920 | zone_type--; | 1932 | zone_type--; |
1921 | zone = pgdat->node_zones + zone_type; | 1933 | zone = pgdat->node_zones + zone_type; |
1922 | if (populated_zone(zone)) { | 1934 | if (populated_zone(zone)) { |
1923 | zoneref_set_zone(zone, | 1935 | zoneref_set_zone(zone, |
1924 | &zonelist->_zonerefs[nr_zones++]); | 1936 | &zonelist->_zonerefs[nr_zones++]); |
1925 | check_highest_zone(zone_type); | 1937 | check_highest_zone(zone_type); |
1926 | } | 1938 | } |
1927 | 1939 | ||
1928 | } while (zone_type); | 1940 | } while (zone_type); |
1929 | return nr_zones; | 1941 | return nr_zones; |
1930 | } | 1942 | } |
1931 | 1943 | ||
1932 | 1944 | ||
1933 | /* | 1945 | /* |
1934 | * zonelist_order: | 1946 | * zonelist_order: |
1935 | * 0 = automatic detection of better ordering. | 1947 | * 0 = automatic detection of better ordering. |
1936 | * 1 = order by ([node] distance, -zonetype) | 1948 | * 1 = order by ([node] distance, -zonetype) |
1937 | * 2 = order by (-zonetype, [node] distance) | 1949 | * 2 = order by (-zonetype, [node] distance) |
1938 | * | 1950 | * |
1939 | * If not NUMA, ZONELIST_ORDER_ZONE and ZONELIST_ORDER_NODE will create | 1951 | * If not NUMA, ZONELIST_ORDER_ZONE and ZONELIST_ORDER_NODE will create |
1940 | * the same zonelist. So only NUMA can configure this param. | 1952 | * the same zonelist. So only NUMA can configure this param. |
1941 | */ | 1953 | */ |
1942 | #define ZONELIST_ORDER_DEFAULT 0 | 1954 | #define ZONELIST_ORDER_DEFAULT 0 |
1943 | #define ZONELIST_ORDER_NODE 1 | 1955 | #define ZONELIST_ORDER_NODE 1 |
1944 | #define ZONELIST_ORDER_ZONE 2 | 1956 | #define ZONELIST_ORDER_ZONE 2 |
1945 | 1957 | ||
1946 | /* zonelist order in the kernel. | 1958 | /* zonelist order in the kernel. |
1947 | * set_zonelist_order() will set this to NODE or ZONE. | 1959 | * set_zonelist_order() will set this to NODE or ZONE. |
1948 | */ | 1960 | */ |
1949 | static int current_zonelist_order = ZONELIST_ORDER_DEFAULT; | 1961 | static int current_zonelist_order = ZONELIST_ORDER_DEFAULT; |
1950 | static char zonelist_order_name[3][8] = {"Default", "Node", "Zone"}; | 1962 | static char zonelist_order_name[3][8] = {"Default", "Node", "Zone"}; |
1951 | 1963 | ||
1952 | 1964 | ||
1953 | #ifdef CONFIG_NUMA | 1965 | #ifdef CONFIG_NUMA |
1954 | /* The value the user specified; it may be changed by config */ | 1966 | /* The value the user specified; it may be changed by config */ |
1955 | static int user_zonelist_order = ZONELIST_ORDER_DEFAULT; | 1967 | static int user_zonelist_order = ZONELIST_ORDER_DEFAULT; |
1956 | /* string for sysctl */ | 1968 | /* string for sysctl */ |
1957 | #define NUMA_ZONELIST_ORDER_LEN 16 | 1969 | #define NUMA_ZONELIST_ORDER_LEN 16 |
1958 | char numa_zonelist_order[16] = "default"; | 1970 | char numa_zonelist_order[16] = "default"; |
1959 | 1971 | ||
1960 | /* | 1972 | /* |
1961 | * interface to configure zonelist ordering. | 1973 | * interface to configure zonelist ordering. |
1962 | * command line option "numa_zonelist_order" | 1974 | * command line option "numa_zonelist_order" |
1963 | * = "[dD]efault" - default, automatic configuration. | 1975 | * = "[dD]efault" - default, automatic configuration. |
1964 | * = "[nN]ode" - order by node locality, then by zone within node | 1976 | * = "[nN]ode" - order by node locality, then by zone within node |
1965 | * = "[zZ]one" - order by zone, then by locality within zone | 1977 | * = "[zZ]one" - order by zone, then by locality within zone |
1966 | */ | 1978 | */ |
1967 | 1979 | ||
1968 | static int __parse_numa_zonelist_order(char *s) | 1980 | static int __parse_numa_zonelist_order(char *s) |
1969 | { | 1981 | { |
1970 | if (*s == 'd' || *s == 'D') { | 1982 | if (*s == 'd' || *s == 'D') { |
1971 | user_zonelist_order = ZONELIST_ORDER_DEFAULT; | 1983 | user_zonelist_order = ZONELIST_ORDER_DEFAULT; |
1972 | } else if (*s == 'n' || *s == 'N') { | 1984 | } else if (*s == 'n' || *s == 'N') { |
1973 | user_zonelist_order = ZONELIST_ORDER_NODE; | 1985 | user_zonelist_order = ZONELIST_ORDER_NODE; |
1974 | } else if (*s == 'z' || *s == 'Z') { | 1986 | } else if (*s == 'z' || *s == 'Z') { |
1975 | user_zonelist_order = ZONELIST_ORDER_ZONE; | 1987 | user_zonelist_order = ZONELIST_ORDER_ZONE; |
1976 | } else { | 1988 | } else { |
1977 | printk(KERN_WARNING | 1989 | printk(KERN_WARNING |
1978 | "Ignoring invalid numa_zonelist_order value: " | 1990 | "Ignoring invalid numa_zonelist_order value: " |
1979 | "%s\n", s); | 1991 | "%s\n", s); |
1980 | return -EINVAL; | 1992 | return -EINVAL; |
1981 | } | 1993 | } |
1982 | return 0; | 1994 | return 0; |
1983 | } | 1995 | } |
1984 | 1996 | ||
1985 | static __init int setup_numa_zonelist_order(char *s) | 1997 | static __init int setup_numa_zonelist_order(char *s) |
1986 | { | 1998 | { |
1987 | if (s) | 1999 | if (s) |
1988 | return __parse_numa_zonelist_order(s); | 2000 | return __parse_numa_zonelist_order(s); |
1989 | return 0; | 2001 | return 0; |
1990 | } | 2002 | } |
1991 | early_param("numa_zonelist_order", setup_numa_zonelist_order); | 2003 | early_param("numa_zonelist_order", setup_numa_zonelist_order); |
1992 | 2004 | ||
1993 | /* | 2005 | /* |
1994 | * sysctl handler for numa_zonelist_order | 2006 | * sysctl handler for numa_zonelist_order |
1995 | */ | 2007 | */ |
1996 | int numa_zonelist_order_handler(ctl_table *table, int write, | 2008 | int numa_zonelist_order_handler(ctl_table *table, int write, |
1997 | struct file *file, void __user *buffer, size_t *length, | 2009 | struct file *file, void __user *buffer, size_t *length, |
1998 | loff_t *ppos) | 2010 | loff_t *ppos) |
1999 | { | 2011 | { |
2000 | char saved_string[NUMA_ZONELIST_ORDER_LEN]; | 2012 | char saved_string[NUMA_ZONELIST_ORDER_LEN]; |
2001 | int ret; | 2013 | int ret; |
2002 | 2014 | ||
2003 | if (write) | 2015 | if (write) |
2004 | strncpy(saved_string, (char*)table->data, | 2016 | strncpy(saved_string, (char*)table->data, |
2005 | NUMA_ZONELIST_ORDER_LEN); | 2017 | NUMA_ZONELIST_ORDER_LEN); |
2006 | ret = proc_dostring(table, write, file, buffer, length, ppos); | 2018 | ret = proc_dostring(table, write, file, buffer, length, ppos); |
2007 | if (ret) | 2019 | if (ret) |
2008 | return ret; | 2020 | return ret; |
2009 | if (write) { | 2021 | if (write) { |
2010 | int oldval = user_zonelist_order; | 2022 | int oldval = user_zonelist_order; |
2011 | if (__parse_numa_zonelist_order((char*)table->data)) { | 2023 | if (__parse_numa_zonelist_order((char*)table->data)) { |
2012 | /* | 2024 | /* |
2013 | * bogus value. restore saved string | 2025 | * bogus value. restore saved string |
2014 | */ | 2026 | */ |
2015 | strncpy((char*)table->data, saved_string, | 2027 | strncpy((char*)table->data, saved_string, |
2016 | NUMA_ZONELIST_ORDER_LEN); | 2028 | NUMA_ZONELIST_ORDER_LEN); |
2017 | user_zonelist_order = oldval; | 2029 | user_zonelist_order = oldval; |
2018 | } else if (oldval != user_zonelist_order) | 2030 | } else if (oldval != user_zonelist_order) |
2019 | build_all_zonelists(); | 2031 | build_all_zonelists(); |
2020 | } | 2032 | } |
2021 | return 0; | 2033 | return 0; |
2022 | } | 2034 | } |
2023 | 2035 | ||
2024 | 2036 | ||
2025 | #define MAX_NODE_LOAD (num_online_nodes()) | 2037 | #define MAX_NODE_LOAD (num_online_nodes()) |
2026 | static int node_load[MAX_NUMNODES]; | 2038 | static int node_load[MAX_NUMNODES]; |
2027 | 2039 | ||
2028 | /** | 2040 | /** |
2029 | * find_next_best_node - find the next node that should appear in a given node's fallback list | 2041 | * find_next_best_node - find the next node that should appear in a given node's fallback list |
2030 | * @node: node whose fallback list we're appending | 2042 | * @node: node whose fallback list we're appending |
2031 | * @used_node_mask: nodemask_t of already used nodes | 2043 | * @used_node_mask: nodemask_t of already used nodes |
2032 | * | 2044 | * |
2033 | * We use a number of factors to determine which is the next node that should | 2045 | * We use a number of factors to determine which is the next node that should |
2034 | * appear on a given node's fallback list. The node should not have appeared | 2046 | * appear on a given node's fallback list. The node should not have appeared |
2035 | * already in @node's fallback list, and it should be the next closest node | 2047 | * already in @node's fallback list, and it should be the next closest node |
2036 | * according to the distance array (which contains arbitrary distance values | 2048 | * according to the distance array (which contains arbitrary distance values |
2037 | * from each node to each node in the system), and we should also prefer nodes | 2049 | * from each node to each node in the system), and we should also prefer nodes |
2038 | * with no CPUs, since presumably they'll have very little allocation pressure | 2050 | * with no CPUs, since presumably they'll have very little allocation pressure |
2039 | * on them otherwise. | 2051 | * on them otherwise. |
2040 | * It returns -1 if no node is found. | 2052 | * It returns -1 if no node is found. |
2041 | */ | 2053 | */ |
2042 | static int find_next_best_node(int node, nodemask_t *used_node_mask) | 2054 | static int find_next_best_node(int node, nodemask_t *used_node_mask) |
2043 | { | 2055 | { |
2044 | int n, val; | 2056 | int n, val; |
2045 | int min_val = INT_MAX; | 2057 | int min_val = INT_MAX; |
2046 | int best_node = -1; | 2058 | int best_node = -1; |
2047 | node_to_cpumask_ptr(tmp, 0); | 2059 | node_to_cpumask_ptr(tmp, 0); |
2048 | 2060 | ||
2049 | /* Use the local node if we haven't already */ | 2061 | /* Use the local node if we haven't already */ |
2050 | if (!node_isset(node, *used_node_mask)) { | 2062 | if (!node_isset(node, *used_node_mask)) { |
2051 | node_set(node, *used_node_mask); | 2063 | node_set(node, *used_node_mask); |
2052 | return node; | 2064 | return node; |
2053 | } | 2065 | } |
2054 | 2066 | ||
2055 | for_each_node_state(n, N_HIGH_MEMORY) { | 2067 | for_each_node_state(n, N_HIGH_MEMORY) { |
2056 | 2068 | ||
2057 | /* Don't want a node to appear more than once */ | 2069 | /* Don't want a node to appear more than once */ |
2058 | if (node_isset(n, *used_node_mask)) | 2070 | if (node_isset(n, *used_node_mask)) |
2059 | continue; | 2071 | continue; |
2060 | 2072 | ||
2061 | /* Use the distance array to find the distance */ | 2073 | /* Use the distance array to find the distance */ |
2062 | val = node_distance(node, n); | 2074 | val = node_distance(node, n); |
2063 | 2075 | ||
2064 | /* Penalize nodes under us ("prefer the next node") */ | 2076 | /* Penalize nodes under us ("prefer the next node") */ |
2065 | val += (n < node); | 2077 | val += (n < node); |
2066 | 2078 | ||
2067 | /* Give preference to headless and unused nodes */ | 2079 | /* Give preference to headless and unused nodes */ |
2068 | node_to_cpumask_ptr_next(tmp, n); | 2080 | node_to_cpumask_ptr_next(tmp, n); |
2069 | if (!cpus_empty(*tmp)) | 2081 | if (!cpus_empty(*tmp)) |
2070 | val += PENALTY_FOR_NODE_WITH_CPUS; | 2082 | val += PENALTY_FOR_NODE_WITH_CPUS; |
2071 | 2083 | ||
2072 | /* Slight preference for less loaded node */ | 2084 | /* Slight preference for less loaded node */ |
2073 | val *= (MAX_NODE_LOAD*MAX_NUMNODES); | 2085 | val *= (MAX_NODE_LOAD*MAX_NUMNODES); |
2074 | val += node_load[n]; | 2086 | val += node_load[n]; |
2075 | 2087 | ||
2076 | if (val < min_val) { | 2088 | if (val < min_val) { |
2077 | min_val = val; | 2089 | min_val = val; |
2078 | best_node = n; | 2090 | best_node = n; |
2079 | } | 2091 | } |
2080 | } | 2092 | } |
2081 | 2093 | ||
2082 | if (best_node >= 0) | 2094 | if (best_node >= 0) |
2083 | node_set(best_node, *used_node_mask); | 2095 | node_set(best_node, *used_node_mask); |
2084 | 2096 | ||
2085 | return best_node; | 2097 | return best_node; |
2086 | } | 2098 | } |
2087 | 2099 | ||
2088 | 2100 | ||
2089 | /* | 2101 | /* |
2090 | * Build zonelists ordered by node and zones within node. | 2102 | * Build zonelists ordered by node and zones within node. |
2091 | * This results in maximum locality--normal zone overflows into local | 2103 | * This results in maximum locality--normal zone overflows into local |
2092 | * DMA zone, if any--but risks exhausting DMA zone. | 2104 | * DMA zone, if any--but risks exhausting DMA zone. |
2093 | */ | 2105 | */ |
2094 | static void build_zonelists_in_node_order(pg_data_t *pgdat, int node) | 2106 | static void build_zonelists_in_node_order(pg_data_t *pgdat, int node) |
2095 | { | 2107 | { |
2096 | int j; | 2108 | int j; |
2097 | struct zonelist *zonelist; | 2109 | struct zonelist *zonelist; |
2098 | 2110 | ||
2099 | zonelist = &pgdat->node_zonelists[0]; | 2111 | zonelist = &pgdat->node_zonelists[0]; |
2100 | for (j = 0; zonelist->_zonerefs[j].zone != NULL; j++) | 2112 | for (j = 0; zonelist->_zonerefs[j].zone != NULL; j++) |
2101 | ; | 2113 | ; |
2102 | j = build_zonelists_node(NODE_DATA(node), zonelist, j, | 2114 | j = build_zonelists_node(NODE_DATA(node), zonelist, j, |
2103 | MAX_NR_ZONES - 1); | 2115 | MAX_NR_ZONES - 1); |
2104 | zonelist->_zonerefs[j].zone = NULL; | 2116 | zonelist->_zonerefs[j].zone = NULL; |
2105 | zonelist->_zonerefs[j].zone_idx = 0; | 2117 | zonelist->_zonerefs[j].zone_idx = 0; |
2106 | } | 2118 | } |
2107 | 2119 | ||
2108 | /* | 2120 | /* |
2109 | * Build gfp_thisnode zonelists | 2121 | * Build gfp_thisnode zonelists |
2110 | */ | 2122 | */ |
2111 | static void build_thisnode_zonelists(pg_data_t *pgdat) | 2123 | static void build_thisnode_zonelists(pg_data_t *pgdat) |
2112 | { | 2124 | { |
2113 | int j; | 2125 | int j; |
2114 | struct zonelist *zonelist; | 2126 | struct zonelist *zonelist; |
2115 | 2127 | ||
2116 | zonelist = &pgdat->node_zonelists[1]; | 2128 | zonelist = &pgdat->node_zonelists[1]; |
2117 | j = build_zonelists_node(pgdat, zonelist, 0, MAX_NR_ZONES - 1); | 2129 | j = build_zonelists_node(pgdat, zonelist, 0, MAX_NR_ZONES - 1); |
2118 | zonelist->_zonerefs[j].zone = NULL; | 2130 | zonelist->_zonerefs[j].zone = NULL; |
2119 | zonelist->_zonerefs[j].zone_idx = 0; | 2131 | zonelist->_zonerefs[j].zone_idx = 0; |
2120 | } | 2132 | } |
2121 | 2133 | ||
2122 | /* | 2134 | /* |
2123 | * Build zonelists ordered by zone and nodes within zones. | 2135 | * Build zonelists ordered by zone and nodes within zones. |
2124 | * This results in conserving DMA zone[s] until all Normal memory is | 2136 | * This results in conserving DMA zone[s] until all Normal memory is |
2125 | * exhausted, but results in overflowing to a remote node while memory | 2137 | * exhausted, but results in overflowing to a remote node while memory |
2126 | * may still exist in the local DMA zone. | 2138 | * may still exist in the local DMA zone. |
2127 | */ | 2139 | */ |
2128 | static int node_order[MAX_NUMNODES]; | 2140 | static int node_order[MAX_NUMNODES]; |
2129 | 2141 | ||
2130 | static void build_zonelists_in_zone_order(pg_data_t *pgdat, int nr_nodes) | 2142 | static void build_zonelists_in_zone_order(pg_data_t *pgdat, int nr_nodes) |
2131 | { | 2143 | { |
2132 | int pos, j, node; | 2144 | int pos, j, node; |
2133 | int zone_type; /* needs to be signed */ | 2145 | int zone_type; /* needs to be signed */ |
2134 | struct zone *z; | 2146 | struct zone *z; |
2135 | struct zonelist *zonelist; | 2147 | struct zonelist *zonelist; |
2136 | 2148 | ||
2137 | zonelist = &pgdat->node_zonelists[0]; | 2149 | zonelist = &pgdat->node_zonelists[0]; |
2138 | pos = 0; | 2150 | pos = 0; |
2139 | for (zone_type = MAX_NR_ZONES - 1; zone_type >= 0; zone_type--) { | 2151 | for (zone_type = MAX_NR_ZONES - 1; zone_type >= 0; zone_type--) { |
2140 | for (j = 0; j < nr_nodes; j++) { | 2152 | for (j = 0; j < nr_nodes; j++) { |
2141 | node = node_order[j]; | 2153 | node = node_order[j]; |
2142 | z = &NODE_DATA(node)->node_zones[zone_type]; | 2154 | z = &NODE_DATA(node)->node_zones[zone_type]; |
2143 | if (populated_zone(z)) { | 2155 | if (populated_zone(z)) { |
2144 | zoneref_set_zone(z, | 2156 | zoneref_set_zone(z, |
2145 | &zonelist->_zonerefs[pos++]); | 2157 | &zonelist->_zonerefs[pos++]); |
2146 | check_highest_zone(zone_type); | 2158 | check_highest_zone(zone_type); |
2147 | } | 2159 | } |
2148 | } | 2160 | } |
2149 | } | 2161 | } |
2150 | zonelist->_zonerefs[pos].zone = NULL; | 2162 | zonelist->_zonerefs[pos].zone = NULL; |
2151 | zonelist->_zonerefs[pos].zone_idx = 0; | 2163 | zonelist->_zonerefs[pos].zone_idx = 0; |
2152 | } | 2164 | } |
2153 | 2165 | ||
2154 | static int default_zonelist_order(void) | 2166 | static int default_zonelist_order(void) |
2155 | { | 2167 | { |
2156 | int nid, zone_type; | 2168 | int nid, zone_type; |
2157 | unsigned long low_kmem_size, total_size; | 2169 | unsigned long low_kmem_size, total_size; |
2158 | struct zone *z; | 2170 | struct zone *z; |
2159 | int average_size; | 2171 | int average_size; |
2160 | /* | 2172 | /* |
2161 | * ZONE_DMA and ZONE_DMA32 can be very small areas in the system. | 2173 | * ZONE_DMA and ZONE_DMA32 can be very small areas in the system. |
2162 | * If they are really small and used heavily, the system can fall | 2174 | * If they are really small and used heavily, the system can fall |
2163 | * into OOM very easily. | 2175 | * into OOM very easily. |
2164 | * This function detects the ZONE_DMA/DMA32 size and configures the zone order. | 2176 | * This function detects the ZONE_DMA/DMA32 size and configures the zone order. |
2165 | */ | 2177 | */ |
2166 | /* Is there a ZONE_NORMAL? (e.g. ppc has only a DMA zone.) */ | 2178 | /* Is there a ZONE_NORMAL? (e.g. ppc has only a DMA zone.) */ |
2167 | low_kmem_size = 0; | 2179 | low_kmem_size = 0; |
2168 | total_size = 0; | 2180 | total_size = 0; |
2169 | for_each_online_node(nid) { | 2181 | for_each_online_node(nid) { |
2170 | for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++) { | 2182 | for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++) { |
2171 | z = &NODE_DATA(nid)->node_zones[zone_type]; | 2183 | z = &NODE_DATA(nid)->node_zones[zone_type]; |
2172 | if (populated_zone(z)) { | 2184 | if (populated_zone(z)) { |
2173 | if (zone_type < ZONE_NORMAL) | 2185 | if (zone_type < ZONE_NORMAL) |
2174 | low_kmem_size += z->present_pages; | 2186 | low_kmem_size += z->present_pages; |
2175 | total_size += z->present_pages; | 2187 | total_size += z->present_pages; |
2176 | } | 2188 | } |
2177 | } | 2189 | } |
2178 | } | 2190 | } |
2179 | if (!low_kmem_size ||  /* there is no DMA area. */ | 2191 | if (!low_kmem_size ||  /* there is no DMA area. */ |
2180 | low_kmem_size > total_size/2) /* DMA/DMA32 is big. */ | 2192 | low_kmem_size > total_size/2) /* DMA/DMA32 is big. */ |
2181 | return ZONELIST_ORDER_NODE; | 2193 | return ZONELIST_ORDER_NODE; |
2182 | /* | 2194 | /* |
2183 | * look into each node's config. | 2195 | * look into each node's config. |
2184 | * If there is a node whose DMA/DMA32 memory covers a very large share | 2196 | * If there is a node whose DMA/DMA32 memory covers a very large share |
2185 | * of its local memory, NODE_ORDER may be suitable. | 2197 | * of its local memory, NODE_ORDER may be suitable. |
2186 | */ | 2198 | */ |
2187 | average_size = total_size / | 2199 | average_size = total_size / |
2188 | (nodes_weight(node_states[N_HIGH_MEMORY]) + 1); | 2200 | (nodes_weight(node_states[N_HIGH_MEMORY]) + 1); |
2189 | for_each_online_node(nid) { | 2201 | for_each_online_node(nid) { |
2190 | low_kmem_size = 0; | 2202 | low_kmem_size = 0; |
2191 | total_size = 0; | 2203 | total_size = 0; |
2192 | for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++) { | 2204 | for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++) { |
2193 | z = &NODE_DATA(nid)->node_zones[zone_type]; | 2205 | z = &NODE_DATA(nid)->node_zones[zone_type]; |
2194 | if (populated_zone(z)) { | 2206 | if (populated_zone(z)) { |
2195 | if (zone_type < ZONE_NORMAL) | 2207 | if (zone_type < ZONE_NORMAL) |
2196 | low_kmem_size += z->present_pages; | 2208 | low_kmem_size += z->present_pages; |
2197 | total_size += z->present_pages; | 2209 | total_size += z->present_pages; |
2198 | } | 2210 | } |
2199 | } | 2211 | } |
2200 | if (low_kmem_size && | 2212 | if (low_kmem_size && |
2201 | total_size > average_size && /* ignore small node */ | 2213 | total_size > average_size && /* ignore small node */ |
2202 | low_kmem_size > total_size * 70/100) | 2214 | low_kmem_size > total_size * 70/100) |
2203 | return ZONELIST_ORDER_NODE; | 2215 | return ZONELIST_ORDER_NODE; |
2204 | } | 2216 | } |
2205 | return ZONELIST_ORDER_ZONE; | 2217 | return ZONELIST_ORDER_ZONE; |
2206 | } | 2218 | } |
2207 | 2219 | ||
2208 | static void set_zonelist_order(void) | 2220 | static void set_zonelist_order(void) |
2209 | { | 2221 | { |
2210 | if (user_zonelist_order == ZONELIST_ORDER_DEFAULT) | 2222 | if (user_zonelist_order == ZONELIST_ORDER_DEFAULT) |
2211 | current_zonelist_order = default_zonelist_order(); | 2223 | current_zonelist_order = default_zonelist_order(); |
2212 | else | 2224 | else |
2213 | current_zonelist_order = user_zonelist_order; | 2225 | current_zonelist_order = user_zonelist_order; |
2214 | } | 2226 | } |
2215 | 2227 | ||
2216 | static void build_zonelists(pg_data_t *pgdat) | 2228 | static void build_zonelists(pg_data_t *pgdat) |
2217 | { | 2229 | { |
2218 | int j, node, load; | 2230 | int j, node, load; |
2219 | enum zone_type i; | 2231 | enum zone_type i; |
2220 | nodemask_t used_mask; | 2232 | nodemask_t used_mask; |
2221 | int local_node, prev_node; | 2233 | int local_node, prev_node; |
2222 | struct zonelist *zonelist; | 2234 | struct zonelist *zonelist; |
2223 | int order = current_zonelist_order; | 2235 | int order = current_zonelist_order; |
2224 | 2236 | ||
2225 | /* initialize zonelists */ | 2237 | /* initialize zonelists */ |
2226 | for (i = 0; i < MAX_ZONELISTS; i++) { | 2238 | for (i = 0; i < MAX_ZONELISTS; i++) { |
2227 | zonelist = pgdat->node_zonelists + i; | 2239 | zonelist = pgdat->node_zonelists + i; |
2228 | zonelist->_zonerefs[0].zone = NULL; | 2240 | zonelist->_zonerefs[0].zone = NULL; |
2229 | zonelist->_zonerefs[0].zone_idx = 0; | 2241 | zonelist->_zonerefs[0].zone_idx = 0; |
2230 | } | 2242 | } |
2231 | 2243 | ||
2232 | /* NUMA-aware ordering of nodes */ | 2244 | /* NUMA-aware ordering of nodes */ |
2233 | local_node = pgdat->node_id; | 2245 | local_node = pgdat->node_id; |
2234 | load = num_online_nodes(); | 2246 | load = num_online_nodes(); |
2235 | prev_node = local_node; | 2247 | prev_node = local_node; |
2236 | nodes_clear(used_mask); | 2248 | nodes_clear(used_mask); |
2237 | 2249 | ||
2238 | memset(node_load, 0, sizeof(node_load)); | 2250 | memset(node_load, 0, sizeof(node_load)); |
2239 | memset(node_order, 0, sizeof(node_order)); | 2251 | memset(node_order, 0, sizeof(node_order)); |
2240 | j = 0; | 2252 | j = 0; |
2241 | 2253 | ||
2242 | while ((node = find_next_best_node(local_node, &used_mask)) >= 0) { | 2254 | while ((node = find_next_best_node(local_node, &used_mask)) >= 0) { |
2243 | int distance = node_distance(local_node, node); | 2255 | int distance = node_distance(local_node, node); |
2244 | 2256 | ||
2245 | /* | 2257 | /* |
2246 | * If another node is sufficiently far away then it is better | 2258 | * If another node is sufficiently far away then it is better |
2247 | * to reclaim pages in a zone before going off node. | 2259 | * to reclaim pages in a zone before going off node. |
2248 | */ | 2260 | */ |
2249 | if (distance > RECLAIM_DISTANCE) | 2261 | if (distance > RECLAIM_DISTANCE) |
2250 | zone_reclaim_mode = 1; | 2262 | zone_reclaim_mode = 1; |
2251 | 2263 | ||
2252 | /* | 2264 | /* |
2253 | * We don't want to pressure a particular node. | 2265 | * We don't want to pressure a particular node. |
2254 | * So we add a penalty to the first node in the same | 2266 | * So we add a penalty to the first node in the same |
2255 | * distance group to make it round-robin. | 2267 | * distance group to make it round-robin. |
2256 | */ | 2268 | */ |
2257 | if (distance != node_distance(local_node, prev_node)) | 2269 | if (distance != node_distance(local_node, prev_node)) |
2258 | node_load[node] = load; | 2270 | node_load[node] = load; |
2259 | 2271 | ||
2260 | prev_node = node; | 2272 | prev_node = node; |
2261 | load--; | 2273 | load--; |
2262 | if (order == ZONELIST_ORDER_NODE) | 2274 | if (order == ZONELIST_ORDER_NODE) |
2263 | build_zonelists_in_node_order(pgdat, node); | 2275 | build_zonelists_in_node_order(pgdat, node); |
2264 | else | 2276 | else |
2265 | node_order[j++] = node; /* remember order */ | 2277 | node_order[j++] = node; /* remember order */ |
2266 | } | 2278 | } |
2267 | 2279 | ||
2268 | if (order == ZONELIST_ORDER_ZONE) { | 2280 | if (order == ZONELIST_ORDER_ZONE) { |
2269 | /* calculate node order -- i.e., DMA last! */ | 2281 | /* calculate node order -- i.e., DMA last! */ |
2270 | build_zonelists_in_zone_order(pgdat, j); | 2282 | build_zonelists_in_zone_order(pgdat, j); |
2271 | } | 2283 | } |
2272 | 2284 | ||
2273 | build_thisnode_zonelists(pgdat); | 2285 | build_thisnode_zonelists(pgdat); |
2274 | } | 2286 | } |
2275 | 2287 | ||
2276 | /* Construct the zonelist performance cache - see further mmzone.h */ | 2288 | /* Construct the zonelist performance cache - see further mmzone.h */ |
2277 | static void build_zonelist_cache(pg_data_t *pgdat) | 2289 | static void build_zonelist_cache(pg_data_t *pgdat) |
2278 | { | 2290 | { |
2279 | struct zonelist *zonelist; | 2291 | struct zonelist *zonelist; |
2280 | struct zonelist_cache *zlc; | 2292 | struct zonelist_cache *zlc; |
2281 | struct zoneref *z; | 2293 | struct zoneref *z; |
2282 | 2294 | ||
2283 | zonelist = &pgdat->node_zonelists[0]; | 2295 | zonelist = &pgdat->node_zonelists[0]; |
2284 | zonelist->zlcache_ptr = zlc = &zonelist->zlcache; | 2296 | zonelist->zlcache_ptr = zlc = &zonelist->zlcache; |
2285 | bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST); | 2297 | bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST); |
2286 | for (z = zonelist->_zonerefs; z->zone; z++) | 2298 | for (z = zonelist->_zonerefs; z->zone; z++) |
2287 | zlc->z_to_n[z - zonelist->_zonerefs] = zonelist_node_idx(z); | 2299 | zlc->z_to_n[z - zonelist->_zonerefs] = zonelist_node_idx(z); |
2288 | } | 2300 | } |
2289 | 2301 | ||
2290 | 2302 | ||
2291 | #else /* CONFIG_NUMA */ | 2303 | #else /* CONFIG_NUMA */ |
2292 | 2304 | ||
2293 | static void set_zonelist_order(void) | 2305 | static void set_zonelist_order(void) |
2294 | { | 2306 | { |
2295 | current_zonelist_order = ZONELIST_ORDER_ZONE; | 2307 | current_zonelist_order = ZONELIST_ORDER_ZONE; |
2296 | } | 2308 | } |
2297 | 2309 | ||
2298 | static void build_zonelists(pg_data_t *pgdat) | 2310 | static void build_zonelists(pg_data_t *pgdat) |
2299 | { | 2311 | { |
2300 | int node, local_node; | 2312 | int node, local_node; |
2301 | enum zone_type j; | 2313 | enum zone_type j; |
2302 | struct zonelist *zonelist; | 2314 | struct zonelist *zonelist; |
2303 | 2315 | ||
2304 | local_node = pgdat->node_id; | 2316 | local_node = pgdat->node_id; |
2305 | 2317 | ||
2306 | zonelist = &pgdat->node_zonelists[0]; | 2318 | zonelist = &pgdat->node_zonelists[0]; |
2307 | j = build_zonelists_node(pgdat, zonelist, 0, MAX_NR_ZONES - 1); | 2319 | j = build_zonelists_node(pgdat, zonelist, 0, MAX_NR_ZONES - 1); |
2308 | 2320 | ||
2309 | /* | 2321 | /* |
2310 | * Now we build the zonelist so that it contains the zones | 2322 | * Now we build the zonelist so that it contains the zones |
2311 | * of all the other nodes. | 2323 | * of all the other nodes. |
2312 | * We don't want to pressure a particular node, so when | 2324 | * We don't want to pressure a particular node, so when |
2313 | * building the zones for node N, we make sure that the | 2325 | * building the zones for node N, we make sure that the |
2314 | * zones coming right after the local ones are those from | 2326 | * zones coming right after the local ones are those from |
2315 | * node N+1 (modulo N) | 2327 | * node N+1 (modulo N) |
2316 | */ | 2328 | */ |
2317 | for (node = local_node + 1; node < MAX_NUMNODES; node++) { | 2329 | for (node = local_node + 1; node < MAX_NUMNODES; node++) { |
2318 | if (!node_online(node)) | 2330 | if (!node_online(node)) |
2319 | continue; | 2331 | continue; |
2320 | j = build_zonelists_node(NODE_DATA(node), zonelist, j, | 2332 | j = build_zonelists_node(NODE_DATA(node), zonelist, j, |
2321 | MAX_NR_ZONES - 1); | 2333 | MAX_NR_ZONES - 1); |
2322 | } | 2334 | } |
2323 | for (node = 0; node < local_node; node++) { | 2335 | for (node = 0; node < local_node; node++) { |
2324 | if (!node_online(node)) | 2336 | if (!node_online(node)) |
2325 | continue; | 2337 | continue; |
2326 | j = build_zonelists_node(NODE_DATA(node), zonelist, j, | 2338 | j = build_zonelists_node(NODE_DATA(node), zonelist, j, |
2327 | MAX_NR_ZONES - 1); | 2339 | MAX_NR_ZONES - 1); |
2328 | } | 2340 | } |
2329 | 2341 | ||
2330 | zonelist->_zonerefs[j].zone = NULL; | 2342 | zonelist->_zonerefs[j].zone = NULL; |
2331 | zonelist->_zonerefs[j].zone_idx = 0; | 2343 | zonelist->_zonerefs[j].zone_idx = 0; |
2332 | } | 2344 | } |
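
The comment in the non-NUMA build_zonelists() above describes the wrap-around order in which the remote nodes are appended after the local one. The tiny standalone sketch below (the node count and local node are made-up example values) just prints that visiting order to make the wrap-around explicit:

/*
 * Illustrative sketch: the fallback node visiting order built above.
 * With 4 online nodes and local node 2 it prints: 2 3 0 1
 */
#include <stdio.h>

int main(void)
{
        int nr_nodes = 4, local_node = 2;       /* assumed example values */

        for (int i = 0; i < nr_nodes; i++)
                printf("%d ", (local_node + i) % nr_nodes);
        printf("\n");
        return 0;
}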
2333 | 2345 | ||
2334 | /* non-NUMA variant of zonelist performance cache - just NULL zlcache_ptr */ | 2346 | /* non-NUMA variant of zonelist performance cache - just NULL zlcache_ptr */ |
2335 | static void build_zonelist_cache(pg_data_t *pgdat) | 2347 | static void build_zonelist_cache(pg_data_t *pgdat) |
2336 | { | 2348 | { |
2337 | pgdat->node_zonelists[0].zlcache_ptr = NULL; | 2349 | pgdat->node_zonelists[0].zlcache_ptr = NULL; |
2338 | pgdat->node_zonelists[1].zlcache_ptr = NULL; | 2350 | pgdat->node_zonelists[1].zlcache_ptr = NULL; |
2339 | } | 2351 | } |
2340 | 2352 | ||
2341 | #endif /* CONFIG_NUMA */ | 2353 | #endif /* CONFIG_NUMA */ |
2342 | 2354 | ||
2343 | /* the return type is int just to satisfy stop_machine_run() */ | 2355 | /* the return type is int just to satisfy stop_machine_run() */ |
2344 | static int __build_all_zonelists(void *dummy) | 2356 | static int __build_all_zonelists(void *dummy) |
2345 | { | 2357 | { |
2346 | int nid; | 2358 | int nid; |
2347 | 2359 | ||
2348 | for_each_online_node(nid) { | 2360 | for_each_online_node(nid) { |
2349 | pg_data_t *pgdat = NODE_DATA(nid); | 2361 | pg_data_t *pgdat = NODE_DATA(nid); |
2350 | 2362 | ||
2351 | build_zonelists(pgdat); | 2363 | build_zonelists(pgdat); |
2352 | build_zonelist_cache(pgdat); | 2364 | build_zonelist_cache(pgdat); |
2353 | } | 2365 | } |
2354 | return 0; | 2366 | return 0; |
2355 | } | 2367 | } |
2356 | 2368 | ||
2357 | void build_all_zonelists(void) | 2369 | void build_all_zonelists(void) |
2358 | { | 2370 | { |
2359 | set_zonelist_order(); | 2371 | set_zonelist_order(); |
2360 | 2372 | ||
2361 | if (system_state == SYSTEM_BOOTING) { | 2373 | if (system_state == SYSTEM_BOOTING) { |
2362 | __build_all_zonelists(NULL); | 2374 | __build_all_zonelists(NULL); |
2363 | cpuset_init_current_mems_allowed(); | 2375 | cpuset_init_current_mems_allowed(); |
2364 | } else { | 2376 | } else { |
2365 | /* we have to stop all cpus to guarantee there is no user | 2377 | /* we have to stop all cpus to guarantee there is no user |
2366 | of zonelist */ | 2378 | of zonelist */ |
2367 | stop_machine_run(__build_all_zonelists, NULL, NR_CPUS); | 2379 | stop_machine_run(__build_all_zonelists, NULL, NR_CPUS); |
2368 | /* cpuset refresh routine should be here */ | 2380 | /* cpuset refresh routine should be here */ |
2369 | } | 2381 | } |
2370 | vm_total_pages = nr_free_pagecache_pages(); | 2382 | vm_total_pages = nr_free_pagecache_pages(); |
2371 | /* | 2383 | /* |
2372 | * Disable grouping by mobility if the number of pages in the | 2384 | * Disable grouping by mobility if the number of pages in the |
2373 | * system is too low to allow the mechanism to work. It would be | 2385 | * system is too low to allow the mechanism to work. It would be |
2374 | * more accurate, but expensive to check per-zone. This check is | 2386 | * more accurate, but expensive to check per-zone. This check is |
2375 | * made on memory-hotadd so a system can start with mobility | 2387 | * made on memory-hotadd so a system can start with mobility |
2376 | * disabled and enable it later | 2388 | * disabled and enable it later |
2377 | */ | 2389 | */ |
2378 | if (vm_total_pages < (pageblock_nr_pages * MIGRATE_TYPES)) | 2390 | if (vm_total_pages < (pageblock_nr_pages * MIGRATE_TYPES)) |
2379 | page_group_by_mobility_disabled = 1; | 2391 | page_group_by_mobility_disabled = 1; |
2380 | else | 2392 | else |
2381 | page_group_by_mobility_disabled = 0; | 2393 | page_group_by_mobility_disabled = 0; |
2382 | 2394 | ||
2383 | printk("Built %i zonelists in %s order, mobility grouping %s. " | 2395 | printk("Built %i zonelists in %s order, mobility grouping %s. " |
2384 | "Total pages: %ld\n", | 2396 | "Total pages: %ld\n", |
2385 | num_online_nodes(), | 2397 | num_online_nodes(), |
2386 | zonelist_order_name[current_zonelist_order], | 2398 | zonelist_order_name[current_zonelist_order], |
2387 | page_group_by_mobility_disabled ? "off" : "on", | 2399 | page_group_by_mobility_disabled ? "off" : "on", |
2388 | vm_total_pages); | 2400 | vm_total_pages); |
2389 | #ifdef CONFIG_NUMA | 2401 | #ifdef CONFIG_NUMA |
2390 | printk("Policy zone: %s\n", zone_names[policy_zone]); | 2402 | printk("Policy zone: %s\n", zone_names[policy_zone]); |
2391 | #endif | 2403 | #endif |
2392 | } | 2404 | } |
2393 | 2405 | ||
2394 | /* | 2406 | /* |
2395 | * Helper functions to size the waitqueue hash table. | 2407 | * Helper functions to size the waitqueue hash table. |
2396 | * Essentially these want to choose hash table sizes sufficiently | 2408 | * Essentially these want to choose hash table sizes sufficiently |
2397 | * large so that collisions trying to wait on pages are rare. | 2409 | * large so that collisions trying to wait on pages are rare. |
2398 | * But in fact, the number of active page waitqueues on typical | 2410 | * But in fact, the number of active page waitqueues on typical |
2399 | * systems is ridiculously low, less than 200. So this is even | 2411 | * systems is ridiculously low, less than 200. So this is even |
2400 | * conservative, even though it seems large. | 2412 | * conservative, even though it seems large. |
2401 | * | 2413 | * |
2402 | * The constant PAGES_PER_WAITQUEUE specifies the ratio of pages to | 2414 | * The constant PAGES_PER_WAITQUEUE specifies the ratio of pages to |
2403 | * waitqueues, i.e. the size of the waitq table given the number of pages. | 2415 | * waitqueues, i.e. the size of the waitq table given the number of pages. |
2404 | */ | 2416 | */ |
2405 | #define PAGES_PER_WAITQUEUE 256 | 2417 | #define PAGES_PER_WAITQUEUE 256 |
2406 | 2418 | ||
2407 | #ifndef CONFIG_MEMORY_HOTPLUG | 2419 | #ifndef CONFIG_MEMORY_HOTPLUG |
2408 | static inline unsigned long wait_table_hash_nr_entries(unsigned long pages) | 2420 | static inline unsigned long wait_table_hash_nr_entries(unsigned long pages) |
2409 | { | 2421 | { |
2410 | unsigned long size = 1; | 2422 | unsigned long size = 1; |
2411 | 2423 | ||
2412 | pages /= PAGES_PER_WAITQUEUE; | 2424 | pages /= PAGES_PER_WAITQUEUE; |
2413 | 2425 | ||
2414 | while (size < pages) | 2426 | while (size < pages) |
2415 | size <<= 1; | 2427 | size <<= 1; |
2416 | 2428 | ||
2417 | /* | 2429 | /* |
2418 | * Once we have dozens or even hundreds of threads sleeping | 2430 | * Once we have dozens or even hundreds of threads sleeping |
2419 | * on IO we've got bigger problems than wait queue collision. | 2431 | * on IO we've got bigger problems than wait queue collision. |
2420 | * Limit the size of the wait table to a reasonable size. | 2432 | * Limit the size of the wait table to a reasonable size. |
2421 | */ | 2433 | */ |
2422 | size = min(size, 4096UL); | 2434 | size = min(size, 4096UL); |
2423 | 2435 | ||
2424 | return max(size, 4UL); | 2436 | return max(size, 4UL); |
2425 | } | 2437 | } |
2426 | #else | 2438 | #else |
2427 | /* | 2439 | /* |
2428 | * A zone's size might be changed by hot-add, so it is not possible to determine | 2440 | * A zone's size might be changed by hot-add, so it is not possible to determine |
2429 | * a suitable size for its wait_table. So we use the maximum size now. | 2441 | * a suitable size for its wait_table. So we use the maximum size now. |
2430 | * | 2442 | * |
2431 | * The max wait table size = 4096 x sizeof(wait_queue_head_t). ie: | 2443 | * The max wait table size = 4096 x sizeof(wait_queue_head_t). ie: |
2432 | * | 2444 | * |
2433 | * i386 (preemption config) : 4096 x 16 = 64Kbyte. | 2445 | * i386 (preemption config) : 4096 x 16 = 64Kbyte. |
2434 | * ia64, x86-64 (no preemption): 4096 x 20 = 80Kbyte. | 2446 | * ia64, x86-64 (no preemption): 4096 x 20 = 80Kbyte. |
2435 | * ia64, x86-64 (preemption) : 4096 x 24 = 96Kbyte. | 2447 | * ia64, x86-64 (preemption) : 4096 x 24 = 96Kbyte. |
2436 | * | 2448 | * |
2437 | * The maximum entries are prepared when a zone's memory is (512K + 256) pages | 2449 | * The maximum entries are prepared when a zone's memory is (512K + 256) pages |
2438 | * or more by the traditional way. (See above). It equals: | 2450 | * or more by the traditional way. (See above). It equals: |
2439 | * | 2451 | * |
2440 | * i386, x86-64, powerpc(4K page size) : = ( 2G + 1M)byte. | 2452 | * i386, x86-64, powerpc(4K page size) : = ( 2G + 1M)byte. |
2441 | * ia64(16K page size) : = ( 8G + 4M)byte. | 2453 | * ia64(16K page size) : = ( 8G + 4M)byte. |
2442 | * powerpc (64K page size) : = (32G +16M)byte. | 2454 | * powerpc (64K page size) : = (32G +16M)byte. |
2443 | */ | 2455 | */ |
2444 | static inline unsigned long wait_table_hash_nr_entries(unsigned long pages) | 2456 | static inline unsigned long wait_table_hash_nr_entries(unsigned long pages) |
2445 | { | 2457 | { |
2446 | return 4096UL; | 2458 | return 4096UL; |
2447 | } | 2459 | } |
2448 | #endif | 2460 | #endif |
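As a rough worked example of the non-hotplug sizing above (numbers purely illustrative, 4K pages assumed): a 1GB zone has 262144 pages, 262144 / 256 = 1024, and the power-of-two loop lands exactly on 1024 entries, comfortably inside the [4, 4096] clamp. A tiny 512-page zone yields 512 / 256 = 2, which the final max() bumps up to the 4-entry floor.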
2449 | 2461 | ||
2450 | /* | 2462 | /* |
2451 | * This is an integer logarithm so that shifts can be used later | 2463 | * This is an integer logarithm so that shifts can be used later |
2452 | * to extract the more random high bits from the multiplicative | 2464 | * to extract the more random high bits from the multiplicative |
2453 | * hash function before the remainder is taken. | 2465 | * hash function before the remainder is taken. |
2454 | */ | 2466 | */ |
2455 | static inline unsigned long wait_table_bits(unsigned long size) | 2467 | static inline unsigned long wait_table_bits(unsigned long size) |
2456 | { | 2468 | { |
2457 | return ffz(~size); | 2469 | return ffz(~size); |
2458 | } | 2470 | } |
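For a power-of-two table size, ffz(~size) is just the index of the lowest set bit, i.e. log2(size) -- the number of hash bits needed to index the table. A minimal userspace sketch of that relationship (an illustration only; sketch_wait_table_bits() is a made-up name and __builtin_ctzl() stands in for the kernel's bit helpers):

#include <assert.h>

/* Userspace stand-in: for a power-of-two size, ffz(~size) is the index
 * of the lowest set bit, which is what __builtin_ctzl() (count trailing
 * zeros) returns as well. */
static unsigned long sketch_wait_table_bits(unsigned long size)
{
        return (unsigned long)__builtin_ctzl(size);
}

int main(void)
{
        assert(sketch_wait_table_bits(4UL) == 2);       /* minimum table */
        assert(sketch_wait_table_bits(1024UL) == 10);   /* 1GB zone example */
        assert(sketch_wait_table_bits(4096UL) == 12);   /* maximum table */
        return 0;
}

Run against the sizes above, a 1024-entry table needs 10 hash bits and the 4096-entry maximum needs 12.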
2459 | 2471 | ||
2460 | #define LONG_ALIGN(x) (((x)+(sizeof(long))-1)&~((sizeof(long))-1)) | 2472 | #define LONG_ALIGN(x) (((x)+(sizeof(long))-1)&~((sizeof(long))-1)) |
2461 | 2473 | ||
2462 | /* | 2474 | /* |
2463 | * Mark a number of pageblocks as MIGRATE_RESERVE. The number | 2475 | * Mark a number of pageblocks as MIGRATE_RESERVE. The number |
2464 | * of blocks reserved is based on zone->pages_min. The memory within the | 2476 | * of blocks reserved is based on zone->pages_min. The memory within the |
2465 | * reserve will tend to store contiguous free pages. Setting min_free_kbytes | 2477 | * reserve will tend to store contiguous free pages. Setting min_free_kbytes |
2466 | * higher will lead to a bigger reserve which will get freed as contiguous | 2478 | * higher will lead to a bigger reserve which will get freed as contiguous |
2467 | * blocks as reclaim kicks in | 2479 | * blocks as reclaim kicks in |
2468 | */ | 2480 | */ |
2469 | static void setup_zone_migrate_reserve(struct zone *zone) | 2481 | static void setup_zone_migrate_reserve(struct zone *zone) |
2470 | { | 2482 | { |
2471 | unsigned long start_pfn, pfn, end_pfn; | 2483 | unsigned long start_pfn, pfn, end_pfn; |
2472 | struct page *page; | 2484 | struct page *page; |
2473 | unsigned long reserve, block_migratetype; | 2485 | unsigned long reserve, block_migratetype; |
2474 | 2486 | ||
2475 | /* Get the start pfn, end pfn and the number of blocks to reserve */ | 2487 | /* Get the start pfn, end pfn and the number of blocks to reserve */ |
2476 | start_pfn = zone->zone_start_pfn; | 2488 | start_pfn = zone->zone_start_pfn; |
2477 | end_pfn = start_pfn + zone->spanned_pages; | 2489 | end_pfn = start_pfn + zone->spanned_pages; |
2478 | reserve = roundup(zone->pages_min, pageblock_nr_pages) >> | 2490 | reserve = roundup(zone->pages_min, pageblock_nr_pages) >> |
2479 | pageblock_order; | 2491 | pageblock_order; |
2480 | 2492 | ||
2481 | for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) { | 2493 | for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) { |
2482 | if (!pfn_valid(pfn)) | 2494 | if (!pfn_valid(pfn)) |
2483 | continue; | 2495 | continue; |
2484 | page = pfn_to_page(pfn); | 2496 | page = pfn_to_page(pfn); |
2485 | 2497 | ||
2486 | /* Blocks with reserved pages will never free, skip them. */ | 2498 | /* Blocks with reserved pages will never free, skip them. */ |
2487 | if (PageReserved(page)) | 2499 | if (PageReserved(page)) |
2488 | continue; | 2500 | continue; |
2489 | 2501 | ||
2490 | block_migratetype = get_pageblock_migratetype(page); | 2502 | block_migratetype = get_pageblock_migratetype(page); |
2491 | 2503 | ||
2492 | /* If this block is reserved, account for it */ | 2504 | /* If this block is reserved, account for it */ |
2493 | if (reserve > 0 && block_migratetype == MIGRATE_RESERVE) { | 2505 | if (reserve > 0 && block_migratetype == MIGRATE_RESERVE) { |
2494 | reserve--; | 2506 | reserve--; |
2495 | continue; | 2507 | continue; |
2496 | } | 2508 | } |
2497 | 2509 | ||
2498 | /* Suitable for reserving if this block is movable */ | 2510 | /* Suitable for reserving if this block is movable */ |
2499 | if (reserve > 0 && block_migratetype == MIGRATE_MOVABLE) { | 2511 | if (reserve > 0 && block_migratetype == MIGRATE_MOVABLE) { |
2500 | set_pageblock_migratetype(page, MIGRATE_RESERVE); | 2512 | set_pageblock_migratetype(page, MIGRATE_RESERVE); |
2501 | move_freepages_block(zone, page, MIGRATE_RESERVE); | 2513 | move_freepages_block(zone, page, MIGRATE_RESERVE); |
2502 | reserve--; | 2514 | reserve--; |
2503 | continue; | 2515 | continue; |
2504 | } | 2516 | } |
2505 | 2517 | ||
2506 | /* | 2518 | /* |
2507 | * If the reserve is met and this is a previously reserved block, | 2519 | * If the reserve is met and this is a previously reserved block, |
2508 | * take it back | 2520 | * take it back |
2509 | */ | 2521 | */ |
2510 | if (block_migratetype == MIGRATE_RESERVE) { | 2522 | if (block_migratetype == MIGRATE_RESERVE) { |
2511 | set_pageblock_migratetype(page, MIGRATE_MOVABLE); | 2523 | set_pageblock_migratetype(page, MIGRATE_MOVABLE); |
2512 | move_freepages_block(zone, page, MIGRATE_MOVABLE); | 2524 | move_freepages_block(zone, page, MIGRATE_MOVABLE); |
2513 | } | 2525 | } |
2514 | } | 2526 | } |
2515 | } | 2527 | } |
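The number of reserved pageblocks is simply pages_min rounded up to whole pageblocks. A small userspace sketch of just that calculation, assuming pageblock_order = 9 (512 pages per block, as with 2MB pageblocks on x86); the SKETCH_* names are invented for illustration:

#include <stdio.h>

#define SKETCH_PAGEBLOCK_ORDER  9                       /* assumed */
#define SKETCH_PAGEBLOCK_PAGES  (1UL << SKETCH_PAGEBLOCK_ORDER)

/* Mirrors: roundup(zone->pages_min, pageblock_nr_pages) >> pageblock_order */
static unsigned long sketch_reserve_blocks(unsigned long pages_min)
{
        unsigned long rounded = (pages_min + SKETCH_PAGEBLOCK_PAGES - 1) &
                                ~(SKETCH_PAGEBLOCK_PAGES - 1);

        return rounded >> SKETCH_PAGEBLOCK_ORDER;
}

int main(void)
{
        /* pages_min = 1000 rounds up to 1024 pages -> 2 reserved blocks */
        printf("%lu\n", sketch_reserve_blocks(1000UL));
        return 0;
}

So with pages_min = 1000 under those assumptions, two pageblocks would be tagged MIGRATE_RESERVE.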
2516 | 2528 | ||
2517 | /* | 2529 | /* |
2518 | * Initially all pages are reserved - free ones are freed | 2530 | * Initially all pages are reserved - free ones are freed |
2519 | * up by free_all_bootmem() once the early boot process is | 2531 | * up by free_all_bootmem() once the early boot process is |
2520 | * done. Non-atomic initialization, single-pass. | 2532 | * done. Non-atomic initialization, single-pass. |
2521 | */ | 2533 | */ |
2522 | void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, | 2534 | void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, |
2523 | unsigned long start_pfn, enum memmap_context context) | 2535 | unsigned long start_pfn, enum memmap_context context) |
2524 | { | 2536 | { |
2525 | struct page *page; | 2537 | struct page *page; |
2526 | unsigned long end_pfn = start_pfn + size; | 2538 | unsigned long end_pfn = start_pfn + size; |
2527 | unsigned long pfn; | 2539 | unsigned long pfn; |
2528 | struct zone *z; | 2540 | struct zone *z; |
2529 | 2541 | ||
2530 | z = &NODE_DATA(nid)->node_zones[zone]; | 2542 | z = &NODE_DATA(nid)->node_zones[zone]; |
2531 | for (pfn = start_pfn; pfn < end_pfn; pfn++) { | 2543 | for (pfn = start_pfn; pfn < end_pfn; pfn++) { |
2532 | /* | 2544 | /* |
2533 | * There can be holes in boot-time mem_map[]s | 2545 | * There can be holes in boot-time mem_map[]s |
2534 | * handed to this function. They do not | 2546 | * handed to this function. They do not |
2535 | * exist on hotplugged memory. | 2547 | * exist on hotplugged memory. |
2536 | */ | 2548 | */ |
2537 | if (context == MEMMAP_EARLY) { | 2549 | if (context == MEMMAP_EARLY) { |
2538 | if (!early_pfn_valid(pfn)) | 2550 | if (!early_pfn_valid(pfn)) |
2539 | continue; | 2551 | continue; |
2540 | if (!early_pfn_in_nid(pfn, nid)) | 2552 | if (!early_pfn_in_nid(pfn, nid)) |
2541 | continue; | 2553 | continue; |
2542 | } | 2554 | } |
2543 | page = pfn_to_page(pfn); | 2555 | page = pfn_to_page(pfn); |
2544 | set_page_links(page, zone, nid, pfn); | 2556 | set_page_links(page, zone, nid, pfn); |
2545 | init_page_count(page); | 2557 | init_page_count(page); |
2546 | reset_page_mapcount(page); | 2558 | reset_page_mapcount(page); |
2547 | SetPageReserved(page); | 2559 | SetPageReserved(page); |
2548 | /* | 2560 | /* |
2549 | * Mark the block movable so that blocks are reserved for | 2561 | * Mark the block movable so that blocks are reserved for |
2550 | * movable at startup. This will force kernel allocations | 2562 | * movable at startup. This will force kernel allocations |
2551 | * to reserve their blocks rather than leaking throughout | 2563 | * to reserve their blocks rather than leaking throughout |
2552 | * the address space during boot when many long-lived | 2564 | * the address space during boot when many long-lived |
2553 | * kernel allocations are made. Later some blocks near | 2565 | * kernel allocations are made. Later some blocks near |
2554 | * the start are marked MIGRATE_RESERVE by | 2566 | * the start are marked MIGRATE_RESERVE by |
2555 | * setup_zone_migrate_reserve() | 2567 | * setup_zone_migrate_reserve() |
2556 | * | 2568 | * |
2557 | * The bitmap is created for the zone's valid pfn range, but the | 2569 | * The bitmap is created for the zone's valid pfn range, but the |
2558 | * memmap can be created for invalid pages (for alignment), so | 2570 | * memmap can be created for invalid pages (for alignment), so |
2559 | * check here that set_pageblock_migratetype() is not called on a | 2571 | * check here that set_pageblock_migratetype() is not called on a |
2560 | * pfn outside the zone. | 2572 | * pfn outside the zone. |
2561 | */ | 2573 | */ |
2562 | if ((z->zone_start_pfn <= pfn) | 2574 | if ((z->zone_start_pfn <= pfn) |
2563 | && (pfn < z->zone_start_pfn + z->spanned_pages) | 2575 | && (pfn < z->zone_start_pfn + z->spanned_pages) |
2564 | && !(pfn & (pageblock_nr_pages - 1))) | 2576 | && !(pfn & (pageblock_nr_pages - 1))) |
2565 | set_pageblock_migratetype(page, MIGRATE_MOVABLE); | 2577 | set_pageblock_migratetype(page, MIGRATE_MOVABLE); |
2566 | 2578 | ||
2567 | INIT_LIST_HEAD(&page->lru); | 2579 | INIT_LIST_HEAD(&page->lru); |
2568 | #ifdef WANT_PAGE_VIRTUAL | 2580 | #ifdef WANT_PAGE_VIRTUAL |
2569 | /* The shift won't overflow because ZONE_NORMAL is below 4G. */ | 2581 | /* The shift won't overflow because ZONE_NORMAL is below 4G. */ |
2570 | if (!is_highmem_idx(zone)) | 2582 | if (!is_highmem_idx(zone)) |
2571 | set_page_address(page, __va(pfn << PAGE_SHIFT)); | 2583 | set_page_address(page, __va(pfn << PAGE_SHIFT)); |
2572 | #endif | 2584 | #endif |
2573 | } | 2585 | } |
2574 | } | 2586 | } |
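For clarity, the (pfn & (pageblock_nr_pages - 1)) test above only fires on pageblock-aligned frames: assuming 512-page blocks, pfn 1536 passes (1536 & 511 == 0) and its block is marked MIGRATE_MOVABLE, while pfn 1537 is skipped, so each block is typed at most once, by its first page.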
2575 | 2587 | ||
2576 | static void __meminit zone_init_free_lists(struct zone *zone) | 2588 | static void __meminit zone_init_free_lists(struct zone *zone) |
2577 | { | 2589 | { |
2578 | int order, t; | 2590 | int order, t; |
2579 | for_each_migratetype_order(order, t) { | 2591 | for_each_migratetype_order(order, t) { |
2580 | INIT_LIST_HEAD(&zone->free_area[order].free_list[t]); | 2592 | INIT_LIST_HEAD(&zone->free_area[order].free_list[t]); |
2581 | zone->free_area[order].nr_free = 0; | 2593 | zone->free_area[order].nr_free = 0; |
2582 | } | 2594 | } |
2583 | } | 2595 | } |
2584 | 2596 | ||
2585 | #ifndef __HAVE_ARCH_MEMMAP_INIT | 2597 | #ifndef __HAVE_ARCH_MEMMAP_INIT |
2586 | #define memmap_init(size, nid, zone, start_pfn) \ | 2598 | #define memmap_init(size, nid, zone, start_pfn) \ |
2587 | memmap_init_zone((size), (nid), (zone), (start_pfn), MEMMAP_EARLY) | 2599 | memmap_init_zone((size), (nid), (zone), (start_pfn), MEMMAP_EARLY) |
2588 | #endif | 2600 | #endif |
2589 | 2601 | ||
2590 | static int zone_batchsize(struct zone *zone) | 2602 | static int zone_batchsize(struct zone *zone) |
2591 | { | 2603 | { |
2592 | int batch; | 2604 | int batch; |
2593 | 2605 | ||
2594 | /* | 2606 | /* |
2595 | * The per-cpu-pages pools are set to around 1000th of the | 2607 | * The per-cpu-pages pools are set to around 1000th of the |
2596 | * size of the zone. But no more than 1/2 of a meg. | 2608 | * size of the zone. But no more than 1/2 of a meg. |
2597 | * | 2609 | * |
2598 | * OK, so we don't know how big the cache is. So guess. | 2610 | * OK, so we don't know how big the cache is. So guess. |
2599 | */ | 2611 | */ |
2600 | batch = zone->present_pages / 1024; | 2612 | batch = zone->present_pages / 1024; |
2601 | if (batch * PAGE_SIZE > 512 * 1024) | 2613 | if (batch * PAGE_SIZE > 512 * 1024) |
2602 | batch = (512 * 1024) / PAGE_SIZE; | 2614 | batch = (512 * 1024) / PAGE_SIZE; |
2603 | batch /= 4; /* We effectively *= 4 below */ | 2615 | batch /= 4; /* We effectively *= 4 below */ |
2604 | if (batch < 1) | 2616 | if (batch < 1) |
2605 | batch = 1; | 2617 | batch = 1; |
2606 | 2618 | ||
2607 | /* | 2619 | /* |
2608 | * Clamp the batch to a 2^n - 1 value. Having a power | 2620 | * Clamp the batch to a 2^n - 1 value. Having a power |
2609 | * of 2 value was found to be more likely to have | 2621 | * of 2 value was found to be more likely to have |
2610 | * suboptimal cache aliasing properties in some cases. | 2622 | * suboptimal cache aliasing properties in some cases. |
2611 | * | 2623 | * |
2612 | * For example if 2 tasks are alternately allocating | 2624 | * For example if 2 tasks are alternately allocating |
2613 | * batches of pages, one task can end up with a lot | 2625 | * batches of pages, one task can end up with a lot |
2614 | * of pages of one half of the possible page colors | 2626 | * of pages of one half of the possible page colors |
2615 | * and the other with pages of the other colors. | 2627 | * and the other with pages of the other colors. |
2616 | */ | 2628 | */ |
2617 | batch = (1 << (fls(batch + batch/2)-1)) - 1; | 2629 | batch = (1 << (fls(batch + batch/2)-1)) - 1; |
2618 | 2630 | ||
2619 | return batch; | 2631 | return batch; |
2620 | } | 2632 | } |
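To make the batch heuristic concrete (illustrative numbers, 4K pages assumed): a zone with 1,000,000 present pages gives 1000000 / 1024 = 976, which overshoots the 512K cap and is clamped to 128, then divided by 4 to give 32; the rounding step computes fls(32 + 16) = 6 and returns (1 << 5) - 1 = 31. With the 6x multiplier applied in setup_pageset() below, that corresponds to a per-cpu high watermark of 186 pages.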
2621 | 2633 | ||
2622 | inline void setup_pageset(struct per_cpu_pageset *p, unsigned long batch) | 2634 | inline void setup_pageset(struct per_cpu_pageset *p, unsigned long batch) |
2623 | { | 2635 | { |
2624 | struct per_cpu_pages *pcp; | 2636 | struct per_cpu_pages *pcp; |
2625 | 2637 | ||
2626 | memset(p, 0, sizeof(*p)); | 2638 | memset(p, 0, sizeof(*p)); |
2627 | 2639 | ||
2628 | pcp = &p->pcp; | 2640 | pcp = &p->pcp; |
2629 | pcp->count = 0; | 2641 | pcp->count = 0; |
2630 | pcp->high = 6 * batch; | 2642 | pcp->high = 6 * batch; |
2631 | pcp->batch = max(1UL, 1 * batch); | 2643 | pcp->batch = max(1UL, 1 * batch); |
2632 | INIT_LIST_HEAD(&pcp->list); | 2644 | INIT_LIST_HEAD(&pcp->list); |
2633 | } | 2645 | } |
2634 | 2646 | ||
2635 | /* | 2647 | /* |
2636 | * setup_pagelist_highmark() sets the high water mark for hot per_cpu_pagelist | 2648 | * setup_pagelist_highmark() sets the high water mark for hot per_cpu_pagelist |
2637 | * to the value high for the pageset p. | 2649 | * to the value high for the pageset p. |
2638 | */ | 2650 | */ |
2639 | 2651 | ||
2640 | static void setup_pagelist_highmark(struct per_cpu_pageset *p, | 2652 | static void setup_pagelist_highmark(struct per_cpu_pageset *p, |
2641 | unsigned long high) | 2653 | unsigned long high) |
2642 | { | 2654 | { |
2643 | struct per_cpu_pages *pcp; | 2655 | struct per_cpu_pages *pcp; |
2644 | 2656 | ||
2645 | pcp = &p->pcp; | 2657 | pcp = &p->pcp; |
2646 | pcp->high = high; | 2658 | pcp->high = high; |
2647 | pcp->batch = max(1UL, high/4); | 2659 | pcp->batch = max(1UL, high/4); |
2648 | if ((high/4) > (PAGE_SHIFT * 8)) | 2660 | if ((high/4) > (PAGE_SHIFT * 8)) |
2649 | pcp->batch = PAGE_SHIFT * 8; | 2661 | pcp->batch = PAGE_SHIFT * 8; |
2650 | } | 2662 | } |
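Continuing the example, if the percpu_pagelist_fraction tunable is set to 8 (see process_zones() below), the same 1,000,000-page zone gets high = 125000 and high/4 = 31250, which the check above caps to PAGE_SHIFT * 8 = 96 on a 4K-page system (PAGE_SHIFT = 12). Again, the numbers are only illustrative.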
2651 | 2663 | ||
2652 | 2664 | ||
2653 | #ifdef CONFIG_NUMA | 2665 | #ifdef CONFIG_NUMA |
2654 | /* | 2666 | /* |
2655 | * Boot pageset table. One per cpu which is going to be used for all | 2667 | * Boot pageset table. One per cpu which is going to be used for all |
2656 | * zones and all nodes. The parameters will be set in such a way | 2668 | * zones and all nodes. The parameters will be set in such a way |
2657 | * that an item put on a list will immediately be handed over to | 2669 | * that an item put on a list will immediately be handed over to |
2658 | * the buddy list. This is safe since pageset manipulation is done | 2670 | * the buddy list. This is safe since pageset manipulation is done |
2659 | * with interrupts disabled. | 2671 | * with interrupts disabled. |
2660 | * | 2672 | * |
2661 | * Some NUMA counter updates may also be caught by the boot pagesets. | 2673 | * Some NUMA counter updates may also be caught by the boot pagesets. |
2662 | * | 2674 | * |
2663 | * The boot_pagesets must be kept even after bootup is complete for | 2675 | * The boot_pagesets must be kept even after bootup is complete for |
2664 | * unused processors and/or zones. They do play a role for bootstrapping | 2676 | * unused processors and/or zones. They do play a role for bootstrapping |
2665 | * hotplugged processors. | 2677 | * hotplugged processors. |
2666 | * | 2678 | * |
2667 | * zoneinfo_show() and maybe other functions do | 2679 | * zoneinfo_show() and maybe other functions do |
2668 | * not check if the processor is online before following the pageset pointer. | 2680 | * not check if the processor is online before following the pageset pointer. |
2669 | * Other parts of the kernel may not check if the zone is available. | 2681 | * Other parts of the kernel may not check if the zone is available. |
2670 | */ | 2682 | */ |
2671 | static struct per_cpu_pageset boot_pageset[NR_CPUS]; | 2683 | static struct per_cpu_pageset boot_pageset[NR_CPUS]; |
2672 | 2684 | ||
2673 | /* | 2685 | /* |
2674 | * Dynamically allocate memory for the | 2686 | * Dynamically allocate memory for the |
2675 | * per cpu pageset array in struct zone. | 2687 | * per cpu pageset array in struct zone. |
2676 | */ | 2688 | */ |
2677 | static int __cpuinit process_zones(int cpu) | 2689 | static int __cpuinit process_zones(int cpu) |
2678 | { | 2690 | { |
2679 | struct zone *zone, *dzone; | 2691 | struct zone *zone, *dzone; |
2680 | int node = cpu_to_node(cpu); | 2692 | int node = cpu_to_node(cpu); |
2681 | 2693 | ||
2682 | node_set_state(node, N_CPU); /* this node has a cpu */ | 2694 | node_set_state(node, N_CPU); /* this node has a cpu */ |
2683 | 2695 | ||
2684 | for_each_zone(zone) { | 2696 | for_each_zone(zone) { |
2685 | 2697 | ||
2686 | if (!populated_zone(zone)) | 2698 | if (!populated_zone(zone)) |
2687 | continue; | 2699 | continue; |
2688 | 2700 | ||
2689 | zone_pcp(zone, cpu) = kmalloc_node(sizeof(struct per_cpu_pageset), | 2701 | zone_pcp(zone, cpu) = kmalloc_node(sizeof(struct per_cpu_pageset), |
2690 | GFP_KERNEL, node); | 2702 | GFP_KERNEL, node); |
2691 | if (!zone_pcp(zone, cpu)) | 2703 | if (!zone_pcp(zone, cpu)) |
2692 | goto bad; | 2704 | goto bad; |
2693 | 2705 | ||
2694 | setup_pageset(zone_pcp(zone, cpu), zone_batchsize(zone)); | 2706 | setup_pageset(zone_pcp(zone, cpu), zone_batchsize(zone)); |
2695 | 2707 | ||
2696 | if (percpu_pagelist_fraction) | 2708 | if (percpu_pagelist_fraction) |
2697 | setup_pagelist_highmark(zone_pcp(zone, cpu), | 2709 | setup_pagelist_highmark(zone_pcp(zone, cpu), |
2698 | (zone->present_pages / percpu_pagelist_fraction)); | 2710 | (zone->present_pages / percpu_pagelist_fraction)); |
2699 | } | 2711 | } |
2700 | 2712 | ||
2701 | return 0; | 2713 | return 0; |
2702 | bad: | 2714 | bad: |
2703 | for_each_zone(dzone) { | 2715 | for_each_zone(dzone) { |
2704 | if (!populated_zone(dzone)) | 2716 | if (!populated_zone(dzone)) |
2705 | continue; | 2717 | continue; |
2706 | if (dzone == zone) | 2718 | if (dzone == zone) |
2707 | break; | 2719 | break; |
2708 | kfree(zone_pcp(dzone, cpu)); | 2720 | kfree(zone_pcp(dzone, cpu)); |
2709 | zone_pcp(dzone, cpu) = NULL; | 2721 | zone_pcp(dzone, cpu) = NULL; |
2710 | } | 2722 | } |
2711 | return -ENOMEM; | 2723 | return -ENOMEM; |
2712 | } | 2724 | } |
2713 | 2725 | ||
2714 | static inline void free_zone_pagesets(int cpu) | 2726 | static inline void free_zone_pagesets(int cpu) |
2715 | { | 2727 | { |
2716 | struct zone *zone; | 2728 | struct zone *zone; |
2717 | 2729 | ||
2718 | for_each_zone(zone) { | 2730 | for_each_zone(zone) { |
2719 | struct per_cpu_pageset *pset = zone_pcp(zone, cpu); | 2731 | struct per_cpu_pageset *pset = zone_pcp(zone, cpu); |
2720 | 2732 | ||
2721 | /* Free per_cpu_pageset if it is slab allocated */ | 2733 | /* Free per_cpu_pageset if it is slab allocated */ |
2722 | if (pset != &boot_pageset[cpu]) | 2734 | if (pset != &boot_pageset[cpu]) |
2723 | kfree(pset); | 2735 | kfree(pset); |
2724 | zone_pcp(zone, cpu) = NULL; | 2736 | zone_pcp(zone, cpu) = NULL; |
2725 | } | 2737 | } |
2726 | } | 2738 | } |
2727 | 2739 | ||
2728 | static int __cpuinit pageset_cpuup_callback(struct notifier_block *nfb, | 2740 | static int __cpuinit pageset_cpuup_callback(struct notifier_block *nfb, |
2729 | unsigned long action, | 2741 | unsigned long action, |
2730 | void *hcpu) | 2742 | void *hcpu) |
2731 | { | 2743 | { |
2732 | int cpu = (long)hcpu; | 2744 | int cpu = (long)hcpu; |
2733 | int ret = NOTIFY_OK; | 2745 | int ret = NOTIFY_OK; |
2734 | 2746 | ||
2735 | switch (action) { | 2747 | switch (action) { |
2736 | case CPU_UP_PREPARE: | 2748 | case CPU_UP_PREPARE: |
2737 | case CPU_UP_PREPARE_FROZEN: | 2749 | case CPU_UP_PREPARE_FROZEN: |
2738 | if (process_zones(cpu)) | 2750 | if (process_zones(cpu)) |
2739 | ret = NOTIFY_BAD; | 2751 | ret = NOTIFY_BAD; |
2740 | break; | 2752 | break; |
2741 | case CPU_UP_CANCELED: | 2753 | case CPU_UP_CANCELED: |
2742 | case CPU_UP_CANCELED_FROZEN: | 2754 | case CPU_UP_CANCELED_FROZEN: |
2743 | case CPU_DEAD: | 2755 | case CPU_DEAD: |
2744 | case CPU_DEAD_FROZEN: | 2756 | case CPU_DEAD_FROZEN: |
2745 | free_zone_pagesets(cpu); | 2757 | free_zone_pagesets(cpu); |
2746 | break; | 2758 | break; |
2747 | default: | 2759 | default: |
2748 | break; | 2760 | break; |
2749 | } | 2761 | } |
2750 | return ret; | 2762 | return ret; |
2751 | } | 2763 | } |
2752 | 2764 | ||
2753 | static struct notifier_block __cpuinitdata pageset_notifier = | 2765 | static struct notifier_block __cpuinitdata pageset_notifier = |
2754 | { &pageset_cpuup_callback, NULL, 0 }; | 2766 | { &pageset_cpuup_callback, NULL, 0 }; |
2755 | 2767 | ||
2756 | void __init setup_per_cpu_pageset(void) | 2768 | void __init setup_per_cpu_pageset(void) |
2757 | { | 2769 | { |
2758 | int err; | 2770 | int err; |
2759 | 2771 | ||
2760 | /* Initialize per_cpu_pageset for cpu 0. | 2772 | /* Initialize per_cpu_pageset for cpu 0. |
2761 | * A cpuup callback will do this for every cpu | 2773 | * A cpuup callback will do this for every cpu |
2762 | * as it comes online | 2774 | * as it comes online |
2763 | */ | 2775 | */ |
2764 | err = process_zones(smp_processor_id()); | 2776 | err = process_zones(smp_processor_id()); |
2765 | BUG_ON(err); | 2777 | BUG_ON(err); |
2766 | register_cpu_notifier(&pageset_notifier); | 2778 | register_cpu_notifier(&pageset_notifier); |
2767 | } | 2779 | } |
2768 | 2780 | ||
2769 | #endif | 2781 | #endif |
2770 | 2782 | ||
2771 | static noinline __init_refok | 2783 | static noinline __init_refok |
2772 | int zone_wait_table_init(struct zone *zone, unsigned long zone_size_pages) | 2784 | int zone_wait_table_init(struct zone *zone, unsigned long zone_size_pages) |
2773 | { | 2785 | { |
2774 | int i; | 2786 | int i; |
2775 | struct pglist_data *pgdat = zone->zone_pgdat; | 2787 | struct pglist_data *pgdat = zone->zone_pgdat; |
2776 | size_t alloc_size; | 2788 | size_t alloc_size; |
2777 | 2789 | ||
2778 | /* | 2790 | /* |
2779 | * The per-page waitqueue mechanism uses hashed waitqueues | 2791 | * The per-page waitqueue mechanism uses hashed waitqueues |
2780 | * per zone. | 2792 | * per zone. |
2781 | */ | 2793 | */ |
2782 | zone->wait_table_hash_nr_entries = | 2794 | zone->wait_table_hash_nr_entries = |
2783 | wait_table_hash_nr_entries(zone_size_pages); | 2795 | wait_table_hash_nr_entries(zone_size_pages); |
2784 | zone->wait_table_bits = | 2796 | zone->wait_table_bits = |
2785 | wait_table_bits(zone->wait_table_hash_nr_entries); | 2797 | wait_table_bits(zone->wait_table_hash_nr_entries); |
2786 | alloc_size = zone->wait_table_hash_nr_entries | 2798 | alloc_size = zone->wait_table_hash_nr_entries |
2787 | * sizeof(wait_queue_head_t); | 2799 | * sizeof(wait_queue_head_t); |
2788 | 2800 | ||
2789 | if (system_state == SYSTEM_BOOTING) { | 2801 | if (system_state == SYSTEM_BOOTING) { |
2790 | zone->wait_table = (wait_queue_head_t *) | 2802 | zone->wait_table = (wait_queue_head_t *) |
2791 | alloc_bootmem_node(pgdat, alloc_size); | 2803 | alloc_bootmem_node(pgdat, alloc_size); |
2792 | } else { | 2804 | } else { |
2793 | /* | 2805 | /* |
2794 | * This case means that a zone whose size was 0 gets new memory | 2806 | * This case means that a zone whose size was 0 gets new memory |
2795 | * via memory hot-add. | 2807 | * via memory hot-add. |
2796 | * But it may be the case that a new node was hot-added. In | 2808 | * But it may be the case that a new node was hot-added. In |
2797 | * this case vmalloc() will not be able to use this new node's | 2809 | * this case vmalloc() will not be able to use this new node's |
2798 | * memory - this wait_table must be initialized to use this new | 2810 | * memory - this wait_table must be initialized to use this new |
2799 | * node itself as well. | 2811 | * node itself as well. |
2800 | * To use this new node's memory, further consideration will be | 2812 | * To use this new node's memory, further consideration will be |
2801 | * necessary. | 2813 | * necessary. |
2802 | */ | 2814 | */ |
2803 | zone->wait_table = vmalloc(alloc_size); | 2815 | zone->wait_table = vmalloc(alloc_size); |
2804 | } | 2816 | } |
2805 | if (!zone->wait_table) | 2817 | if (!zone->wait_table) |
2806 | return -ENOMEM; | 2818 | return -ENOMEM; |
2807 | 2819 | ||
2808 | for(i = 0; i < zone->wait_table_hash_nr_entries; ++i) | 2820 | for(i = 0; i < zone->wait_table_hash_nr_entries; ++i) |
2809 | init_waitqueue_head(zone->wait_table + i); | 2821 | init_waitqueue_head(zone->wait_table + i); |
2810 | 2822 | ||
2811 | return 0; | 2823 | return 0; |
2812 | } | 2824 | } |
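Tying this back to the sizing helpers earlier: a 1GB zone (262144 pages, 4K page size assumed) gets 1024 hash buckets, so alloc_size is 1024 * sizeof(wait_queue_head_t), roughly 16KB-24KB depending on architecture and preemption config per the table above. At boot that allocation comes from bootmem; for a zone populated by hot-add it comes from vmalloc().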
2813 | 2825 | ||
2814 | static __meminit void zone_pcp_init(struct zone *zone) | 2826 | static __meminit void zone_pcp_init(struct zone *zone) |
2815 | { | 2827 | { |
2816 | int cpu; | 2828 | int cpu; |
2817 | unsigned long batch = zone_batchsize(zone); | 2829 | unsigned long batch = zone_batchsize(zone); |
2818 | 2830 | ||
2819 | for (cpu = 0; cpu < NR_CPUS; cpu++) { | 2831 | for (cpu = 0; cpu < NR_CPUS; cpu++) { |
2820 | #ifdef CONFIG_NUMA | 2832 | #ifdef CONFIG_NUMA |
2821 | /* Early boot. Slab allocator not functional yet */ | 2833 | /* Early boot. Slab allocator not functional yet */ |
2822 | zone_pcp(zone, cpu) = &boot_pageset[cpu]; | 2834 | zone_pcp(zone, cpu) = &boot_pageset[cpu]; |
2823 | setup_pageset(&boot_pageset[cpu],0); | 2835 | setup_pageset(&boot_pageset[cpu],0); |
2824 | #else | 2836 | #else |
2825 | setup_pageset(zone_pcp(zone,cpu), batch); | 2837 | setup_pageset(zone_pcp(zone,cpu), batch); |
2826 | #endif | 2838 | #endif |
2827 | } | 2839 | } |
2828 | if (zone->present_pages) | 2840 | if (zone->present_pages) |
2829 | printk(KERN_DEBUG " %s zone: %lu pages, LIFO batch:%lu\n", | 2841 | printk(KERN_DEBUG " %s zone: %lu pages, LIFO batch:%lu\n", |
2830 | zone->name, zone->present_pages, batch); | 2842 | zone->name, zone->present_pages, batch); |
2831 | } | 2843 | } |
2832 | 2844 | ||
2833 | __meminit int init_currently_empty_zone(struct zone *zone, | 2845 | __meminit int init_currently_empty_zone(struct zone *zone, |
2834 | unsigned long zone_start_pfn, | 2846 | unsigned long zone_start_pfn, |
2835 | unsigned long size, | 2847 | unsigned long size, |
2836 | enum memmap_context context) | 2848 | enum memmap_context context) |
2837 | { | 2849 | { |
2838 | struct pglist_data *pgdat = zone->zone_pgdat; | 2850 | struct pglist_data *pgdat = zone->zone_pgdat; |
2839 | int ret; | 2851 | int ret; |
2840 | ret = zone_wait_table_init(zone, size); | 2852 | ret = zone_wait_table_init(zone, size); |
2841 | if (ret) | 2853 | if (ret) |
2842 | return ret; | 2854 | return ret; |
2843 | pgdat->nr_zones = zone_idx(zone) + 1; | 2855 | pgdat->nr_zones = zone_idx(zone) + 1; |
2844 | 2856 | ||
2845 | zone->zone_start_pfn = zone_start_pfn; | 2857 | zone->zone_start_pfn = zone_start_pfn; |
2846 | 2858 | ||
2847 | memmap_init(size, pgdat->node_id, zone_idx(zone), zone_start_pfn); | 2859 | memmap_init(size, pgdat->node_id, zone_idx(zone), zone_start_pfn); |
2848 | 2860 | ||
2849 | zone_init_free_lists(zone); | 2861 | zone_init_free_lists(zone); |
2850 | 2862 | ||
2851 | return 0; | 2863 | return 0; |
2852 | } | 2864 | } |
2853 | 2865 | ||
2854 | #ifdef CONFIG_ARCH_POPULATES_NODE_MAP | 2866 | #ifdef CONFIG_ARCH_POPULATES_NODE_MAP |
2855 | /* | 2867 | /* |
2856 | * Basic iterator support. Return the first range of PFNs for a node | 2868 | * Basic iterator support. Return the first range of PFNs for a node |
2857 | * Note: nid == MAX_NUMNODES returns first region regardless of node | 2869 | * Note: nid == MAX_NUMNODES returns first region regardless of node |
2858 | */ | 2870 | */ |
2859 | static int __meminit first_active_region_index_in_nid(int nid) | 2871 | static int __meminit first_active_region_index_in_nid(int nid) |
2860 | { | 2872 | { |
2861 | int i; | 2873 | int i; |
2862 | 2874 | ||
2863 | for (i = 0; i < nr_nodemap_entries; i++) | 2875 | for (i = 0; i < nr_nodemap_entries; i++) |
2864 | if (nid == MAX_NUMNODES || early_node_map[i].nid == nid) | 2876 | if (nid == MAX_NUMNODES || early_node_map[i].nid == nid) |
2865 | return i; | 2877 | return i; |
2866 | 2878 | ||
2867 | return -1; | 2879 | return -1; |
2868 | } | 2880 | } |
2869 | 2881 | ||
2870 | /* | 2882 | /* |
2871 | * Basic iterator support. Return the next active range of PFNs for a node | 2883 | * Basic iterator support. Return the next active range of PFNs for a node |
2872 | * Note: nid == MAX_NUMNODES returns next region regardless of node | 2884 | * Note: nid == MAX_NUMNODES returns next region regardless of node |
2873 | */ | 2885 | */ |
2874 | static int __meminit next_active_region_index_in_nid(int index, int nid) | 2886 | static int __meminit next_active_region_index_in_nid(int index, int nid) |
2875 | { | 2887 | { |
2876 | for (index = index + 1; index < nr_nodemap_entries; index++) | 2888 | for (index = index + 1; index < nr_nodemap_entries; index++) |
2877 | if (nid == MAX_NUMNODES || early_node_map[index].nid == nid) | 2889 | if (nid == MAX_NUMNODES || early_node_map[index].nid == nid) |
2878 | return index; | 2890 | return index; |
2879 | 2891 | ||
2880 | return -1; | 2892 | return -1; |
2881 | } | 2893 | } |
2882 | 2894 | ||
2883 | #ifndef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID | 2895 | #ifndef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID |
2884 | /* | 2896 | /* |
2885 | * Required by SPARSEMEM. Given a PFN, return what node the PFN is on. | 2897 | * Required by SPARSEMEM. Given a PFN, return what node the PFN is on. |
2886 | * Architectures may implement their own version but if add_active_range() | 2898 | * Architectures may implement their own version but if add_active_range() |
2887 | * was used and there are no special requirements, this is a convenient | 2899 | * was used and there are no special requirements, this is a convenient |
2888 | * alternative | 2900 | * alternative |
2889 | */ | 2901 | */ |
2890 | int __meminit early_pfn_to_nid(unsigned long pfn) | 2902 | int __meminit early_pfn_to_nid(unsigned long pfn) |
2891 | { | 2903 | { |
2892 | int i; | 2904 | int i; |
2893 | 2905 | ||
2894 | for (i = 0; i < nr_nodemap_entries; i++) { | 2906 | for (i = 0; i < nr_nodemap_entries; i++) { |
2895 | unsigned long start_pfn = early_node_map[i].start_pfn; | 2907 | unsigned long start_pfn = early_node_map[i].start_pfn; |
2896 | unsigned long end_pfn = early_node_map[i].end_pfn; | 2908 | unsigned long end_pfn = early_node_map[i].end_pfn; |
2897 | 2909 | ||
2898 | if (start_pfn <= pfn && pfn < end_pfn) | 2910 | if (start_pfn <= pfn && pfn < end_pfn) |
2899 | return early_node_map[i].nid; | 2911 | return early_node_map[i].nid; |
2900 | } | 2912 | } |
2901 | 2913 | ||
2902 | return 0; | 2914 | return 0; |
2903 | } | 2915 | } |
2904 | #endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */ | 2916 | #endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */ |
2905 | 2917 | ||
2906 | /* Basic iterator support to walk early_node_map[] */ | 2918 | /* Basic iterator support to walk early_node_map[] */ |
2907 | #define for_each_active_range_index_in_nid(i, nid) \ | 2919 | #define for_each_active_range_index_in_nid(i, nid) \ |
2908 | for (i = first_active_region_index_in_nid(nid); i != -1; \ | 2920 | for (i = first_active_region_index_in_nid(nid); i != -1; \ |
2909 | i = next_active_region_index_in_nid(i, nid)) | 2921 | i = next_active_region_index_in_nid(i, nid)) |
2910 | 2922 | ||
2911 | /** | 2923 | /** |
2912 | * free_bootmem_with_active_regions - Call free_bootmem_node for each active range | 2924 | * free_bootmem_with_active_regions - Call free_bootmem_node for each active range |
2913 | * @nid: The node to free memory on. If MAX_NUMNODES, all nodes are freed. | 2925 | * @nid: The node to free memory on. If MAX_NUMNODES, all nodes are freed. |
2914 | * @max_low_pfn: The highest PFN that will be passed to free_bootmem_node | 2926 | * @max_low_pfn: The highest PFN that will be passed to free_bootmem_node |
2915 | * | 2927 | * |
2916 | * If an architecture guarantees that all ranges registered with | 2928 | * If an architecture guarantees that all ranges registered with |
2917 | * add_active_ranges() contain no holes and may be freed, this | 2929 | * add_active_ranges() contain no holes and may be freed, this |
2918 | * function may be used instead of calling free_bootmem() manually. | 2930 | * function may be used instead of calling free_bootmem() manually. |
2919 | */ | 2931 | */ |
2920 | void __init free_bootmem_with_active_regions(int nid, | 2932 | void __init free_bootmem_with_active_regions(int nid, |
2921 | unsigned long max_low_pfn) | 2933 | unsigned long max_low_pfn) |
2922 | { | 2934 | { |
2923 | int i; | 2935 | int i; |
2924 | 2936 | ||
2925 | for_each_active_range_index_in_nid(i, nid) { | 2937 | for_each_active_range_index_in_nid(i, nid) { |
2926 | unsigned long size_pages = 0; | 2938 | unsigned long size_pages = 0; |
2927 | unsigned long end_pfn = early_node_map[i].end_pfn; | 2939 | unsigned long end_pfn = early_node_map[i].end_pfn; |
2928 | 2940 | ||
2929 | if (early_node_map[i].start_pfn >= max_low_pfn) | 2941 | if (early_node_map[i].start_pfn >= max_low_pfn) |
2930 | continue; | 2942 | continue; |
2931 | 2943 | ||
2932 | if (end_pfn > max_low_pfn) | 2944 | if (end_pfn > max_low_pfn) |
2933 | end_pfn = max_low_pfn; | 2945 | end_pfn = max_low_pfn; |
2934 | 2946 | ||
2935 | size_pages = end_pfn - early_node_map[i].start_pfn; | 2947 | size_pages = end_pfn - early_node_map[i].start_pfn; |
2936 | free_bootmem_node(NODE_DATA(early_node_map[i].nid), | 2948 | free_bootmem_node(NODE_DATA(early_node_map[i].nid), |
2937 | PFN_PHYS(early_node_map[i].start_pfn), | 2949 | PFN_PHYS(early_node_map[i].start_pfn), |
2938 | size_pages << PAGE_SHIFT); | 2950 | size_pages << PAGE_SHIFT); |
2939 | } | 2951 | } |
2940 | } | 2952 | } |
2941 | 2953 | ||
2942 | /** | 2954 | /** |
2943 | * sparse_memory_present_with_active_regions - Call memory_present for each active range | 2955 | * sparse_memory_present_with_active_regions - Call memory_present for each active range |
2944 | * @nid: The node to call memory_present for. If MAX_NUMNODES, all nodes will be used. | 2956 | * @nid: The node to call memory_present for. If MAX_NUMNODES, all nodes will be used. |
2945 | * | 2957 | * |
2946 | * If an architecture guarantees that all ranges registered with | 2958 | * If an architecture guarantees that all ranges registered with |
2947 | * add_active_ranges() contain no holes and may be freed, this | 2959 | * add_active_ranges() contain no holes and may be freed, this |
2948 | * function may be used instead of calling memory_present() manually. | 2960 | * function may be used instead of calling memory_present() manually. |
2949 | */ | 2961 | */ |
2950 | void __init sparse_memory_present_with_active_regions(int nid) | 2962 | void __init sparse_memory_present_with_active_regions(int nid) |
2951 | { | 2963 | { |
2952 | int i; | 2964 | int i; |
2953 | 2965 | ||
2954 | for_each_active_range_index_in_nid(i, nid) | 2966 | for_each_active_range_index_in_nid(i, nid) |
2955 | memory_present(early_node_map[i].nid, | 2967 | memory_present(early_node_map[i].nid, |
2956 | early_node_map[i].start_pfn, | 2968 | early_node_map[i].start_pfn, |
2957 | early_node_map[i].end_pfn); | 2969 | early_node_map[i].end_pfn); |
2958 | } | 2970 | } |
2959 | 2971 | ||
2960 | /** | 2972 | /** |
2961 | * push_node_boundaries - Push node boundaries to at least the requested boundary | 2973 | * push_node_boundaries - Push node boundaries to at least the requested boundary |
2962 | * @nid: The nid of the node to push the boundary for | 2974 | * @nid: The nid of the node to push the boundary for |
2963 | * @start_pfn: The start pfn of the node | 2975 | * @start_pfn: The start pfn of the node |
2964 | * @end_pfn: The end pfn of the node | 2976 | * @end_pfn: The end pfn of the node |
2965 | * | 2977 | * |
2966 | * In reserve-based hot-add, mem_map is allocated that is unused until hotadd | 2978 | * In reserve-based hot-add, mem_map is allocated that is unused until hotadd |
2967 | * time. Specifically, on x86_64, SRAT will report ranges that can potentially | 2979 | * time. Specifically, on x86_64, SRAT will report ranges that can potentially |
2968 | * be hotplugged even though no physical memory exists. This function allows | 2980 | * be hotplugged even though no physical memory exists. This function allows |
2969 | * an arch to push out the node boundaries so mem_map is allocated that can | 2981 | * an arch to push out the node boundaries so mem_map is allocated that can |
2970 | * be used later. | 2982 | * be used later. |
2971 | */ | 2983 | */ |
2972 | #ifdef CONFIG_MEMORY_HOTPLUG_RESERVE | 2984 | #ifdef CONFIG_MEMORY_HOTPLUG_RESERVE |
2973 | void __init push_node_boundaries(unsigned int nid, | 2985 | void __init push_node_boundaries(unsigned int nid, |
2974 | unsigned long start_pfn, unsigned long end_pfn) | 2986 | unsigned long start_pfn, unsigned long end_pfn) |
2975 | { | 2987 | { |
2976 | printk(KERN_DEBUG "Entering push_node_boundaries(%u, %lu, %lu)\n", | 2988 | printk(KERN_DEBUG "Entering push_node_boundaries(%u, %lu, %lu)\n", |
2977 | nid, start_pfn, end_pfn); | 2989 | nid, start_pfn, end_pfn); |
2978 | 2990 | ||
2979 | /* Initialise the boundary for this node if necessary */ | 2991 | /* Initialise the boundary for this node if necessary */ |
2980 | if (node_boundary_end_pfn[nid] == 0) | 2992 | if (node_boundary_end_pfn[nid] == 0) |
2981 | node_boundary_start_pfn[nid] = -1UL; | 2993 | node_boundary_start_pfn[nid] = -1UL; |
2982 | 2994 | ||
2983 | /* Update the boundaries */ | 2995 | /* Update the boundaries */ |
2984 | if (node_boundary_start_pfn[nid] > start_pfn) | 2996 | if (node_boundary_start_pfn[nid] > start_pfn) |
2985 | node_boundary_start_pfn[nid] = start_pfn; | 2997 | node_boundary_start_pfn[nid] = start_pfn; |
2986 | if (node_boundary_end_pfn[nid] < end_pfn) | 2998 | if (node_boundary_end_pfn[nid] < end_pfn) |
2987 | node_boundary_end_pfn[nid] = end_pfn; | 2999 | node_boundary_end_pfn[nid] = end_pfn; |
2988 | } | 3000 | } |
2989 | 3001 | ||
2990 | /* If necessary, push the node boundary out for reserve hotadd */ | 3002 | /* If necessary, push the node boundary out for reserve hotadd */ |
2991 | static void __meminit account_node_boundary(unsigned int nid, | 3003 | static void __meminit account_node_boundary(unsigned int nid, |
2992 | unsigned long *start_pfn, unsigned long *end_pfn) | 3004 | unsigned long *start_pfn, unsigned long *end_pfn) |
2993 | { | 3005 | { |
2994 | printk(KERN_DEBUG "Entering account_node_boundary(%u, %lu, %lu)\n", | 3006 | printk(KERN_DEBUG "Entering account_node_boundary(%u, %lu, %lu)\n", |
2995 | nid, *start_pfn, *end_pfn); | 3007 | nid, *start_pfn, *end_pfn); |
2996 | 3008 | ||
2997 | /* Return if boundary information has not been provided */ | 3009 | /* Return if boundary information has not been provided */ |
2998 | if (node_boundary_end_pfn[nid] == 0) | 3010 | if (node_boundary_end_pfn[nid] == 0) |
2999 | return; | 3011 | return; |
3000 | 3012 | ||
3001 | /* Check the boundaries and update if necessary */ | 3013 | /* Check the boundaries and update if necessary */ |
3002 | if (node_boundary_start_pfn[nid] < *start_pfn) | 3014 | if (node_boundary_start_pfn[nid] < *start_pfn) |
3003 | *start_pfn = node_boundary_start_pfn[nid]; | 3015 | *start_pfn = node_boundary_start_pfn[nid]; |
3004 | if (node_boundary_end_pfn[nid] > *end_pfn) | 3016 | if (node_boundary_end_pfn[nid] > *end_pfn) |
3005 | *end_pfn = node_boundary_end_pfn[nid]; | 3017 | *end_pfn = node_boundary_end_pfn[nid]; |
3006 | } | 3018 | } |
3007 | #else | 3019 | #else |
3008 | void __init push_node_boundaries(unsigned int nid, | 3020 | void __init push_node_boundaries(unsigned int nid, |
3009 | unsigned long start_pfn, unsigned long end_pfn) {} | 3021 | unsigned long start_pfn, unsigned long end_pfn) {} |
3010 | 3022 | ||
3011 | static void __meminit account_node_boundary(unsigned int nid, | 3023 | static void __meminit account_node_boundary(unsigned int nid, |
3012 | unsigned long *start_pfn, unsigned long *end_pfn) {} | 3024 | unsigned long *start_pfn, unsigned long *end_pfn) {} |
3013 | #endif | 3025 | #endif |
3014 | 3026 | ||
3015 | 3027 | ||
3016 | /** | 3028 | /** |
3017 | * get_pfn_range_for_nid - Return the start and end page frames for a node | 3029 | * get_pfn_range_for_nid - Return the start and end page frames for a node |
3018 | * @nid: The nid to return the range for. If MAX_NUMNODES, the min and max PFN are returned. | 3030 | * @nid: The nid to return the range for. If MAX_NUMNODES, the min and max PFN are returned. |
3019 | * @start_pfn: Passed by reference. On return, it will have the node start_pfn. | 3031 | * @start_pfn: Passed by reference. On return, it will have the node start_pfn. |
3020 | * @end_pfn: Passed by reference. On return, it will have the node end_pfn. | 3032 | * @end_pfn: Passed by reference. On return, it will have the node end_pfn. |
3021 | * | 3033 | * |
3022 | * It returns the start and end page frame of a node based on information | 3034 | * It returns the start and end page frame of a node based on information |
3023 | * provided by an arch calling add_active_range(). If called for a node | 3035 | * provided by an arch calling add_active_range(). If called for a node |
3024 | * with no available memory, a warning is printed and the start and end | 3036 | * with no available memory, a warning is printed and the start and end |
3025 | * PFNs will be 0. | 3037 | * PFNs will be 0. |
3026 | */ | 3038 | */ |
3027 | void __meminit get_pfn_range_for_nid(unsigned int nid, | 3039 | void __meminit get_pfn_range_for_nid(unsigned int nid, |
3028 | unsigned long *start_pfn, unsigned long *end_pfn) | 3040 | unsigned long *start_pfn, unsigned long *end_pfn) |
3029 | { | 3041 | { |
3030 | int i; | 3042 | int i; |
3031 | *start_pfn = -1UL; | 3043 | *start_pfn = -1UL; |
3032 | *end_pfn = 0; | 3044 | *end_pfn = 0; |
3033 | 3045 | ||
3034 | for_each_active_range_index_in_nid(i, nid) { | 3046 | for_each_active_range_index_in_nid(i, nid) { |
3035 | *start_pfn = min(*start_pfn, early_node_map[i].start_pfn); | 3047 | *start_pfn = min(*start_pfn, early_node_map[i].start_pfn); |
3036 | *end_pfn = max(*end_pfn, early_node_map[i].end_pfn); | 3048 | *end_pfn = max(*end_pfn, early_node_map[i].end_pfn); |
3037 | } | 3049 | } |
3038 | 3050 | ||
3039 | if (*start_pfn == -1UL) | 3051 | if (*start_pfn == -1UL) |
3040 | *start_pfn = 0; | 3052 | *start_pfn = 0; |
3041 | 3053 | ||
3042 | /* Push the node boundaries out if requested */ | 3054 | /* Push the node boundaries out if requested */ |
3043 | account_node_boundary(nid, start_pfn, end_pfn); | 3055 | account_node_boundary(nid, start_pfn, end_pfn); |
3044 | } | 3056 | } |
3045 | 3057 | ||
3046 | /* | 3058 | /* |
3047 | * This finds a zone that can be used for ZONE_MOVABLE pages. The | 3059 | * This finds a zone that can be used for ZONE_MOVABLE pages. The |
3048 | * assumption is made that zones within a node are ordered in monotonically | 3060 | * assumption is made that zones within a node are ordered in monotonically |
3049 | * increasing memory addresses so that the "highest" populated zone is used | 3061 | * increasing memory addresses so that the "highest" populated zone is used |
3050 | */ | 3062 | */ |
3051 | void __init find_usable_zone_for_movable(void) | 3063 | void __init find_usable_zone_for_movable(void) |
3052 | { | 3064 | { |
3053 | int zone_index; | 3065 | int zone_index; |
3054 | for (zone_index = MAX_NR_ZONES - 1; zone_index >= 0; zone_index--) { | 3066 | for (zone_index = MAX_NR_ZONES - 1; zone_index >= 0; zone_index--) { |
3055 | if (zone_index == ZONE_MOVABLE) | 3067 | if (zone_index == ZONE_MOVABLE) |
3056 | continue; | 3068 | continue; |
3057 | 3069 | ||
3058 | if (arch_zone_highest_possible_pfn[zone_index] > | 3070 | if (arch_zone_highest_possible_pfn[zone_index] > |
3059 | arch_zone_lowest_possible_pfn[zone_index]) | 3071 | arch_zone_lowest_possible_pfn[zone_index]) |
3060 | break; | 3072 | break; |
3061 | } | 3073 | } |
3062 | 3074 | ||
3063 | VM_BUG_ON(zone_index == -1); | 3075 | VM_BUG_ON(zone_index == -1); |
3064 | movable_zone = zone_index; | 3076 | movable_zone = zone_index; |
3065 | } | 3077 | } |
3066 | 3078 | ||
3067 | /* | 3079 | /* |
3068 | * The zone ranges provided by the architecture do not include ZONE_MOVABLE | 3080 | * The zone ranges provided by the architecture do not include ZONE_MOVABLE |
3069 | * because it is sized independent of architecture. Unlike the other zones, | 3081 | * because it is sized independent of architecture. Unlike the other zones, |
3070 | * the starting point for ZONE_MOVABLE is not fixed. It may be different | 3082 | * the starting point for ZONE_MOVABLE is not fixed. It may be different |
3071 | * in each node depending on the size of each node and how evenly kernelcore | 3083 | * in each node depending on the size of each node and how evenly kernelcore |
3072 | * is distributed. This helper function adjusts the zone ranges | 3084 | * is distributed. This helper function adjusts the zone ranges |
3073 | * provided by the architecture for a given node by using the end of the | 3085 | * provided by the architecture for a given node by using the end of the |
3074 | * highest usable zone for ZONE_MOVABLE. This preserves the assumption that | 3086 | * highest usable zone for ZONE_MOVABLE. This preserves the assumption that |
3075 | * zones within a node are in order of monotonically increasing memory addresses | 3087 | * zones within a node are in order of monotonically increasing memory addresses |
3076 | */ | 3088 | */ |
3077 | void __meminit adjust_zone_range_for_zone_movable(int nid, | 3089 | void __meminit adjust_zone_range_for_zone_movable(int nid, |
3078 | unsigned long zone_type, | 3090 | unsigned long zone_type, |
3079 | unsigned long node_start_pfn, | 3091 | unsigned long node_start_pfn, |
3080 | unsigned long node_end_pfn, | 3092 | unsigned long node_end_pfn, |
3081 | unsigned long *zone_start_pfn, | 3093 | unsigned long *zone_start_pfn, |
3082 | unsigned long *zone_end_pfn) | 3094 | unsigned long *zone_end_pfn) |
3083 | { | 3095 | { |
3084 | /* Only adjust if ZONE_MOVABLE is on this node */ | 3096 | /* Only adjust if ZONE_MOVABLE is on this node */ |
3085 | if (zone_movable_pfn[nid]) { | 3097 | if (zone_movable_pfn[nid]) { |
3086 | /* Size ZONE_MOVABLE */ | 3098 | /* Size ZONE_MOVABLE */ |
3087 | if (zone_type == ZONE_MOVABLE) { | 3099 | if (zone_type == ZONE_MOVABLE) { |
3088 | *zone_start_pfn = zone_movable_pfn[nid]; | 3100 | *zone_start_pfn = zone_movable_pfn[nid]; |
3089 | *zone_end_pfn = min(node_end_pfn, | 3101 | *zone_end_pfn = min(node_end_pfn, |
3090 | arch_zone_highest_possible_pfn[movable_zone]); | 3102 | arch_zone_highest_possible_pfn[movable_zone]); |
3091 | 3103 | ||
3092 | /* Adjust for ZONE_MOVABLE starting within this range */ | 3104 | /* Adjust for ZONE_MOVABLE starting within this range */ |
3093 | } else if (*zone_start_pfn < zone_movable_pfn[nid] && | 3105 | } else if (*zone_start_pfn < zone_movable_pfn[nid] && |
3094 | *zone_end_pfn > zone_movable_pfn[nid]) { | 3106 | *zone_end_pfn > zone_movable_pfn[nid]) { |
3095 | *zone_end_pfn = zone_movable_pfn[nid]; | 3107 | *zone_end_pfn = zone_movable_pfn[nid]; |
3096 | 3108 | ||
3097 | /* Check if this whole range is within ZONE_MOVABLE */ | 3109 | /* Check if this whole range is within ZONE_MOVABLE */ |
3098 | } else if (*zone_start_pfn >= zone_movable_pfn[nid]) | 3110 | } else if (*zone_start_pfn >= zone_movable_pfn[nid]) |
3099 | *zone_start_pfn = *zone_end_pfn; | 3111 | *zone_start_pfn = *zone_end_pfn; |
3100 | } | 3112 | } |
3101 | } | 3113 | } |
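A worked example of the three cases (PFNs invented for illustration): suppose a node spans [0, 1048576) and zone_movable_pfn[nid] = 786432. ZONE_MOVABLE itself is sized to [786432, 1048576), assuming the highest usable zone reaches the node's end. A kernel zone originally spanning [262144, 1048576) straddles the boundary, so its end is pulled back to 786432. A zone whose start is at or above 786432 collapses to an empty range, since all of its pages now belong to ZONE_MOVABLE.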
3102 | 3114 | ||
3103 | /* | 3115 | /* |
3104 | * Return the number of pages a zone spans in a node, including holes | 3116 | * Return the number of pages a zone spans in a node, including holes |
3105 | * present_pages = zone_spanned_pages_in_node() - zone_absent_pages_in_node() | 3117 | * present_pages = zone_spanned_pages_in_node() - zone_absent_pages_in_node() |
3106 | */ | 3118 | */ |
3107 | static unsigned long __meminit zone_spanned_pages_in_node(int nid, | 3119 | static unsigned long __meminit zone_spanned_pages_in_node(int nid, |
3108 | unsigned long zone_type, | 3120 | unsigned long zone_type, |
3109 | unsigned long *ignored) | 3121 | unsigned long *ignored) |
3110 | { | 3122 | { |
3111 | unsigned long node_start_pfn, node_end_pfn; | 3123 | unsigned long node_start_pfn, node_end_pfn; |
3112 | unsigned long zone_start_pfn, zone_end_pfn; | 3124 | unsigned long zone_start_pfn, zone_end_pfn; |
3113 | 3125 | ||
3114 | /* Get the start and end of the node and zone */ | 3126 | /* Get the start and end of the node and zone */ |
3115 | get_pfn_range_for_nid(nid, &node_start_pfn, &node_end_pfn); | 3127 | get_pfn_range_for_nid(nid, &node_start_pfn, &node_end_pfn); |
3116 | zone_start_pfn = arch_zone_lowest_possible_pfn[zone_type]; | 3128 | zone_start_pfn = arch_zone_lowest_possible_pfn[zone_type]; |
3117 | zone_end_pfn = arch_zone_highest_possible_pfn[zone_type]; | 3129 | zone_end_pfn = arch_zone_highest_possible_pfn[zone_type]; |
3118 | adjust_zone_range_for_zone_movable(nid, zone_type, | 3130 | adjust_zone_range_for_zone_movable(nid, zone_type, |
3119 | node_start_pfn, node_end_pfn, | 3131 | node_start_pfn, node_end_pfn, |
3120 | &zone_start_pfn, &zone_end_pfn); | 3132 | &zone_start_pfn, &zone_end_pfn); |
3121 | 3133 | ||
3122 | /* Check that this node has pages within the zone's required range */ | 3134 | /* Check that this node has pages within the zone's required range */ |
3123 | if (zone_end_pfn < node_start_pfn || zone_start_pfn > node_end_pfn) | 3135 | if (zone_end_pfn < node_start_pfn || zone_start_pfn > node_end_pfn) |
3124 | return 0; | 3136 | return 0; |
3125 | 3137 | ||
3126 | /* Move the zone boundaries inside the node if necessary */ | 3138 | /* Move the zone boundaries inside the node if necessary */ |
3127 | zone_end_pfn = min(zone_end_pfn, node_end_pfn); | 3139 | zone_end_pfn = min(zone_end_pfn, node_end_pfn); |
3128 | zone_start_pfn = max(zone_start_pfn, node_start_pfn); | 3140 | zone_start_pfn = max(zone_start_pfn, node_start_pfn); |
3129 | 3141 | ||
3130 | /* Return the spanned pages */ | 3142 | /* Return the spanned pages */ |
3131 | return zone_end_pfn - zone_start_pfn; | 3143 | return zone_end_pfn - zone_start_pfn; |
3132 | } | 3144 | } |
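For instance (illustrative numbers, ignoring the ZONE_MOVABLE adjustment): if the architecture reports ZONE_DMA as PFNs [0, 4096) and a node spans [1024, 200000), the intersection is [1024, 4096), so the zone spans 3072 pages on that node. A node starting at PFN 500000 fails the range check and reports 0 spanned pages.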
3133 | 3145 | ||
3134 | /* | 3146 | /* |
3135 | * Return the number of holes in a range on a node. If nid is MAX_NUMNODES, | 3147 | * Return the number of holes in a range on a node. If nid is MAX_NUMNODES, |
3136 | * then all holes in the requested range will be accounted for. | 3148 | * then all holes in the requested range will be accounted for. |
3137 | */ | 3149 | */ |
3138 | unsigned long __meminit __absent_pages_in_range(int nid, | 3150 | unsigned long __meminit __absent_pages_in_range(int nid, |
3139 | unsigned long range_start_pfn, | 3151 | unsigned long range_start_pfn, |
3140 | unsigned long range_end_pfn) | 3152 | unsigned long range_end_pfn) |
3141 | { | 3153 | { |
3142 | int i = 0; | 3154 | int i = 0; |
3143 | unsigned long prev_end_pfn = 0, hole_pages = 0; | 3155 | unsigned long prev_end_pfn = 0, hole_pages = 0; |
3144 | unsigned long start_pfn; | 3156 | unsigned long start_pfn; |
3145 | 3157 | ||
3146 | /* Find the end_pfn of the first active range of pfns in the node */ | 3158 | /* Find the end_pfn of the first active range of pfns in the node */ |
3147 | i = first_active_region_index_in_nid(nid); | 3159 | i = first_active_region_index_in_nid(nid); |
3148 | if (i == -1) | 3160 | if (i == -1) |
3149 | return 0; | 3161 | return 0; |
3150 | 3162 | ||
3151 | prev_end_pfn = min(early_node_map[i].start_pfn, range_end_pfn); | 3163 | prev_end_pfn = min(early_node_map[i].start_pfn, range_end_pfn); |
3152 | 3164 | ||
3153 | /* Account for ranges before physical memory on this node */ | 3165 | /* Account for ranges before physical memory on this node */ |
3154 | if (early_node_map[i].start_pfn > range_start_pfn) | 3166 | if (early_node_map[i].start_pfn > range_start_pfn) |
3155 | hole_pages = prev_end_pfn - range_start_pfn; | 3167 | hole_pages = prev_end_pfn - range_start_pfn; |
3156 | 3168 | ||
3157 | /* Find all holes for the zone within the node */ | 3169 | /* Find all holes for the zone within the node */ |
3158 | for (; i != -1; i = next_active_region_index_in_nid(i, nid)) { | 3170 | for (; i != -1; i = next_active_region_index_in_nid(i, nid)) { |
3159 | 3171 | ||
3160 | /* No need to continue if prev_end_pfn is outside the zone */ | 3172 | /* No need to continue if prev_end_pfn is outside the zone */ |
3161 | if (prev_end_pfn >= range_end_pfn) | 3173 | if (prev_end_pfn >= range_end_pfn) |
3162 | break; | 3174 | break; |
3163 | 3175 | ||
3164 | /* Make sure the end of the zone is not within the hole */ | 3176 | /* Make sure the end of the zone is not within the hole */ |
3165 | start_pfn = min(early_node_map[i].start_pfn, range_end_pfn); | 3177 | start_pfn = min(early_node_map[i].start_pfn, range_end_pfn); |
3166 | prev_end_pfn = max(prev_end_pfn, range_start_pfn); | 3178 | prev_end_pfn = max(prev_end_pfn, range_start_pfn); |
3167 | 3179 | ||
3168 | /* Update the hole size count and move on */ | 3180 | /* Update the hole size count and move on */ |
3169 | if (start_pfn > range_start_pfn) { | 3181 | if (start_pfn > range_start_pfn) { |
3170 | BUG_ON(prev_end_pfn > start_pfn); | 3182 | BUG_ON(prev_end_pfn > start_pfn); |
3171 | hole_pages += start_pfn - prev_end_pfn; | 3183 | hole_pages += start_pfn - prev_end_pfn; |
3172 | } | 3184 | } |
3173 | prev_end_pfn = early_node_map[i].end_pfn; | 3185 | prev_end_pfn = early_node_map[i].end_pfn; |
3174 | } | 3186 | } |
3175 | 3187 | ||
3176 | /* Account for ranges past physical memory on this node */ | 3188 | /* Account for ranges past physical memory on this node */ |
3177 | if (range_end_pfn > prev_end_pfn) | 3189 | if (range_end_pfn > prev_end_pfn) |
3178 | hole_pages += range_end_pfn - | 3190 | hole_pages += range_end_pfn - |
3179 | max(range_start_pfn, prev_end_pfn); | 3191 | max(range_start_pfn, prev_end_pfn); |
3180 | 3192 | ||
3181 | return hole_pages; | 3193 | return hole_pages; |
3182 | } | 3194 | } |
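The walk above accumulates holes incrementally; the same answer can be obtained by subtracting the covered pages from the whole span. A self-contained userspace sketch of that simpler equivalent (sketch_* names are invented; ranges are half-open and assumed sorted and non-overlapping):

#include <assert.h>

struct sketch_range { unsigned long start, end; };      /* [start, end) */

/* Simpler equivalent of the walk above: clip each active range to the
 * query window, sum the covered pages, subtract from the span. */
static unsigned long sketch_absent_pages(const struct sketch_range *map, int n,
                                         unsigned long range_start,
                                         unsigned long range_end)
{
        unsigned long covered = 0;
        int i;

        for (i = 0; i < n; i++) {
                unsigned long s = map[i].start, e = map[i].end;

                if (s < range_start)
                        s = range_start;
                if (e > range_end)
                        e = range_end;
                if (s < e)
                        covered += e - s;
        }
        return (range_end - range_start) - covered;
}

int main(void)
{
        struct sketch_range map[] = { { 1000, 2000 }, { 3000, 4000 } };

        /* Holes are [0,1000), [2000,3000) and [4000,5000): 3000 pages. */
        assert(sketch_absent_pages(map, 2, 0, 5000) == 3000);
        return 0;
}

With those two active ranges and a query of [0, 5000), both approaches report 3000 absent pages.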
3183 | 3195 | ||
3184 | /** | 3196 | /** |
3185 | * absent_pages_in_range - Return number of page frames in holes within a range | 3197 | * absent_pages_in_range - Return number of page frames in holes within a range |
3186 | * @start_pfn: The start PFN to start searching for holes | 3198 | * @start_pfn: The start PFN to start searching for holes |
3187 | * @end_pfn: The end PFN to stop searching for holes | 3199 | * @end_pfn: The end PFN to stop searching for holes |
3188 | * | 3200 | * |
3189 | * It returns the number of page frames in memory holes within a range. | 3201 | * It returns the number of page frames in memory holes within a range. |
3190 | */ | 3202 | */ |
3191 | unsigned long __init absent_pages_in_range(unsigned long start_pfn, | 3203 | unsigned long __init absent_pages_in_range(unsigned long start_pfn, |
3192 | unsigned long end_pfn) | 3204 | unsigned long end_pfn) |
3193 | { | 3205 | { |
3194 | return __absent_pages_in_range(MAX_NUMNODES, start_pfn, end_pfn); | 3206 | return __absent_pages_in_range(MAX_NUMNODES, start_pfn, end_pfn); |
3195 | } | 3207 | } |
3196 | 3208 | ||
3197 | /* Return the number of page frames in holes in a zone on a node */ | 3209 | /* Return the number of page frames in holes in a zone on a node */ |
3198 | static unsigned long __meminit zone_absent_pages_in_node(int nid, | 3210 | static unsigned long __meminit zone_absent_pages_in_node(int nid, |
3199 | unsigned long zone_type, | 3211 | unsigned long zone_type, |
3200 | unsigned long *ignored) | 3212 | unsigned long *ignored) |
3201 | { | 3213 | { |
3202 | unsigned long node_start_pfn, node_end_pfn; | 3214 | unsigned long node_start_pfn, node_end_pfn; |
3203 | unsigned long zone_start_pfn, zone_end_pfn; | 3215 | unsigned long zone_start_pfn, zone_end_pfn; |
3204 | 3216 | ||
3205 | get_pfn_range_for_nid(nid, &node_start_pfn, &node_end_pfn); | 3217 | get_pfn_range_for_nid(nid, &node_start_pfn, &node_end_pfn); |
3206 | zone_start_pfn = max(arch_zone_lowest_possible_pfn[zone_type], | 3218 | zone_start_pfn = max(arch_zone_lowest_possible_pfn[zone_type], |
3207 | node_start_pfn); | 3219 | node_start_pfn); |
3208 | zone_end_pfn = min(arch_zone_highest_possible_pfn[zone_type], | 3220 | zone_end_pfn = min(arch_zone_highest_possible_pfn[zone_type], |
3209 | node_end_pfn); | 3221 | node_end_pfn); |
3210 | 3222 | ||
3211 | adjust_zone_range_for_zone_movable(nid, zone_type, | 3223 | adjust_zone_range_for_zone_movable(nid, zone_type, |
3212 | node_start_pfn, node_end_pfn, | 3224 | node_start_pfn, node_end_pfn, |
3213 | &zone_start_pfn, &zone_end_pfn); | 3225 | &zone_start_pfn, &zone_end_pfn); |
3214 | return __absent_pages_in_range(nid, zone_start_pfn, zone_end_pfn); | 3226 | return __absent_pages_in_range(nid, zone_start_pfn, zone_end_pfn); |
3215 | } | 3227 | } |
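As an aside on the clamping above: the zone's architectural PFN limits are intersected with the node's span before holes are counted. A minimal standalone sketch, using hypothetical PFN values rather than anything taken from this file:

#include <stdio.h>

/* Illustration of the clamping in zone_absent_pages_in_node():
 * intersect the zone's possible PFN range with the node's PFN span
 * before counting holes. All numbers below are hypothetical. */
static unsigned long max_ul(unsigned long a, unsigned long b) { return a > b ? a : b; }
static unsigned long min_ul(unsigned long a, unsigned long b) { return a < b ? a : b; }

int main(void)
{
	unsigned long node_start_pfn = 0x10000, node_end_pfn = 0x90000; /* node span */
	unsigned long zone_low = 0x00000, zone_high = 0x40000;          /* a low zone's limits */

	unsigned long zone_start_pfn = max_ul(zone_low, node_start_pfn); /* -> 0x10000 */
	unsigned long zone_end_pfn   = min_ul(zone_high, node_end_pfn);  /* -> 0x40000 */

	printf("count holes in [%#lx, %#lx)\n", zone_start_pfn, zone_end_pfn);
	return 0;
}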
3216 | 3228 | ||
3217 | #else | 3229 | #else |
3218 | static inline unsigned long __meminit zone_spanned_pages_in_node(int nid, | 3230 | static inline unsigned long __meminit zone_spanned_pages_in_node(int nid, |
3219 | unsigned long zone_type, | 3231 | unsigned long zone_type, |
3220 | unsigned long *zones_size) | 3232 | unsigned long *zones_size) |
3221 | { | 3233 | { |
3222 | return zones_size[zone_type]; | 3234 | return zones_size[zone_type]; |
3223 | } | 3235 | } |
3224 | 3236 | ||
3225 | static inline unsigned long __meminit zone_absent_pages_in_node(int nid, | 3237 | static inline unsigned long __meminit zone_absent_pages_in_node(int nid, |
3226 | unsigned long zone_type, | 3238 | unsigned long zone_type, |
3227 | unsigned long *zholes_size) | 3239 | unsigned long *zholes_size) |
3228 | { | 3240 | { |
3229 | if (!zholes_size) | 3241 | if (!zholes_size) |
3230 | return 0; | 3242 | return 0; |
3231 | 3243 | ||
3232 | return zholes_size[zone_type]; | 3244 | return zholes_size[zone_type]; |
3233 | } | 3245 | } |
3234 | 3246 | ||
3235 | #endif | 3247 | #endif |
3236 | 3248 | ||
3237 | static void __meminit calculate_node_totalpages(struct pglist_data *pgdat, | 3249 | static void __meminit calculate_node_totalpages(struct pglist_data *pgdat, |
3238 | unsigned long *zones_size, unsigned long *zholes_size) | 3250 | unsigned long *zones_size, unsigned long *zholes_size) |
3239 | { | 3251 | { |
3240 | unsigned long realtotalpages, totalpages = 0; | 3252 | unsigned long realtotalpages, totalpages = 0; |
3241 | enum zone_type i; | 3253 | enum zone_type i; |
3242 | 3254 | ||
3243 | for (i = 0; i < MAX_NR_ZONES; i++) | 3255 | for (i = 0; i < MAX_NR_ZONES; i++) |
3244 | totalpages += zone_spanned_pages_in_node(pgdat->node_id, i, | 3256 | totalpages += zone_spanned_pages_in_node(pgdat->node_id, i, |
3245 | zones_size); | 3257 | zones_size); |
3246 | pgdat->node_spanned_pages = totalpages; | 3258 | pgdat->node_spanned_pages = totalpages; |
3247 | 3259 | ||
3248 | realtotalpages = totalpages; | 3260 | realtotalpages = totalpages; |
3249 | for (i = 0; i < MAX_NR_ZONES; i++) | 3261 | for (i = 0; i < MAX_NR_ZONES; i++) |
3250 | realtotalpages -= | 3262 | realtotalpages -= |
3251 | zone_absent_pages_in_node(pgdat->node_id, i, | 3263 | zone_absent_pages_in_node(pgdat->node_id, i, |
3252 | zholes_size); | 3264 | zholes_size); |
3253 | pgdat->node_present_pages = realtotalpages; | 3265 | pgdat->node_present_pages = realtotalpages; |
3254 | printk(KERN_DEBUG "On node %d totalpages: %lu\n", pgdat->node_id, | 3266 | printk(KERN_DEBUG "On node %d totalpages: %lu\n", pgdat->node_id, |
3255 | realtotalpages); | 3267 | realtotalpages); |
3256 | } | 3268 | } |
3257 | 3269 | ||
3258 | #ifndef CONFIG_SPARSEMEM | 3270 | #ifndef CONFIG_SPARSEMEM |
3259 | /* | 3271 | /* |
3260 | * Calculate the size of the zone->blockflags rounded to an unsigned long | 3272 | * Calculate the size of the zone->blockflags rounded to an unsigned long |
3261 | * Start by making sure zonesize is a multiple of pageblock_order by rounding | 3273 | * Start by making sure zonesize is a multiple of pageblock_order by rounding |
3262 | * up. Then use 1 NR_PAGEBLOCK_BITS worth of bits per pageblock, finally | 3274 | * up. Then use 1 NR_PAGEBLOCK_BITS worth of bits per pageblock, finally |
3263 | * round what is now in bits to the nearest long in bits, then return it in | 3275 | * round what is now in bits to the nearest long in bits, then return it in |
3264 | * bytes. | 3276 | * bytes. |
3265 | */ | 3277 | */ |
3266 | static unsigned long __init usemap_size(unsigned long zonesize) | 3278 | static unsigned long __init usemap_size(unsigned long zonesize) |
3267 | { | 3279 | { |
3268 | unsigned long usemapsize; | 3280 | unsigned long usemapsize; |
3269 | 3281 | ||
3270 | usemapsize = roundup(zonesize, pageblock_nr_pages); | 3282 | usemapsize = roundup(zonesize, pageblock_nr_pages); |
3271 | usemapsize = usemapsize >> pageblock_order; | 3283 | usemapsize = usemapsize >> pageblock_order; |
3272 | usemapsize *= NR_PAGEBLOCK_BITS; | 3284 | usemapsize *= NR_PAGEBLOCK_BITS; |
3273 | usemapsize = roundup(usemapsize, 8 * sizeof(unsigned long)); | 3285 | usemapsize = roundup(usemapsize, 8 * sizeof(unsigned long)); |
3274 | 3286 | ||
3275 | return usemapsize / 8; | 3287 | return usemapsize / 8; |
3276 | } | 3288 | } |
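For reference, the arithmetic in usemap_size() can be reproduced in a small userspace sketch. The pageblock order, the bits-per-pageblock value and the zone size below are assumptions chosen for illustration, not the kernel's configured values:

#include <stdio.h>

/* Standalone sketch of the usemap_size() arithmetic above.
 * The constants are assumptions for illustration only; the real
 * values come from the kernel configuration. */
#define PAGEBLOCK_ORDER     9                       /* assumed */
#define PAGEBLOCK_NR_PAGES  (1UL << PAGEBLOCK_ORDER)
#define NR_PAGEBLOCK_BITS   3                       /* assumed bits per pageblock */
#define BITS_PER_LONG       (8 * sizeof(unsigned long))

static unsigned long roundup_ul(unsigned long x, unsigned long to)
{
	return ((x + to - 1) / to) * to;
}

static unsigned long usemap_size_sketch(unsigned long zonesize)
{
	unsigned long usemapsize;

	usemapsize = roundup_ul(zonesize, PAGEBLOCK_NR_PAGES); /* whole pageblocks */
	usemapsize >>= PAGEBLOCK_ORDER;                        /* number of pageblocks */
	usemapsize *= NR_PAGEBLOCK_BITS;                       /* bits needed */
	usemapsize = roundup_ul(usemapsize, BITS_PER_LONG);    /* round to whole longs */
	return usemapsize / 8;                                 /* bytes */
}

int main(void)
{
	/* e.g. a hypothetical 262144-page zone (1 GiB of 4 KiB pages) -> 192 bytes */
	printf("%lu bytes\n", usemap_size_sketch(262144UL));
	return 0;
}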
3277 | 3289 | ||
3278 | static void __init setup_usemap(struct pglist_data *pgdat, | 3290 | static void __init setup_usemap(struct pglist_data *pgdat, |
3279 | struct zone *zone, unsigned long zonesize) | 3291 | struct zone *zone, unsigned long zonesize) |
3280 | { | 3292 | { |
3281 | unsigned long usemapsize = usemap_size(zonesize); | 3293 | unsigned long usemapsize = usemap_size(zonesize); |
3282 | zone->pageblock_flags = NULL; | 3294 | zone->pageblock_flags = NULL; |
3283 | if (usemapsize) { | 3295 | if (usemapsize) { |
3284 | zone->pageblock_flags = alloc_bootmem_node(pgdat, usemapsize); | 3296 | zone->pageblock_flags = alloc_bootmem_node(pgdat, usemapsize); |
3285 | memset(zone->pageblock_flags, 0, usemapsize); | 3297 | memset(zone->pageblock_flags, 0, usemapsize); |
3286 | } | 3298 | } |
3287 | } | 3299 | } |
3288 | #else | 3300 | #else |
3289 | static void inline setup_usemap(struct pglist_data *pgdat, | 3301 | static void inline setup_usemap(struct pglist_data *pgdat, |
3290 | struct zone *zone, unsigned long zonesize) {} | 3302 | struct zone *zone, unsigned long zonesize) {} |
3291 | #endif /* CONFIG_SPARSEMEM */ | 3303 | #endif /* CONFIG_SPARSEMEM */ |
3292 | 3304 | ||
3293 | #ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE | 3305 | #ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE |
3294 | 3306 | ||
3295 | /* Return a sensible default order for the pageblock size. */ | 3307 | /* Return a sensible default order for the pageblock size. */ |
3296 | static inline int pageblock_default_order(void) | 3308 | static inline int pageblock_default_order(void) |
3297 | { | 3309 | { |
3298 | if (HPAGE_SHIFT > PAGE_SHIFT) | 3310 | if (HPAGE_SHIFT > PAGE_SHIFT) |
3299 | return HUGETLB_PAGE_ORDER; | 3311 | return HUGETLB_PAGE_ORDER; |
3300 | 3312 | ||
3301 | return MAX_ORDER-1; | 3313 | return MAX_ORDER-1; |
3302 | } | 3314 | } |
3303 | 3315 | ||
3304 | /* Initialise the number of pages represented by NR_PAGEBLOCK_BITS */ | 3316 | /* Initialise the number of pages represented by NR_PAGEBLOCK_BITS */ |
3305 | static inline void __init set_pageblock_order(unsigned int order) | 3317 | static inline void __init set_pageblock_order(unsigned int order) |
3306 | { | 3318 | { |
3307 | /* Check that pageblock_nr_pages has not already been setup */ | 3319 | /* Check that pageblock_nr_pages has not already been setup */ |
3308 | if (pageblock_order) | 3320 | if (pageblock_order) |
3309 | return; | 3321 | return; |
3310 | 3322 | ||
3311 | /* | 3323 | /* |
3312 | * Assume the largest contiguous order of interest is a huge page. | 3324 | * Assume the largest contiguous order of interest is a huge page. |
3313 | * This value may be variable depending on boot parameters on IA64 | 3325 | * This value may be variable depending on boot parameters on IA64 |
3314 | */ | 3326 | */ |
3315 | pageblock_order = order; | 3327 | pageblock_order = order; |
3316 | } | 3328 | } |
3317 | #else /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */ | 3329 | #else /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */ |
3318 | 3330 | ||
3319 | /* | 3331 | /* |
3320 | * When CONFIG_HUGETLB_PAGE_SIZE_VARIABLE is not set, set_pageblock_order() | 3332 | * When CONFIG_HUGETLB_PAGE_SIZE_VARIABLE is not set, set_pageblock_order() |
3321 | * and pageblock_default_order() are unused as pageblock_order is set | 3333 | * and pageblock_default_order() are unused as pageblock_order is set |
3322 | * at compile-time. See include/linux/pageblock-flags.h for the values of | 3334 | * at compile-time. See include/linux/pageblock-flags.h for the values of |
3323 | * pageblock_order based on the kernel config | 3335 | * pageblock_order based on the kernel config |
3324 | */ | 3336 | */ |
3325 | static inline int pageblock_default_order(unsigned int order) | 3337 | static inline int pageblock_default_order(unsigned int order) |
3326 | { | 3338 | { |
3327 | return MAX_ORDER-1; | 3339 | return MAX_ORDER-1; |
3328 | } | 3340 | } |
3329 | #define set_pageblock_order(x) do {} while (0) | 3341 | #define set_pageblock_order(x) do {} while (0) |
3330 | 3342 | ||
3331 | #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */ | 3343 | #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */ |
3332 | 3344 | ||
3333 | /* | 3345 | /* |
3334 | * Set up the zone data structures: | 3346 | * Set up the zone data structures: |
3335 | * - mark all pages reserved | 3347 | * - mark all pages reserved |
3336 | * - mark all memory queues empty | 3348 | * - mark all memory queues empty |
3337 | * - clear the memory bitmaps | 3349 | * - clear the memory bitmaps |
3338 | */ | 3350 | */ |
3339 | static void __paginginit free_area_init_core(struct pglist_data *pgdat, | 3351 | static void __paginginit free_area_init_core(struct pglist_data *pgdat, |
3340 | unsigned long *zones_size, unsigned long *zholes_size) | 3352 | unsigned long *zones_size, unsigned long *zholes_size) |
3341 | { | 3353 | { |
3342 | enum zone_type j; | 3354 | enum zone_type j; |
3343 | int nid = pgdat->node_id; | 3355 | int nid = pgdat->node_id; |
3344 | unsigned long zone_start_pfn = pgdat->node_start_pfn; | 3356 | unsigned long zone_start_pfn = pgdat->node_start_pfn; |
3345 | int ret; | 3357 | int ret; |
3346 | 3358 | ||
3347 | pgdat_resize_init(pgdat); | 3359 | pgdat_resize_init(pgdat); |
3348 | pgdat->nr_zones = 0; | 3360 | pgdat->nr_zones = 0; |
3349 | init_waitqueue_head(&pgdat->kswapd_wait); | 3361 | init_waitqueue_head(&pgdat->kswapd_wait); |
3350 | pgdat->kswapd_max_order = 0; | 3362 | pgdat->kswapd_max_order = 0; |
3351 | 3363 | ||
3352 | for (j = 0; j < MAX_NR_ZONES; j++) { | 3364 | for (j = 0; j < MAX_NR_ZONES; j++) { |
3353 | struct zone *zone = pgdat->node_zones + j; | 3365 | struct zone *zone = pgdat->node_zones + j; |
3354 | unsigned long size, realsize, memmap_pages; | 3366 | unsigned long size, realsize, memmap_pages; |
3355 | 3367 | ||
3356 | size = zone_spanned_pages_in_node(nid, j, zones_size); | 3368 | size = zone_spanned_pages_in_node(nid, j, zones_size); |
3357 | realsize = size - zone_absent_pages_in_node(nid, j, | 3369 | realsize = size - zone_absent_pages_in_node(nid, j, |
3358 | zholes_size); | 3370 | zholes_size); |
3359 | 3371 | ||
3360 | /* | 3372 | /* |
3361 | * Adjust realsize so that it accounts for how much memory | 3373 | * Adjust realsize so that it accounts for how much memory |
3362 | * is used by this zone for memmap. This affects the watermark | 3374 | * is used by this zone for memmap. This affects the watermark |
3363 | * and per-cpu initialisations | 3375 | * and per-cpu initialisations |
3364 | */ | 3376 | */ |
3365 | memmap_pages = (size * sizeof(struct page)) >> PAGE_SHIFT; | 3377 | memmap_pages = (size * sizeof(struct page)) >> PAGE_SHIFT; |
3366 | if (realsize >= memmap_pages) { | 3378 | if (realsize >= memmap_pages) { |
3367 | realsize -= memmap_pages; | 3379 | realsize -= memmap_pages; |
3368 | printk(KERN_DEBUG | 3380 | printk(KERN_DEBUG |
3369 | " %s zone: %lu pages used for memmap\n", | 3381 | " %s zone: %lu pages used for memmap\n", |
3370 | zone_names[j], memmap_pages); | 3382 | zone_names[j], memmap_pages); |
3371 | } else | 3383 | } else |
3372 | printk(KERN_WARNING | 3384 | printk(KERN_WARNING |
3373 | " %s zone: %lu pages exceeds realsize %lu\n", | 3385 | " %s zone: %lu pages exceeds realsize %lu\n", |
3374 | zone_names[j], memmap_pages, realsize); | 3386 | zone_names[j], memmap_pages, realsize); |
3375 | 3387 | ||
3376 | /* Account for reserved pages */ | 3388 | /* Account for reserved pages */ |
3377 | if (j == 0 && realsize > dma_reserve) { | 3389 | if (j == 0 && realsize > dma_reserve) { |
3378 | realsize -= dma_reserve; | 3390 | realsize -= dma_reserve; |
3379 | printk(KERN_DEBUG " %s zone: %lu pages reserved\n", | 3391 | printk(KERN_DEBUG " %s zone: %lu pages reserved\n", |
3380 | zone_names[0], dma_reserve); | 3392 | zone_names[0], dma_reserve); |
3381 | } | 3393 | } |
3382 | 3394 | ||
3383 | if (!is_highmem_idx(j)) | 3395 | if (!is_highmem_idx(j)) |
3384 | nr_kernel_pages += realsize; | 3396 | nr_kernel_pages += realsize; |
3385 | nr_all_pages += realsize; | 3397 | nr_all_pages += realsize; |
3386 | 3398 | ||
3387 | zone->spanned_pages = size; | 3399 | zone->spanned_pages = size; |
3388 | zone->present_pages = realsize; | 3400 | zone->present_pages = realsize; |
3389 | #ifdef CONFIG_NUMA | 3401 | #ifdef CONFIG_NUMA |
3390 | zone->node = nid; | 3402 | zone->node = nid; |
3391 | zone->min_unmapped_pages = (realsize*sysctl_min_unmapped_ratio) | 3403 | zone->min_unmapped_pages = (realsize*sysctl_min_unmapped_ratio) |
3392 | / 100; | 3404 | / 100; |
3393 | zone->min_slab_pages = (realsize * sysctl_min_slab_ratio) / 100; | 3405 | zone->min_slab_pages = (realsize * sysctl_min_slab_ratio) / 100; |
3394 | #endif | 3406 | #endif |
3395 | zone->name = zone_names[j]; | 3407 | zone->name = zone_names[j]; |
3396 | spin_lock_init(&zone->lock); | 3408 | spin_lock_init(&zone->lock); |
3397 | spin_lock_init(&zone->lru_lock); | 3409 | spin_lock_init(&zone->lru_lock); |
3398 | zone_seqlock_init(zone); | 3410 | zone_seqlock_init(zone); |
3399 | zone->zone_pgdat = pgdat; | 3411 | zone->zone_pgdat = pgdat; |
3400 | 3412 | ||
3401 | zone->prev_priority = DEF_PRIORITY; | 3413 | zone->prev_priority = DEF_PRIORITY; |
3402 | 3414 | ||
3403 | zone_pcp_init(zone); | 3415 | zone_pcp_init(zone); |
3404 | INIT_LIST_HEAD(&zone->active_list); | 3416 | INIT_LIST_HEAD(&zone->active_list); |
3405 | INIT_LIST_HEAD(&zone->inactive_list); | 3417 | INIT_LIST_HEAD(&zone->inactive_list); |
3406 | zone->nr_scan_active = 0; | 3418 | zone->nr_scan_active = 0; |
3407 | zone->nr_scan_inactive = 0; | 3419 | zone->nr_scan_inactive = 0; |
3408 | zap_zone_vm_stats(zone); | 3420 | zap_zone_vm_stats(zone); |
3409 | zone->flags = 0; | 3421 | zone->flags = 0; |
3410 | if (!size) | 3422 | if (!size) |
3411 | continue; | 3423 | continue; |
3412 | 3424 | ||
3413 | set_pageblock_order(pageblock_default_order()); | 3425 | set_pageblock_order(pageblock_default_order()); |
3414 | setup_usemap(pgdat, zone, size); | 3426 | setup_usemap(pgdat, zone, size); |
3415 | ret = init_currently_empty_zone(zone, zone_start_pfn, | 3427 | ret = init_currently_empty_zone(zone, zone_start_pfn, |
3416 | size, MEMMAP_EARLY); | 3428 | size, MEMMAP_EARLY); |
3417 | BUG_ON(ret); | 3429 | BUG_ON(ret); |
3418 | zone_start_pfn += size; | 3430 | zone_start_pfn += size; |
3419 | } | 3431 | } |
3420 | } | 3432 | } |
3421 | 3433 | ||
3422 | static void __init_refok alloc_node_mem_map(struct pglist_data *pgdat) | 3434 | static void __init_refok alloc_node_mem_map(struct pglist_data *pgdat) |
3423 | { | 3435 | { |
3424 | /* Skip empty nodes */ | 3436 | /* Skip empty nodes */ |
3425 | if (!pgdat->node_spanned_pages) | 3437 | if (!pgdat->node_spanned_pages) |
3426 | return; | 3438 | return; |
3427 | 3439 | ||
3428 | #ifdef CONFIG_FLAT_NODE_MEM_MAP | 3440 | #ifdef CONFIG_FLAT_NODE_MEM_MAP |
3429 | /* ia64 gets its own node_mem_map, before this, without bootmem */ | 3441 | /* ia64 gets its own node_mem_map, before this, without bootmem */ |
3430 | if (!pgdat->node_mem_map) { | 3442 | if (!pgdat->node_mem_map) { |
3431 | unsigned long size, start, end; | 3443 | unsigned long size, start, end; |
3432 | struct page *map; | 3444 | struct page *map; |
3433 | 3445 | ||
3434 | /* | 3446 | /* |
3435 | * The zone's endpoints aren't required to be MAX_ORDER | 3447 | * The zone's endpoints aren't required to be MAX_ORDER |
3436 | * aligned but the node_mem_map endpoints must be in order | 3448 | * aligned but the node_mem_map endpoints must be in order |
3437 | * for the buddy allocator to function correctly. | 3449 | * for the buddy allocator to function correctly. |
3438 | */ | 3450 | */ |
3439 | start = pgdat->node_start_pfn & ~(MAX_ORDER_NR_PAGES - 1); | 3451 | start = pgdat->node_start_pfn & ~(MAX_ORDER_NR_PAGES - 1); |
3440 | end = pgdat->node_start_pfn + pgdat->node_spanned_pages; | 3452 | end = pgdat->node_start_pfn + pgdat->node_spanned_pages; |
3441 | end = ALIGN(end, MAX_ORDER_NR_PAGES); | 3453 | end = ALIGN(end, MAX_ORDER_NR_PAGES); |
3442 | size = (end - start) * sizeof(struct page); | 3454 | size = (end - start) * sizeof(struct page); |
3443 | map = alloc_remap(pgdat->node_id, size); | 3455 | map = alloc_remap(pgdat->node_id, size); |
3444 | if (!map) | 3456 | if (!map) |
3445 | map = alloc_bootmem_node(pgdat, size); | 3457 | map = alloc_bootmem_node(pgdat, size); |
3446 | pgdat->node_mem_map = map + (pgdat->node_start_pfn - start); | 3458 | pgdat->node_mem_map = map + (pgdat->node_start_pfn - start); |
3447 | } | 3459 | } |
3448 | #ifndef CONFIG_NEED_MULTIPLE_NODES | 3460 | #ifndef CONFIG_NEED_MULTIPLE_NODES |
3449 | /* | 3461 | /* |
3450 | * With no DISCONTIG, the global mem_map is just set as node 0's | 3462 | * With no DISCONTIG, the global mem_map is just set as node 0's |
3451 | */ | 3463 | */ |
3452 | if (pgdat == NODE_DATA(0)) { | 3464 | if (pgdat == NODE_DATA(0)) { |
3453 | mem_map = NODE_DATA(0)->node_mem_map; | 3465 | mem_map = NODE_DATA(0)->node_mem_map; |
3454 | #ifdef CONFIG_ARCH_POPULATES_NODE_MAP | 3466 | #ifdef CONFIG_ARCH_POPULATES_NODE_MAP |
3455 | if (page_to_pfn(mem_map) != pgdat->node_start_pfn) | 3467 | if (page_to_pfn(mem_map) != pgdat->node_start_pfn) |
3456 | mem_map -= (pgdat->node_start_pfn - ARCH_PFN_OFFSET); | 3468 | mem_map -= (pgdat->node_start_pfn - ARCH_PFN_OFFSET); |
3457 | #endif /* CONFIG_ARCH_POPULATES_NODE_MAP */ | 3469 | #endif /* CONFIG_ARCH_POPULATES_NODE_MAP */ |
3458 | } | 3470 | } |
3459 | #endif | 3471 | #endif |
3460 | #endif /* CONFIG_FLAT_NODE_MEM_MAP */ | 3472 | #endif /* CONFIG_FLAT_NODE_MEM_MAP */ |
3461 | } | 3473 | } |
3462 | 3474 | ||
3463 | void __paginginit free_area_init_node(int nid, struct pglist_data *pgdat, | 3475 | void __paginginit free_area_init_node(int nid, struct pglist_data *pgdat, |
3464 | unsigned long *zones_size, unsigned long node_start_pfn, | 3476 | unsigned long *zones_size, unsigned long node_start_pfn, |
3465 | unsigned long *zholes_size) | 3477 | unsigned long *zholes_size) |
3466 | { | 3478 | { |
3467 | pgdat->node_id = nid; | 3479 | pgdat->node_id = nid; |
3468 | pgdat->node_start_pfn = node_start_pfn; | 3480 | pgdat->node_start_pfn = node_start_pfn; |
3469 | calculate_node_totalpages(pgdat, zones_size, zholes_size); | 3481 | calculate_node_totalpages(pgdat, zones_size, zholes_size); |
3470 | 3482 | ||
3471 | alloc_node_mem_map(pgdat); | 3483 | alloc_node_mem_map(pgdat); |
3472 | 3484 | ||
3473 | free_area_init_core(pgdat, zones_size, zholes_size); | 3485 | free_area_init_core(pgdat, zones_size, zholes_size); |
3474 | } | 3486 | } |
3475 | 3487 | ||
3476 | #ifdef CONFIG_ARCH_POPULATES_NODE_MAP | 3488 | #ifdef CONFIG_ARCH_POPULATES_NODE_MAP |
3477 | 3489 | ||
3478 | #if MAX_NUMNODES > 1 | 3490 | #if MAX_NUMNODES > 1 |
3479 | /* | 3491 | /* |
3480 | * Figure out the number of possible node ids. | 3492 | * Figure out the number of possible node ids. |
3481 | */ | 3493 | */ |
3482 | static void __init setup_nr_node_ids(void) | 3494 | static void __init setup_nr_node_ids(void) |
3483 | { | 3495 | { |
3484 | unsigned int node; | 3496 | unsigned int node; |
3485 | unsigned int highest = 0; | 3497 | unsigned int highest = 0; |
3486 | 3498 | ||
3487 | for_each_node_mask(node, node_possible_map) | 3499 | for_each_node_mask(node, node_possible_map) |
3488 | highest = node; | 3500 | highest = node; |
3489 | nr_node_ids = highest + 1; | 3501 | nr_node_ids = highest + 1; |
3490 | } | 3502 | } |
3491 | #else | 3503 | #else |
3492 | static inline void setup_nr_node_ids(void) | 3504 | static inline void setup_nr_node_ids(void) |
3493 | { | 3505 | { |
3494 | } | 3506 | } |
3495 | #endif | 3507 | #endif |
3496 | 3508 | ||
3497 | /** | 3509 | /** |
3498 | * add_active_range - Register a range of PFNs backed by physical memory | 3510 | * add_active_range - Register a range of PFNs backed by physical memory |
3499 | * @nid: The node ID the range resides on | 3511 | * @nid: The node ID the range resides on |
3500 | * @start_pfn: The start PFN of the available physical memory | 3512 | * @start_pfn: The start PFN of the available physical memory |
3501 | * @end_pfn: The end PFN of the available physical memory | 3513 | * @end_pfn: The end PFN of the available physical memory |
3502 | * | 3514 | * |
3503 | * These ranges are stored in an early_node_map[] and later used by | 3515 | * These ranges are stored in an early_node_map[] and later used by |
3504 | * free_area_init_nodes() to calculate zone sizes and holes. If the | 3516 | * free_area_init_nodes() to calculate zone sizes and holes. If the |
3505 | * range spans a memory hole, it is up to the architecture to ensure | 3517 | * range spans a memory hole, it is up to the architecture to ensure |
3506 | * the memory is not freed by the bootmem allocator. If possible | 3518 | * the memory is not freed by the bootmem allocator. If possible |
3507 | * the range being registered will be merged with existing ranges. | 3519 | * the range being registered will be merged with existing ranges. |
3508 | */ | 3520 | */ |
3509 | void __init add_active_range(unsigned int nid, unsigned long start_pfn, | 3521 | void __init add_active_range(unsigned int nid, unsigned long start_pfn, |
3510 | unsigned long end_pfn) | 3522 | unsigned long end_pfn) |
3511 | { | 3523 | { |
3512 | int i; | 3524 | int i; |
3513 | 3525 | ||
3514 | printk(KERN_DEBUG "Entering add_active_range(%d, %lu, %lu) " | 3526 | printk(KERN_DEBUG "Entering add_active_range(%d, %lu, %lu) " |
3515 | "%d entries of %d used\n", | 3527 | "%d entries of %d used\n", |
3516 | nid, start_pfn, end_pfn, | 3528 | nid, start_pfn, end_pfn, |
3517 | nr_nodemap_entries, MAX_ACTIVE_REGIONS); | 3529 | nr_nodemap_entries, MAX_ACTIVE_REGIONS); |
3518 | 3530 | ||
3519 | /* Merge with existing active regions if possible */ | 3531 | /* Merge with existing active regions if possible */ |
3520 | for (i = 0; i < nr_nodemap_entries; i++) { | 3532 | for (i = 0; i < nr_nodemap_entries; i++) { |
3521 | if (early_node_map[i].nid != nid) | 3533 | if (early_node_map[i].nid != nid) |
3522 | continue; | 3534 | continue; |
3523 | 3535 | ||
3524 | /* Skip if an existing region covers this new one */ | 3536 | /* Skip if an existing region covers this new one */ |
3525 | if (start_pfn >= early_node_map[i].start_pfn && | 3537 | if (start_pfn >= early_node_map[i].start_pfn && |
3526 | end_pfn <= early_node_map[i].end_pfn) | 3538 | end_pfn <= early_node_map[i].end_pfn) |
3527 | return; | 3539 | return; |
3528 | 3540 | ||
3529 | /* Merge forward if suitable */ | 3541 | /* Merge forward if suitable */ |
3530 | if (start_pfn <= early_node_map[i].end_pfn && | 3542 | if (start_pfn <= early_node_map[i].end_pfn && |
3531 | end_pfn > early_node_map[i].end_pfn) { | 3543 | end_pfn > early_node_map[i].end_pfn) { |
3532 | early_node_map[i].end_pfn = end_pfn; | 3544 | early_node_map[i].end_pfn = end_pfn; |
3533 | return; | 3545 | return; |
3534 | } | 3546 | } |
3535 | 3547 | ||
3536 | /* Merge backward if suitable */ | 3548 | /* Merge backward if suitable */ |
3537 | if (start_pfn < early_node_map[i].end_pfn && | 3549 | if (start_pfn < early_node_map[i].end_pfn && |
3538 | end_pfn >= early_node_map[i].start_pfn) { | 3550 | end_pfn >= early_node_map[i].start_pfn) { |
3539 | early_node_map[i].start_pfn = start_pfn; | 3551 | early_node_map[i].start_pfn = start_pfn; |
3540 | return; | 3552 | return; |
3541 | } | 3553 | } |
3542 | } | 3554 | } |
3543 | 3555 | ||
3544 | /* Check that early_node_map is large enough */ | 3556 | /* Check that early_node_map is large enough */ |
3545 | if (i >= MAX_ACTIVE_REGIONS) { | 3557 | if (i >= MAX_ACTIVE_REGIONS) { |
3546 | printk(KERN_CRIT "More than %d memory regions, truncating\n", | 3558 | printk(KERN_CRIT "More than %d memory regions, truncating\n", |
3547 | MAX_ACTIVE_REGIONS); | 3559 | MAX_ACTIVE_REGIONS); |
3548 | return; | 3560 | return; |
3549 | } | 3561 | } |
3550 | 3562 | ||
3551 | early_node_map[i].nid = nid; | 3563 | early_node_map[i].nid = nid; |
3552 | early_node_map[i].start_pfn = start_pfn; | 3564 | early_node_map[i].start_pfn = start_pfn; |
3553 | early_node_map[i].end_pfn = end_pfn; | 3565 | early_node_map[i].end_pfn = end_pfn; |
3554 | nr_nodemap_entries = i + 1; | 3566 | nr_nodemap_entries = i + 1; |
3555 | } | 3567 | } |
3556 | 3568 | ||
3557 | /** | 3569 | /** |
3558 | * shrink_active_range - Shrink an existing registered range of PFNs | 3570 | * shrink_active_range - Shrink an existing registered range of PFNs |
3559 | * @nid: The node id the range is on that should be shrunk | 3571 | * @nid: The node id the range is on that should be shrunk |
3560 | * @old_end_pfn: The old end PFN of the range | 3572 | * @old_end_pfn: The old end PFN of the range |
3561 | * @new_end_pfn: The new end PFN of the range | 3573 | * @new_end_pfn: The new end PFN of the range |
3562 | * | 3574 | * |
3563 | * i386 with NUMA uses alloc_remap() to store a node_mem_map on a local node. | 3575 | * i386 with NUMA uses alloc_remap() to store a node_mem_map on a local node. |
3564 | * The map is kept at the end of the physical page range that has already been | 3576 | * The map is kept at the end of the physical page range that has already been |
3565 | * registered with add_active_range(). This function allows an arch to shrink | 3577 | * registered with add_active_range(). This function allows an arch to shrink |
3566 | * an existing registered range. | 3578 | * an existing registered range. |
3567 | */ | 3579 | */ |
3568 | void __init shrink_active_range(unsigned int nid, unsigned long old_end_pfn, | 3580 | void __init shrink_active_range(unsigned int nid, unsigned long old_end_pfn, |
3569 | unsigned long new_end_pfn) | 3581 | unsigned long new_end_pfn) |
3570 | { | 3582 | { |
3571 | int i; | 3583 | int i; |
3572 | 3584 | ||
3573 | /* Find the old active region end and shrink */ | 3585 | /* Find the old active region end and shrink */ |
3574 | for_each_active_range_index_in_nid(i, nid) | 3586 | for_each_active_range_index_in_nid(i, nid) |
3575 | if (early_node_map[i].end_pfn == old_end_pfn) { | 3587 | if (early_node_map[i].end_pfn == old_end_pfn) { |
3576 | early_node_map[i].end_pfn = new_end_pfn; | 3588 | early_node_map[i].end_pfn = new_end_pfn; |
3577 | break; | 3589 | break; |
3578 | } | 3590 | } |
3579 | } | 3591 | } |
3580 | 3592 | ||
3581 | /** | 3593 | /** |
3582 | * remove_all_active_ranges - Remove all currently registered regions | 3594 | * remove_all_active_ranges - Remove all currently registered regions |
3583 | * | 3595 | * |
3584 | * During discovery, it may be found that a table like SRAT is invalid | 3596 | * During discovery, it may be found that a table like SRAT is invalid |
3585 | * and an alternative discovery method must be used. This function removes | 3597 | * and an alternative discovery method must be used. This function removes |
3586 | * all currently registered regions. | 3598 | * all currently registered regions. |
3587 | */ | 3599 | */ |
3588 | void __init remove_all_active_ranges(void) | 3600 | void __init remove_all_active_ranges(void) |
3589 | { | 3601 | { |
3590 | memset(early_node_map, 0, sizeof(early_node_map)); | 3602 | memset(early_node_map, 0, sizeof(early_node_map)); |
3591 | nr_nodemap_entries = 0; | 3603 | nr_nodemap_entries = 0; |
3592 | #ifdef CONFIG_MEMORY_HOTPLUG_RESERVE | 3604 | #ifdef CONFIG_MEMORY_HOTPLUG_RESERVE |
3593 | memset(node_boundary_start_pfn, 0, sizeof(node_boundary_start_pfn)); | 3605 | memset(node_boundary_start_pfn, 0, sizeof(node_boundary_start_pfn)); |
3594 | memset(node_boundary_end_pfn, 0, sizeof(node_boundary_end_pfn)); | 3606 | memset(node_boundary_end_pfn, 0, sizeof(node_boundary_end_pfn)); |
3595 | #endif /* CONFIG_MEMORY_HOTPLUG_RESERVE */ | 3607 | #endif /* CONFIG_MEMORY_HOTPLUG_RESERVE */ |
3596 | } | 3608 | } |
3597 | 3609 | ||
3598 | /* Compare two active node_active_regions */ | 3610 | /* Compare two active node_active_regions */ |
3599 | static int __init cmp_node_active_region(const void *a, const void *b) | 3611 | static int __init cmp_node_active_region(const void *a, const void *b) |
3600 | { | 3612 | { |
3601 | struct node_active_region *arange = (struct node_active_region *)a; | 3613 | struct node_active_region *arange = (struct node_active_region *)a; |
3602 | struct node_active_region *brange = (struct node_active_region *)b; | 3614 | struct node_active_region *brange = (struct node_active_region *)b; |
3603 | 3615 | ||
3604 | /* Done this way to avoid overflows */ | 3616 | /* Done this way to avoid overflows */ |
3605 | if (arange->start_pfn > brange->start_pfn) | 3617 | if (arange->start_pfn > brange->start_pfn) |
3606 | return 1; | 3618 | return 1; |
3607 | if (arange->start_pfn < brange->start_pfn) | 3619 | if (arange->start_pfn < brange->start_pfn) |
3608 | return -1; | 3620 | return -1; |
3609 | 3621 | ||
3610 | return 0; | 3622 | return 0; |
3611 | } | 3623 | } |
3612 | 3624 | ||
3613 | /* sort the node_map by start_pfn */ | 3625 | /* sort the node_map by start_pfn */ |
3614 | static void __init sort_node_map(void) | 3626 | static void __init sort_node_map(void) |
3615 | { | 3627 | { |
3616 | sort(early_node_map, (size_t)nr_nodemap_entries, | 3628 | sort(early_node_map, (size_t)nr_nodemap_entries, |
3617 | sizeof(struct node_active_region), | 3629 | sizeof(struct node_active_region), |
3618 | cmp_node_active_region, NULL); | 3630 | cmp_node_active_region, NULL); |
3619 | } | 3631 | } |
3620 | 3632 | ||
3621 | /* Find the lowest pfn for a node */ | 3633 | /* Find the lowest pfn for a node */ |
3622 | unsigned long __init find_min_pfn_for_node(unsigned long nid) | 3634 | unsigned long __init find_min_pfn_for_node(unsigned long nid) |
3623 | { | 3635 | { |
3624 | int i; | 3636 | int i; |
3625 | unsigned long min_pfn = ULONG_MAX; | 3637 | unsigned long min_pfn = ULONG_MAX; |
3626 | 3638 | ||
3627 | /* Assuming a sorted map, the first range found has the starting pfn */ | 3639 | /* Assuming a sorted map, the first range found has the starting pfn */ |
3628 | for_each_active_range_index_in_nid(i, nid) | 3640 | for_each_active_range_index_in_nid(i, nid) |
3629 | min_pfn = min(min_pfn, early_node_map[i].start_pfn); | 3641 | min_pfn = min(min_pfn, early_node_map[i].start_pfn); |
3630 | 3642 | ||
3631 | if (min_pfn == ULONG_MAX) { | 3643 | if (min_pfn == ULONG_MAX) { |
3632 | printk(KERN_WARNING | 3644 | printk(KERN_WARNING |
3633 | "Could not find start_pfn for node %lu\n", nid); | 3645 | "Could not find start_pfn for node %lu\n", nid); |
3634 | return 0; | 3646 | return 0; |
3635 | } | 3647 | } |
3636 | 3648 | ||
3637 | return min_pfn; | 3649 | return min_pfn; |
3638 | } | 3650 | } |
3639 | 3651 | ||
3640 | /** | 3652 | /** |
3641 | * find_min_pfn_with_active_regions - Find the minimum PFN registered | 3653 | * find_min_pfn_with_active_regions - Find the minimum PFN registered |
3642 | * | 3654 | * |
3643 | * It returns the minimum PFN based on information provided via | 3655 | * It returns the minimum PFN based on information provided via |
3644 | * add_active_range(). | 3656 | * add_active_range(). |
3645 | */ | 3657 | */ |
3646 | unsigned long __init find_min_pfn_with_active_regions(void) | 3658 | unsigned long __init find_min_pfn_with_active_regions(void) |
3647 | { | 3659 | { |
3648 | return find_min_pfn_for_node(MAX_NUMNODES); | 3660 | return find_min_pfn_for_node(MAX_NUMNODES); |
3649 | } | 3661 | } |
3650 | 3662 | ||
3651 | /** | 3663 | /** |
3652 | * find_max_pfn_with_active_regions - Find the maximum PFN registered | 3664 | * find_max_pfn_with_active_regions - Find the maximum PFN registered |
3653 | * | 3665 | * |
3654 | * It returns the maximum PFN based on information provided via | 3666 | * It returns the maximum PFN based on information provided via |
3655 | * add_active_range(). | 3667 | * add_active_range(). |
3656 | */ | 3668 | */ |
3657 | unsigned long __init find_max_pfn_with_active_regions(void) | 3669 | unsigned long __init find_max_pfn_with_active_regions(void) |
3658 | { | 3670 | { |
3659 | int i; | 3671 | int i; |
3660 | unsigned long max_pfn = 0; | 3672 | unsigned long max_pfn = 0; |
3661 | 3673 | ||
3662 | for (i = 0; i < nr_nodemap_entries; i++) | 3674 | for (i = 0; i < nr_nodemap_entries; i++) |
3663 | max_pfn = max(max_pfn, early_node_map[i].end_pfn); | 3675 | max_pfn = max(max_pfn, early_node_map[i].end_pfn); |
3664 | 3676 | ||
3665 | return max_pfn; | 3677 | return max_pfn; |
3666 | } | 3678 | } |
3667 | 3679 | ||
3668 | /* | 3680 | /* |
3669 | * early_calculate_totalpages() | 3681 | * early_calculate_totalpages() |
3670 | * Sum pages in active regions for movable zone. | 3682 | * Sum pages in active regions for movable zone. |
3671 | * Populate N_HIGH_MEMORY for calculating usable_nodes. | 3683 | * Populate N_HIGH_MEMORY for calculating usable_nodes. |
3672 | */ | 3684 | */ |
3673 | static unsigned long __init early_calculate_totalpages(void) | 3685 | static unsigned long __init early_calculate_totalpages(void) |
3674 | { | 3686 | { |
3675 | int i; | 3687 | int i; |
3676 | unsigned long totalpages = 0; | 3688 | unsigned long totalpages = 0; |
3677 | 3689 | ||
3678 | for (i = 0; i < nr_nodemap_entries; i++) { | 3690 | for (i = 0; i < nr_nodemap_entries; i++) { |
3679 | unsigned long pages = early_node_map[i].end_pfn - | 3691 | unsigned long pages = early_node_map[i].end_pfn - |
3680 | early_node_map[i].start_pfn; | 3692 | early_node_map[i].start_pfn; |
3681 | totalpages += pages; | 3693 | totalpages += pages; |
3682 | if (pages) | 3694 | if (pages) |
3683 | node_set_state(early_node_map[i].nid, N_HIGH_MEMORY); | 3695 | node_set_state(early_node_map[i].nid, N_HIGH_MEMORY); |
3684 | } | 3696 | } |
3685 | return totalpages; | 3697 | return totalpages; |
3686 | } | 3698 | } |
3687 | 3699 | ||
3688 | /* | 3700 | /* |
3689 | * Find the PFN the Movable zone begins in each node. Kernel memory | 3701 | * Find the PFN the Movable zone begins in each node. Kernel memory |
3690 | * is spread evenly between nodes as long as the nodes have enough | 3702 | * is spread evenly between nodes as long as the nodes have enough |
3691 | * memory. When they don't, some nodes will have more kernelcore than | 3703 | * memory. When they don't, some nodes will have more kernelcore than |
3692 | * others | 3704 | * others |
3693 | */ | 3705 | */ |
3694 | void __init find_zone_movable_pfns_for_nodes(unsigned long *movable_pfn) | 3706 | void __init find_zone_movable_pfns_for_nodes(unsigned long *movable_pfn) |
3695 | { | 3707 | { |
3696 | int i, nid; | 3708 | int i, nid; |
3697 | unsigned long usable_startpfn; | 3709 | unsigned long usable_startpfn; |
3698 | unsigned long kernelcore_node, kernelcore_remaining; | 3710 | unsigned long kernelcore_node, kernelcore_remaining; |
3699 | unsigned long totalpages = early_calculate_totalpages(); | 3711 | unsigned long totalpages = early_calculate_totalpages(); |
3700 | int usable_nodes = nodes_weight(node_states[N_HIGH_MEMORY]); | 3712 | int usable_nodes = nodes_weight(node_states[N_HIGH_MEMORY]); |
3701 | 3713 | ||
3702 | /* | 3714 | /* |
3703 | * If movablecore was specified, calculate what size of | 3715 | * If movablecore was specified, calculate what size of |
3704 | * kernelcore that corresponds so that memory usable for | 3716 | * kernelcore that corresponds so that memory usable for |
3705 | * any allocation type is evenly spread. If both kernelcore | 3717 | * any allocation type is evenly spread. If both kernelcore |
3706 | * and movablecore are specified, then the value of kernelcore | 3718 | * and movablecore are specified, then the value of kernelcore |
3707 | * will be used for required_kernelcore if it's greater than | 3719 | * will be used for required_kernelcore if it's greater than |
3708 | * what movablecore would have allowed. | 3720 | * what movablecore would have allowed. |
3709 | */ | 3721 | */ |
3710 | if (required_movablecore) { | 3722 | if (required_movablecore) { |
3711 | unsigned long corepages; | 3723 | unsigned long corepages; |
3712 | 3724 | ||
3713 | /* | 3725 | /* |
3714 | * Round-up so that ZONE_MOVABLE is at least as large as what | 3726 | * Round-up so that ZONE_MOVABLE is at least as large as what |
3715 | * was requested by the user | 3727 | * was requested by the user |
3716 | */ | 3728 | */ |
3717 | required_movablecore = | 3729 | required_movablecore = |
3718 | roundup(required_movablecore, MAX_ORDER_NR_PAGES); | 3730 | roundup(required_movablecore, MAX_ORDER_NR_PAGES); |
3719 | corepages = totalpages - required_movablecore; | 3731 | corepages = totalpages - required_movablecore; |
3720 | 3732 | ||
3721 | required_kernelcore = max(required_kernelcore, corepages); | 3733 | required_kernelcore = max(required_kernelcore, corepages); |
3722 | } | 3734 | } |
3723 | 3735 | ||
3724 | /* If kernelcore was not specified, there is no ZONE_MOVABLE */ | 3736 | /* If kernelcore was not specified, there is no ZONE_MOVABLE */ |
3725 | if (!required_kernelcore) | 3737 | if (!required_kernelcore) |
3726 | return; | 3738 | return; |
3727 | 3739 | ||
3728 | /* usable_startpfn is the lowest possible pfn ZONE_MOVABLE can be at */ | 3740 | /* usable_startpfn is the lowest possible pfn ZONE_MOVABLE can be at */ |
3729 | find_usable_zone_for_movable(); | 3741 | find_usable_zone_for_movable(); |
3730 | usable_startpfn = arch_zone_lowest_possible_pfn[movable_zone]; | 3742 | usable_startpfn = arch_zone_lowest_possible_pfn[movable_zone]; |
3731 | 3743 | ||
3732 | restart: | 3744 | restart: |
3733 | /* Spread kernelcore memory as evenly as possible throughout nodes */ | 3745 | /* Spread kernelcore memory as evenly as possible throughout nodes */ |
3734 | kernelcore_node = required_kernelcore / usable_nodes; | 3746 | kernelcore_node = required_kernelcore / usable_nodes; |
3735 | for_each_node_state(nid, N_HIGH_MEMORY) { | 3747 | for_each_node_state(nid, N_HIGH_MEMORY) { |
3736 | /* | 3748 | /* |
3737 | * Recalculate kernelcore_node if the division per node | 3749 | * Recalculate kernelcore_node if the division per node |
3738 | * now exceeds what is necessary to satisfy the requested | 3750 | * now exceeds what is necessary to satisfy the requested |
3739 | * amount of memory for the kernel | 3751 | * amount of memory for the kernel |
3740 | */ | 3752 | */ |
3741 | if (required_kernelcore < kernelcore_node) | 3753 | if (required_kernelcore < kernelcore_node) |
3742 | kernelcore_node = required_kernelcore / usable_nodes; | 3754 | kernelcore_node = required_kernelcore / usable_nodes; |
3743 | 3755 | ||
3744 | /* | 3756 | /* |
3745 | * As the map is walked, we track how much memory is usable | 3757 | * As the map is walked, we track how much memory is usable |
3746 | * by the kernel using kernelcore_remaining. When it is | 3758 | * by the kernel using kernelcore_remaining. When it is |
3747 | * 0, the rest of the node is usable by ZONE_MOVABLE | 3759 | * 0, the rest of the node is usable by ZONE_MOVABLE |
3748 | */ | 3760 | */ |
3749 | kernelcore_remaining = kernelcore_node; | 3761 | kernelcore_remaining = kernelcore_node; |
3750 | 3762 | ||
3751 | /* Go through each range of PFNs within this node */ | 3763 | /* Go through each range of PFNs within this node */ |
3752 | for_each_active_range_index_in_nid(i, nid) { | 3764 | for_each_active_range_index_in_nid(i, nid) { |
3753 | unsigned long start_pfn, end_pfn; | 3765 | unsigned long start_pfn, end_pfn; |
3754 | unsigned long size_pages; | 3766 | unsigned long size_pages; |
3755 | 3767 | ||
3756 | start_pfn = max(early_node_map[i].start_pfn, | 3768 | start_pfn = max(early_node_map[i].start_pfn, |
3757 | zone_movable_pfn[nid]); | 3769 | zone_movable_pfn[nid]); |
3758 | end_pfn = early_node_map[i].end_pfn; | 3770 | end_pfn = early_node_map[i].end_pfn; |
3759 | if (start_pfn >= end_pfn) | 3771 | if (start_pfn >= end_pfn) |
3760 | continue; | 3772 | continue; |
3761 | 3773 | ||
3762 | /* Account for what is only usable for kernelcore */ | 3774 | /* Account for what is only usable for kernelcore */ |
3763 | if (start_pfn < usable_startpfn) { | 3775 | if (start_pfn < usable_startpfn) { |
3764 | unsigned long kernel_pages; | 3776 | unsigned long kernel_pages; |
3765 | kernel_pages = min(end_pfn, usable_startpfn) | 3777 | kernel_pages = min(end_pfn, usable_startpfn) |
3766 | - start_pfn; | 3778 | - start_pfn; |
3767 | 3779 | ||
3768 | kernelcore_remaining -= min(kernel_pages, | 3780 | kernelcore_remaining -= min(kernel_pages, |
3769 | kernelcore_remaining); | 3781 | kernelcore_remaining); |
3770 | required_kernelcore -= min(kernel_pages, | 3782 | required_kernelcore -= min(kernel_pages, |
3771 | required_kernelcore); | 3783 | required_kernelcore); |
3772 | 3784 | ||
3773 | /* Continue if range is now fully accounted */ | 3785 | /* Continue if range is now fully accounted */ |
3774 | if (end_pfn <= usable_startpfn) { | 3786 | if (end_pfn <= usable_startpfn) { |
3775 | 3787 | ||
3776 | /* | 3788 | /* |
3777 | * Push zone_movable_pfn to the end so | 3789 | * Push zone_movable_pfn to the end so |
3778 | * that if we have to rebalance | 3790 | * that if we have to rebalance |
3779 | * kernelcore across nodes, we will | 3791 | * kernelcore across nodes, we will |
3780 | * not double account here | 3792 | * not double account here |
3781 | */ | 3793 | */ |
3782 | zone_movable_pfn[nid] = end_pfn; | 3794 | zone_movable_pfn[nid] = end_pfn; |
3783 | continue; | 3795 | continue; |
3784 | } | 3796 | } |
3785 | start_pfn = usable_startpfn; | 3797 | start_pfn = usable_startpfn; |
3786 | } | 3798 | } |
3787 | 3799 | ||
3788 | /* | 3800 | /* |
3789 | * The usable PFN range for ZONE_MOVABLE is from | 3801 | * The usable PFN range for ZONE_MOVABLE is from |
3790 | * start_pfn->end_pfn. Calculate size_pages as the | 3802 | * start_pfn->end_pfn. Calculate size_pages as the |
3791 | * number of pages used as kernelcore | 3803 | * number of pages used as kernelcore |
3792 | */ | 3804 | */ |
3793 | size_pages = end_pfn - start_pfn; | 3805 | size_pages = end_pfn - start_pfn; |
3794 | if (size_pages > kernelcore_remaining) | 3806 | if (size_pages > kernelcore_remaining) |
3795 | size_pages = kernelcore_remaining; | 3807 | size_pages = kernelcore_remaining; |
3796 | zone_movable_pfn[nid] = start_pfn + size_pages; | 3808 | zone_movable_pfn[nid] = start_pfn + size_pages; |
3797 | 3809 | ||
3798 | /* | 3810 | /* |
3799 | * Some kernelcore has been met, update counts and | 3811 | * Some kernelcore has been met, update counts and |
3800 | * break if the kernelcore for this node has been | 3812 | * break if the kernelcore for this node has been |
3801 | * satisfied | 3813 | * satisfied |
3802 | */ | 3814 | */ |
3803 | required_kernelcore -= min(required_kernelcore, | 3815 | required_kernelcore -= min(required_kernelcore, |
3804 | size_pages); | 3816 | size_pages); |
3805 | kernelcore_remaining -= size_pages; | 3817 | kernelcore_remaining -= size_pages; |
3806 | if (!kernelcore_remaining) | 3818 | if (!kernelcore_remaining) |
3807 | break; | 3819 | break; |
3808 | } | 3820 | } |
3809 | } | 3821 | } |
3810 | 3822 | ||
3811 | /* | 3823 | /* |
3812 | * If there is still required_kernelcore, we do another pass with one | 3824 | * If there is still required_kernelcore, we do another pass with one |
3813 | * less node in the count. This will push zone_movable_pfn[nid] further | 3825 | * less node in the count. This will push zone_movable_pfn[nid] further |
3814 | * along on the nodes that still have memory until kernelcore is | 3826 | * along on the nodes that still have memory until kernelcore is |
3815 | * satisfied | 3827 | * satisfied |
3816 | */ | 3828 | */ |
3817 | usable_nodes--; | 3829 | usable_nodes--; |
3818 | if (usable_nodes && required_kernelcore > usable_nodes) | 3830 | if (usable_nodes && required_kernelcore > usable_nodes) |
3819 | goto restart; | 3831 | goto restart; |
3820 | 3832 | ||
3821 | /* Align start of ZONE_MOVABLE on all nids to MAX_ORDER_NR_PAGES */ | 3833 | /* Align start of ZONE_MOVABLE on all nids to MAX_ORDER_NR_PAGES */ |
3822 | for (nid = 0; nid < MAX_NUMNODES; nid++) | 3834 | for (nid = 0; nid < MAX_NUMNODES; nid++) |
3823 | zone_movable_pfn[nid] = | 3835 | zone_movable_pfn[nid] = |
3824 | roundup(zone_movable_pfn[nid], MAX_ORDER_NR_PAGES); | 3836 | roundup(zone_movable_pfn[nid], MAX_ORDER_NR_PAGES); |
3825 | } | 3837 | } |
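The even-split-with-restart behaviour above can be illustrated with a toy model. The node sizes and the kernelcore request below are hypothetical, and the model deliberately ignores PFN ranges and ZONE_MOVABLE alignment; it only shows how a node that is too small to absorb its share pushes the remainder onto the other nodes in a later pass:

#include <stdio.h>

/* Toy model of the even-split-with-restart loop above. Node sizes and the
 * kernelcore request are hypothetical; a node too small for its share
 * forces another pass over the remaining nodes with a smaller divisor. */
int main(void)
{
	unsigned long node_pages[] = { 100, 400, 4000 };  /* hypothetical node sizes (pages) */
	unsigned long kernelcore_taken[3] = { 0, 0, 0 };  /* kernelcore placed on each node */
	unsigned long required_kernelcore = 900;          /* hypothetical request (pages) */
	int usable_nodes = 3;

restart:
	{
		unsigned long kernelcore_node = required_kernelcore / usable_nodes;

		for (int nid = 0; nid < 3; nid++) {
			unsigned long take = node_pages[nid] - kernelcore_taken[nid];

			if (take > kernelcore_node)
				take = kernelcore_node;
			kernelcore_taken[nid] += take;
			required_kernelcore -= (take < required_kernelcore ? take : required_kernelcore);
		}
	}
	/* Mirror the retry condition: try again with one node fewer in the split */
	usable_nodes--;
	if (usable_nodes && required_kernelcore > (unsigned long)usable_nodes)
		goto restart;

	for (int nid = 0; nid < 3; nid++)
		printf("node %d: %lu of %lu pages kept as kernelcore\n",
		       nid, kernelcore_taken[nid], node_pages[nid]);
	return 0;
}

With these numbers the first pass leaves 200 pages unplaced (node 0 could only take 100 of its 300-page share), and the second pass spreads that remainder over the two larger nodes.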
3826 | 3838 | ||
3827 | /* Any regular memory on that node? */ | 3839 | /* Any regular memory on that node? */ |
3828 | static void check_for_regular_memory(pg_data_t *pgdat) | 3840 | static void check_for_regular_memory(pg_data_t *pgdat) |
3829 | { | 3841 | { |
3830 | #ifdef CONFIG_HIGHMEM | 3842 | #ifdef CONFIG_HIGHMEM |
3831 | enum zone_type zone_type; | 3843 | enum zone_type zone_type; |
3832 | 3844 | ||
3833 | for (zone_type = 0; zone_type <= ZONE_NORMAL; zone_type++) { | 3845 | for (zone_type = 0; zone_type <= ZONE_NORMAL; zone_type++) { |
3834 | struct zone *zone = &pgdat->node_zones[zone_type]; | 3846 | struct zone *zone = &pgdat->node_zones[zone_type]; |
3835 | if (zone->present_pages) | 3847 | if (zone->present_pages) |
3836 | node_set_state(zone_to_nid(zone), N_NORMAL_MEMORY); | 3848 | node_set_state(zone_to_nid(zone), N_NORMAL_MEMORY); |
3837 | } | 3849 | } |
3838 | #endif | 3850 | #endif |
3839 | } | 3851 | } |
3840 | 3852 | ||
3841 | /** | 3853 | /** |
3842 | * free_area_init_nodes - Initialise all pg_data_t and zone data | 3854 | * free_area_init_nodes - Initialise all pg_data_t and zone data |
3843 | * @max_zone_pfn: an array of max PFNs for each zone | 3855 | * @max_zone_pfn: an array of max PFNs for each zone |
3844 | * | 3856 | * |
3845 | * This will call free_area_init_node() for each active node in the system. | 3857 | * This will call free_area_init_node() for each active node in the system. |
3846 | * Using the page ranges provided by add_active_range(), the size of each | 3858 | * Using the page ranges provided by add_active_range(), the size of each |
3847 | * zone in each node and their holes is calculated. If the maximum PFN | 3859 | * zone in each node and their holes is calculated. If the maximum PFN |
3848 | * between two adjacent zones matches, it is assumed that the zone is empty. | 3860 | * between two adjacent zones matches, it is assumed that the zone is empty. |
3849 | * For example, if arch_max_dma_pfn == arch_max_dma32_pfn, it is assumed | 3861 | * For example, if arch_max_dma_pfn == arch_max_dma32_pfn, it is assumed |
3850 | * that arch_max_dma32_pfn has no pages. It is also assumed that a zone | 3862 | * that arch_max_dma32_pfn has no pages. It is also assumed that a zone |
3851 | * starts where the previous one ended. For example, ZONE_DMA32 starts | 3863 | * starts where the previous one ended. For example, ZONE_DMA32 starts |
3852 | * at arch_max_dma_pfn. | 3864 | * at arch_max_dma_pfn. |
3853 | */ | 3865 | */ |
3854 | void __init free_area_init_nodes(unsigned long *max_zone_pfn) | 3866 | void __init free_area_init_nodes(unsigned long *max_zone_pfn) |
3855 | { | 3867 | { |
3856 | unsigned long nid; | 3868 | unsigned long nid; |
3857 | enum zone_type i; | 3869 | enum zone_type i; |
3858 | 3870 | ||
3859 | /* Sort early_node_map as initialisation assumes it is sorted */ | 3871 | /* Sort early_node_map as initialisation assumes it is sorted */ |
3860 | sort_node_map(); | 3872 | sort_node_map(); |
3861 | 3873 | ||
3862 | /* Record where the zone boundaries are */ | 3874 | /* Record where the zone boundaries are */ |
3863 | memset(arch_zone_lowest_possible_pfn, 0, | 3875 | memset(arch_zone_lowest_possible_pfn, 0, |
3864 | sizeof(arch_zone_lowest_possible_pfn)); | 3876 | sizeof(arch_zone_lowest_possible_pfn)); |
3865 | memset(arch_zone_highest_possible_pfn, 0, | 3877 | memset(arch_zone_highest_possible_pfn, 0, |
3866 | sizeof(arch_zone_highest_possible_pfn)); | 3878 | sizeof(arch_zone_highest_possible_pfn)); |
3867 | arch_zone_lowest_possible_pfn[0] = find_min_pfn_with_active_regions(); | 3879 | arch_zone_lowest_possible_pfn[0] = find_min_pfn_with_active_regions(); |
3868 | arch_zone_highest_possible_pfn[0] = max_zone_pfn[0]; | 3880 | arch_zone_highest_possible_pfn[0] = max_zone_pfn[0]; |
3869 | for (i = 1; i < MAX_NR_ZONES; i++) { | 3881 | for (i = 1; i < MAX_NR_ZONES; i++) { |
3870 | if (i == ZONE_MOVABLE) | 3882 | if (i == ZONE_MOVABLE) |
3871 | continue; | 3883 | continue; |
3872 | arch_zone_lowest_possible_pfn[i] = | 3884 | arch_zone_lowest_possible_pfn[i] = |
3873 | arch_zone_highest_possible_pfn[i-1]; | 3885 | arch_zone_highest_possible_pfn[i-1]; |
3874 | arch_zone_highest_possible_pfn[i] = | 3886 | arch_zone_highest_possible_pfn[i] = |
3875 | max(max_zone_pfn[i], arch_zone_lowest_possible_pfn[i]); | 3887 | max(max_zone_pfn[i], arch_zone_lowest_possible_pfn[i]); |
3876 | } | 3888 | } |
3877 | arch_zone_lowest_possible_pfn[ZONE_MOVABLE] = 0; | 3889 | arch_zone_lowest_possible_pfn[ZONE_MOVABLE] = 0; |
3878 | arch_zone_highest_possible_pfn[ZONE_MOVABLE] = 0; | 3890 | arch_zone_highest_possible_pfn[ZONE_MOVABLE] = 0; |
3879 | 3891 | ||
3880 | /* Find the PFNs that ZONE_MOVABLE begins at in each node */ | 3892 | /* Find the PFNs that ZONE_MOVABLE begins at in each node */ |
3881 | memset(zone_movable_pfn, 0, sizeof(zone_movable_pfn)); | 3893 | memset(zone_movable_pfn, 0, sizeof(zone_movable_pfn)); |
3882 | find_zone_movable_pfns_for_nodes(zone_movable_pfn); | 3894 | find_zone_movable_pfns_for_nodes(zone_movable_pfn); |
3883 | 3895 | ||
3884 | /* Print out the zone ranges */ | 3896 | /* Print out the zone ranges */ |
3885 | printk("Zone PFN ranges:\n"); | 3897 | printk("Zone PFN ranges:\n"); |
3886 | for (i = 0; i < MAX_NR_ZONES; i++) { | 3898 | for (i = 0; i < MAX_NR_ZONES; i++) { |
3887 | if (i == ZONE_MOVABLE) | 3899 | if (i == ZONE_MOVABLE) |
3888 | continue; | 3900 | continue; |
3889 | printk(" %-8s %8lu -> %8lu\n", | 3901 | printk(" %-8s %8lu -> %8lu\n", |
3890 | zone_names[i], | 3902 | zone_names[i], |
3891 | arch_zone_lowest_possible_pfn[i], | 3903 | arch_zone_lowest_possible_pfn[i], |
3892 | arch_zone_highest_possible_pfn[i]); | 3904 | arch_zone_highest_possible_pfn[i]); |
3893 | } | 3905 | } |
3894 | 3906 | ||
3895 | /* Print out the PFNs ZONE_MOVABLE begins at in each node */ | 3907 | /* Print out the PFNs ZONE_MOVABLE begins at in each node */ |
3896 | printk("Movable zone start PFN for each node\n"); | 3908 | printk("Movable zone start PFN for each node\n"); |
3897 | for (i = 0; i < MAX_NUMNODES; i++) { | 3909 | for (i = 0; i < MAX_NUMNODES; i++) { |
3898 | if (zone_movable_pfn[i]) | 3910 | if (zone_movable_pfn[i]) |
3899 | printk(" Node %d: %lu\n", i, zone_movable_pfn[i]); | 3911 | printk(" Node %d: %lu\n", i, zone_movable_pfn[i]); |
3900 | } | 3912 | } |
3901 | 3913 | ||
3902 | /* Print out the early_node_map[] */ | 3914 | /* Print out the early_node_map[] */ |
3903 | printk("early_node_map[%d] active PFN ranges\n", nr_nodemap_entries); | 3915 | printk("early_node_map[%d] active PFN ranges\n", nr_nodemap_entries); |
3904 | for (i = 0; i < nr_nodemap_entries; i++) | 3916 | for (i = 0; i < nr_nodemap_entries; i++) |
3905 | printk(" %3d: %8lu -> %8lu\n", early_node_map[i].nid, | 3917 | printk(" %3d: %8lu -> %8lu\n", early_node_map[i].nid, |
3906 | early_node_map[i].start_pfn, | 3918 | early_node_map[i].start_pfn, |
3907 | early_node_map[i].end_pfn); | 3919 | early_node_map[i].end_pfn); |
3908 | 3920 | ||
3909 | /* Initialise every node */ | 3921 | /* Initialise every node */ |
3910 | setup_nr_node_ids(); | 3922 | setup_nr_node_ids(); |
3911 | for_each_online_node(nid) { | 3923 | for_each_online_node(nid) { |
3912 | pg_data_t *pgdat = NODE_DATA(nid); | 3924 | pg_data_t *pgdat = NODE_DATA(nid); |
3913 | free_area_init_node(nid, pgdat, NULL, | 3925 | free_area_init_node(nid, pgdat, NULL, |
3914 | find_min_pfn_for_node(nid), NULL); | 3926 | find_min_pfn_for_node(nid), NULL); |
3915 | 3927 | ||
3916 | /* Any memory on that node */ | 3928 | /* Any memory on that node */ |
3917 | if (pgdat->node_present_pages) | 3929 | if (pgdat->node_present_pages) |
3918 | node_set_state(nid, N_HIGH_MEMORY); | 3930 | node_set_state(nid, N_HIGH_MEMORY); |
3919 | check_for_regular_memory(pgdat); | 3931 | check_for_regular_memory(pgdat); |
3920 | } | 3932 | } |
3921 | } | 3933 | } |
3922 | 3934 | ||
3923 | static int __init cmdline_parse_core(char *p, unsigned long *core) | 3935 | static int __init cmdline_parse_core(char *p, unsigned long *core) |
3924 | { | 3936 | { |
3925 | unsigned long long coremem; | 3937 | unsigned long long coremem; |
3926 | if (!p) | 3938 | if (!p) |
3927 | return -EINVAL; | 3939 | return -EINVAL; |
3928 | 3940 | ||
3929 | coremem = memparse(p, &p); | 3941 | coremem = memparse(p, &p); |
3930 | *core = coremem >> PAGE_SHIFT; | 3942 | *core = coremem >> PAGE_SHIFT; |
3931 | 3943 | ||
3932 | /* Paranoid check that UL is enough for the coremem value */ | 3944 | /* Paranoid check that UL is enough for the coremem value */ |
3933 | WARN_ON((coremem >> PAGE_SHIFT) > ULONG_MAX); | 3945 | WARN_ON((coremem >> PAGE_SHIFT) > ULONG_MAX); |
3934 | 3946 | ||
3935 | return 0; | 3947 | return 0; |
3936 | } | 3948 | } |
3937 | 3949 | ||
3938 | /* | 3950 | /* |
3939 | * kernelcore=size sets the amount of memory for use for allocations that | 3951 | * kernelcore=size sets the amount of memory for use for allocations that |
3940 | * cannot be reclaimed or migrated. | 3952 | * cannot be reclaimed or migrated. |
3941 | */ | 3953 | */ |
3942 | static int __init cmdline_parse_kernelcore(char *p) | 3954 | static int __init cmdline_parse_kernelcore(char *p) |
3943 | { | 3955 | { |
3944 | return cmdline_parse_core(p, &required_kernelcore); | 3956 | return cmdline_parse_core(p, &required_kernelcore); |
3945 | } | 3957 | } |
3946 | 3958 | ||
3947 | /* | 3959 | /* |
3948 | * movablecore=size sets the amount of memory for use for allocations that | 3960 | * movablecore=size sets the amount of memory for use for allocations that |
3949 | * can be reclaimed or migrated. | 3961 | * can be reclaimed or migrated. |
3950 | */ | 3962 | */ |
3951 | static int __init cmdline_parse_movablecore(char *p) | 3963 | static int __init cmdline_parse_movablecore(char *p) |
3952 | { | 3964 | { |
3953 | return cmdline_parse_core(p, &required_movablecore); | 3965 | return cmdline_parse_core(p, &required_movablecore); |
3954 | } | 3966 | } |
3955 | 3967 | ||
3956 | early_param("kernelcore", cmdline_parse_kernelcore); | 3968 | early_param("kernelcore", cmdline_parse_kernelcore); |
3957 | early_param("movablecore", cmdline_parse_movablecore); | 3969 | early_param("movablecore", cmdline_parse_movablecore); |
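To make the unit conversion concrete: a boot parameter such as kernelcore=512M is parsed into bytes and then shifted down to pages. The sketch below approximates memparse() in userspace; the suffix handling and the 4 KiB page size are assumptions for illustration, not the kernel's implementation:

#include <stdio.h>
#include <stdlib.h>

#define PAGE_SHIFT 12	/* assumed 4 KiB pages */

/* Rough userspace approximation of memparse(): a number with an
 * optional K/M/G suffix, returned in bytes. */
static unsigned long long memparse_sketch(const char *s)
{
	char *end;
	unsigned long long v = strtoull(s, &end, 0);

	switch (*end) {
	case 'G': case 'g': v <<= 30; break;
	case 'M': case 'm': v <<= 20; break;
	case 'K': case 'k': v <<= 10; break;
	}
	return v;
}

int main(void)
{
	const char *arg = "512M";	/* e.g. booted with kernelcore=512M */
	unsigned long long bytes = memparse_sketch(arg);

	/* 512 MiB of 4 KiB pages -> 131072 pages requested as kernelcore */
	printf("kernelcore=%s -> %llu pages\n", arg, bytes >> PAGE_SHIFT);
	return 0;
}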
3958 | 3970 | ||
3959 | #endif /* CONFIG_ARCH_POPULATES_NODE_MAP */ | 3971 | #endif /* CONFIG_ARCH_POPULATES_NODE_MAP */ |
3960 | 3972 | ||
3961 | /** | 3973 | /** |
3962 | * set_dma_reserve - set the specified number of pages reserved in the first zone | 3974 | * set_dma_reserve - set the specified number of pages reserved in the first zone |
3963 | * @new_dma_reserve: The number of pages to mark reserved | 3975 | * @new_dma_reserve: The number of pages to mark reserved |
3964 | * | 3976 | * |
3965 | * The per-cpu batchsize and zone watermarks are determined by present_pages. | 3977 | * The per-cpu batchsize and zone watermarks are determined by present_pages. |
3966 | * In the DMA zone, a significant percentage may be consumed by kernel image | 3978 | * In the DMA zone, a significant percentage may be consumed by kernel image |
3967 | * and other unfreeable allocations which can skew the watermarks badly. This | 3979 | * and other unfreeable allocations which can skew the watermarks badly. This |
3968 | * function may optionally be used to account for unfreeable pages in the | 3980 | * function may optionally be used to account for unfreeable pages in the |
3969 | * first zone (e.g., ZONE_DMA). The effect will be lower watermarks and | 3981 | * first zone (e.g., ZONE_DMA). The effect will be lower watermarks and |
3970 | * smaller per-cpu batchsize. | 3982 | * smaller per-cpu batchsize. |
3971 | */ | 3983 | */ |
3972 | void __init set_dma_reserve(unsigned long new_dma_reserve) | 3984 | void __init set_dma_reserve(unsigned long new_dma_reserve) |
3973 | { | 3985 | { |
3974 | dma_reserve = new_dma_reserve; | 3986 | dma_reserve = new_dma_reserve; |
3975 | } | 3987 | } |
3976 | 3988 | ||
3977 | #ifndef CONFIG_NEED_MULTIPLE_NODES | 3989 | #ifndef CONFIG_NEED_MULTIPLE_NODES |
3978 | static bootmem_data_t contig_bootmem_data; | 3990 | static bootmem_data_t contig_bootmem_data; |
3979 | struct pglist_data contig_page_data = { .bdata = &contig_bootmem_data }; | 3991 | struct pglist_data contig_page_data = { .bdata = &contig_bootmem_data }; |
3980 | 3992 | ||
3981 | EXPORT_SYMBOL(contig_page_data); | 3993 | EXPORT_SYMBOL(contig_page_data); |
3982 | #endif | 3994 | #endif |
3983 | 3995 | ||
3984 | void __init free_area_init(unsigned long *zones_size) | 3996 | void __init free_area_init(unsigned long *zones_size) |
3985 | { | 3997 | { |
3986 | free_area_init_node(0, NODE_DATA(0), zones_size, | 3998 | free_area_init_node(0, NODE_DATA(0), zones_size, |
3987 | __pa(PAGE_OFFSET) >> PAGE_SHIFT, NULL); | 3999 | __pa(PAGE_OFFSET) >> PAGE_SHIFT, NULL); |
3988 | } | 4000 | } |
3989 | 4001 | ||
3990 | static int page_alloc_cpu_notify(struct notifier_block *self, | 4002 | static int page_alloc_cpu_notify(struct notifier_block *self, |
3991 | unsigned long action, void *hcpu) | 4003 | unsigned long action, void *hcpu) |
3992 | { | 4004 | { |
3993 | int cpu = (unsigned long)hcpu; | 4005 | int cpu = (unsigned long)hcpu; |
3994 | 4006 | ||
3995 | if (action == CPU_DEAD || action == CPU_DEAD_FROZEN) { | 4007 | if (action == CPU_DEAD || action == CPU_DEAD_FROZEN) { |
3996 | drain_pages(cpu); | 4008 | drain_pages(cpu); |
3997 | 4009 | ||
3998 | /* | 4010 | /* |
3999 | * Spill the event counters of the dead processor | 4011 | * Spill the event counters of the dead processor |
4000 | * into the current processor's event counters. | 4012 | * into the current processor's event counters. |
4001 | * This artificially elevates the count of the current | 4013 | * This artificially elevates the count of the current |
4002 | * processor. | 4014 | * processor. |
4003 | */ | 4015 | */ |
4004 | vm_events_fold_cpu(cpu); | 4016 | vm_events_fold_cpu(cpu); |
4005 | 4017 | ||
4006 | /* | 4018 | /* |
4007 | * Zero the differential counters of the dead processor | 4019 | * Zero the differential counters of the dead processor |
4008 | * so that the vm statistics are consistent. | 4020 | * so that the vm statistics are consistent. |
4009 | * | 4021 | * |
4010 | * This is only okay since the processor is dead and cannot | 4022 | * This is only okay since the processor is dead and cannot |
4011 | * race with what we are doing. | 4023 | * race with what we are doing. |
4012 | */ | 4024 | */ |
4013 | refresh_cpu_vm_stats(cpu); | 4025 | refresh_cpu_vm_stats(cpu); |
4014 | } | 4026 | } |
4015 | return NOTIFY_OK; | 4027 | return NOTIFY_OK; |
4016 | } | 4028 | } |
4017 | 4029 | ||
4018 | void __init page_alloc_init(void) | 4030 | void __init page_alloc_init(void) |
4019 | { | 4031 | { |
4020 | hotcpu_notifier(page_alloc_cpu_notify, 0); | 4032 | hotcpu_notifier(page_alloc_cpu_notify, 0); |
4021 | } | 4033 | } |
4022 | 4034 | ||
4023 | /* | 4035 | /* |
4024 | * calculate_totalreserve_pages - called when sysctl_lower_zone_reserve_ratio | 4036 | * calculate_totalreserve_pages - called when sysctl_lower_zone_reserve_ratio |
4025 | * or min_free_kbytes changes. | 4037 | * or min_free_kbytes changes. |
4026 | */ | 4038 | */ |
4027 | static void calculate_totalreserve_pages(void) | 4039 | static void calculate_totalreserve_pages(void) |
4028 | { | 4040 | { |
4029 | struct pglist_data *pgdat; | 4041 | struct pglist_data *pgdat; |
4030 | unsigned long reserve_pages = 0; | 4042 | unsigned long reserve_pages = 0; |
4031 | enum zone_type i, j; | 4043 | enum zone_type i, j; |
4032 | 4044 | ||
4033 | for_each_online_pgdat(pgdat) { | 4045 | for_each_online_pgdat(pgdat) { |
4034 | for (i = 0; i < MAX_NR_ZONES; i++) { | 4046 | for (i = 0; i < MAX_NR_ZONES; i++) { |
4035 | struct zone *zone = pgdat->node_zones + i; | 4047 | struct zone *zone = pgdat->node_zones + i; |
4036 | unsigned long max = 0; | 4048 | unsigned long max = 0; |
4037 | 4049 | ||
4038 | /* Find valid and maximum lowmem_reserve in the zone */ | 4050 | /* Find valid and maximum lowmem_reserve in the zone */ |
4039 | for (j = i; j < MAX_NR_ZONES; j++) { | 4051 | for (j = i; j < MAX_NR_ZONES; j++) { |
4040 | if (zone->lowmem_reserve[j] > max) | 4052 | if (zone->lowmem_reserve[j] > max) |
4041 | max = zone->lowmem_reserve[j]; | 4053 | max = zone->lowmem_reserve[j]; |
4042 | } | 4054 | } |
4043 | 4055 | ||
4044 | /* we treat pages_high as reserved pages. */ | 4056 | /* we treat pages_high as reserved pages. */ |
4045 | max += zone->pages_high; | 4057 | max += zone->pages_high; |
4046 | 4058 | ||
4047 | if (max > zone->present_pages) | 4059 | if (max > zone->present_pages) |
4048 | max = zone->present_pages; | 4060 | max = zone->present_pages; |
4049 | reserve_pages += max; | 4061 | reserve_pages += max; |
4050 | } | 4062 | } |
4051 | } | 4063 | } |
4052 | totalreserve_pages = reserve_pages; | 4064 | totalreserve_pages = reserve_pages; |
4053 | } | 4065 | } |
4054 | 4066 | ||
4055 | /* | 4067 | /* |
4056 | * setup_per_zone_lowmem_reserve - called whenever | 4068 | * setup_per_zone_lowmem_reserve - called whenever |
4057 | * sysctl_lower_zone_reserve_ratio changes. Ensures that each zone | 4069 | * sysctl_lower_zone_reserve_ratio changes. Ensures that each zone |
4058 | * has a correct pages reserved value, so an adequate number of | 4070 | * has a correct pages reserved value, so an adequate number of |
4059 | * pages are left in the zone after a successful __alloc_pages(). | 4071 | * pages are left in the zone after a successful __alloc_pages(). |
4060 | */ | 4072 | */ |
4061 | static void setup_per_zone_lowmem_reserve(void) | 4073 | static void setup_per_zone_lowmem_reserve(void) |
4062 | { | 4074 | { |
4063 | struct pglist_data *pgdat; | 4075 | struct pglist_data *pgdat; |
4064 | enum zone_type j, idx; | 4076 | enum zone_type j, idx; |
4065 | 4077 | ||
4066 | for_each_online_pgdat(pgdat) { | 4078 | for_each_online_pgdat(pgdat) { |
4067 | for (j = 0; j < MAX_NR_ZONES; j++) { | 4079 | for (j = 0; j < MAX_NR_ZONES; j++) { |
4068 | struct zone *zone = pgdat->node_zones + j; | 4080 | struct zone *zone = pgdat->node_zones + j; |
4069 | unsigned long present_pages = zone->present_pages; | 4081 | unsigned long present_pages = zone->present_pages; |
4070 | 4082 | ||
4071 | zone->lowmem_reserve[j] = 0; | 4083 | zone->lowmem_reserve[j] = 0; |
4072 | 4084 | ||
4073 | idx = j; | 4085 | idx = j; |
4074 | while (idx) { | 4086 | while (idx) { |
4075 | struct zone *lower_zone; | 4087 | struct zone *lower_zone; |
4076 | 4088 | ||
4077 | idx--; | 4089 | idx--; |
4078 | 4090 | ||
4079 | if (sysctl_lowmem_reserve_ratio[idx] < 1) | 4091 | if (sysctl_lowmem_reserve_ratio[idx] < 1) |
4080 | sysctl_lowmem_reserve_ratio[idx] = 1; | 4092 | sysctl_lowmem_reserve_ratio[idx] = 1; |
4081 | 4093 | ||
4082 | lower_zone = pgdat->node_zones + idx; | 4094 | lower_zone = pgdat->node_zones + idx; |
4083 | lower_zone->lowmem_reserve[j] = present_pages / | 4095 | lower_zone->lowmem_reserve[j] = present_pages / |
4084 | sysctl_lowmem_reserve_ratio[idx]; | 4096 | sysctl_lowmem_reserve_ratio[idx]; |
4085 | present_pages += lower_zone->present_pages; | 4097 | present_pages += lower_zone->present_pages; |
4086 | } | 4098 | } |
4087 | } | 4099 | } |
4088 | } | 4100 | } |
4089 | 4101 | ||
4090 | /* update totalreserve_pages */ | 4102 | /* update totalreserve_pages */ |
4091 | calculate_totalreserve_pages(); | 4103 | calculate_totalreserve_pages(); |
4092 | } | 4104 | } |
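To make the accumulation above concrete, here is a userspace sketch of the same lowmem_reserve arithmetic with made-up zone sizes and the default reserve ratios (256 for DMA, 32 for NORMAL); struct zone is reduced to the two fields used. The resulting reserve is what a fallback allocation must leave free in a lower zone before the allocator will dip into it.

/* Userspace sketch only, not kernel code. */
#include <stdio.h>

#define MAX_NR_ZONES 3
enum { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM };

struct zone {
	unsigned long present_pages;
	unsigned long lowmem_reserve[MAX_NR_ZONES];
};

int main(void)
{
	struct zone zones[MAX_NR_ZONES] = {
		[ZONE_DMA]     = { .present_pages =   4000 },   /* assumed sizes */
		[ZONE_NORMAL]  = { .present_pages = 200000 },
		[ZONE_HIGHMEM] = { .present_pages = 300000 },
	};
	int ratio[MAX_NR_ZONES] = { 256, 32, 32 };	/* default ratios */
	int j, idx;

	for (j = 0; j < MAX_NR_ZONES; j++) {
		unsigned long present = zones[j].present_pages;

		zones[j].lowmem_reserve[j] = 0;
		/* walk down through the lower zones, accumulating size */
		for (idx = j - 1; idx >= 0; idx--) {
			zones[idx].lowmem_reserve[j] = present / ratio[idx];
			present += zones[idx].present_pages;
		}
	}

	for (j = 0; j < MAX_NR_ZONES; j++)
		printf("zone %d protects HIGHMEM allocations with %lu reserved pages\n",
		       j, zones[j].lowmem_reserve[ZONE_HIGHMEM]);
	return 0;
}

With these numbers, ZONE_DMA keeps (300000+200000)/256 = 1953 pages out of reach of HIGHMEM-capable requests that fall back to it, while ZONE_NORMAL keeps 300000/32 = 9375.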
4093 | 4105 | ||
4094 | /** | 4106 | /** |
4095 | * setup_per_zone_pages_min - called when min_free_kbytes changes. | 4107 | * setup_per_zone_pages_min - called when min_free_kbytes changes. |
4096 | * | 4108 | * |
4097 | * Ensures that the pages_{min,low,high} values for each zone are set correctly | 4109 | * Ensures that the pages_{min,low,high} values for each zone are set correctly |
4098 | * with respect to min_free_kbytes. | 4110 | * with respect to min_free_kbytes. |
4099 | */ | 4111 | */ |
4100 | void setup_per_zone_pages_min(void) | 4112 | void setup_per_zone_pages_min(void) |
4101 | { | 4113 | { |
4102 | unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10); | 4114 | unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10); |
4103 | unsigned long lowmem_pages = 0; | 4115 | unsigned long lowmem_pages = 0; |
4104 | struct zone *zone; | 4116 | struct zone *zone; |
4105 | unsigned long flags; | 4117 | unsigned long flags; |
4106 | 4118 | ||
4107 | /* Calculate total number of !ZONE_HIGHMEM pages */ | 4119 | /* Calculate total number of !ZONE_HIGHMEM pages */ |
4108 | for_each_zone(zone) { | 4120 | for_each_zone(zone) { |
4109 | if (!is_highmem(zone)) | 4121 | if (!is_highmem(zone)) |
4110 | lowmem_pages += zone->present_pages; | 4122 | lowmem_pages += zone->present_pages; |
4111 | } | 4123 | } |
4112 | 4124 | ||
4113 | for_each_zone(zone) { | 4125 | for_each_zone(zone) { |
4114 | u64 tmp; | 4126 | u64 tmp; |
4115 | 4127 | ||
4116 | spin_lock_irqsave(&zone->lru_lock, flags); | 4128 | spin_lock_irqsave(&zone->lru_lock, flags); |
4117 | tmp = (u64)pages_min * zone->present_pages; | 4129 | tmp = (u64)pages_min * zone->present_pages; |
4118 | do_div(tmp, lowmem_pages); | 4130 | do_div(tmp, lowmem_pages); |
4119 | if (is_highmem(zone)) { | 4131 | if (is_highmem(zone)) { |
4120 | /* | 4132 | /* |
4121 | * __GFP_HIGH and PF_MEMALLOC allocations usually don't | 4133 | * __GFP_HIGH and PF_MEMALLOC allocations usually don't |
4122 | * need highmem pages, so cap pages_min to a small | 4134 | * need highmem pages, so cap pages_min to a small |
4123 | * value here. | 4135 | * value here. |
4124 | * | 4136 | * |
4125 | * The (pages_high-pages_low) and (pages_low-pages_min) | 4137 | * The (pages_high-pages_low) and (pages_low-pages_min) |
4126 | * deltas control async page reclaim, and so should | 4138 | * deltas control async page reclaim, and so should |
4127 | * not be capped for highmem. | 4139 | * not be capped for highmem. |
4128 | */ | 4140 | */ |
4129 | int min_pages; | 4141 | int min_pages; |
4130 | 4142 | ||
4131 | min_pages = zone->present_pages / 1024; | 4143 | min_pages = zone->present_pages / 1024; |
4132 | if (min_pages < SWAP_CLUSTER_MAX) | 4144 | if (min_pages < SWAP_CLUSTER_MAX) |
4133 | min_pages = SWAP_CLUSTER_MAX; | 4145 | min_pages = SWAP_CLUSTER_MAX; |
4134 | if (min_pages > 128) | 4146 | if (min_pages > 128) |
4135 | min_pages = 128; | 4147 | min_pages = 128; |
4136 | zone->pages_min = min_pages; | 4148 | zone->pages_min = min_pages; |
4137 | } else { | 4149 | } else { |
4138 | /* | 4150 | /* |
4139 | * If it's a lowmem zone, reserve a number of pages | 4151 | * If it's a lowmem zone, reserve a number of pages |
4140 | * proportionate to the zone's size. | 4152 | * proportionate to the zone's size. |
4141 | */ | 4153 | */ |
4142 | zone->pages_min = tmp; | 4154 | zone->pages_min = tmp; |
4143 | } | 4155 | } |
4144 | 4156 | ||
4145 | zone->pages_low = zone->pages_min + (tmp >> 2); | 4157 | zone->pages_low = zone->pages_min + (tmp >> 2); |
4146 | zone->pages_high = zone->pages_min + (tmp >> 1); | 4158 | zone->pages_high = zone->pages_min + (tmp >> 1); |
4147 | setup_zone_migrate_reserve(zone); | 4159 | setup_zone_migrate_reserve(zone); |
4148 | spin_unlock_irqrestore(&zone->lru_lock, flags); | 4160 | spin_unlock_irqrestore(&zone->lru_lock, flags); |
4149 | } | 4161 | } |
4150 | 4162 | ||
4151 | /* update totalreserve_pages */ | 4163 | /* update totalreserve_pages */ |
4152 | calculate_totalreserve_pages(); | 4164 | calculate_totalreserve_pages(); |
4153 | } | 4165 | } |
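In short, every zone receives a share of pages_min proportional to its size, and pages_low and pages_high sit 25% and 50% of that share above it. A userspace sketch with assumed sizes (min_free_kbytes as for roughly 1GB of lowmem, 4K pages):

/* Userspace sketch of the watermark arithmetic; all figures are assumptions. */
#include <stdio.h>

#define PAGE_SHIFT 12				/* assumed 4K pages */

int main(void)
{
	unsigned long min_free_kbytes = 4096;	/* ~1GB of lowmem, per the table below */
	unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
	unsigned long lowmem_pages = 262144;	/* all !ZONE_HIGHMEM pages */
	unsigned long zone_present = 225280;	/* the zone of interest */

	/* this zone's share of pages_min, proportional to its size */
	unsigned long long tmp = (unsigned long long)pages_min * zone_present
				 / lowmem_pages;

	unsigned long zone_min  = tmp;
	unsigned long zone_low  = zone_min + (tmp >> 2);
	unsigned long zone_high = zone_min + (tmp >> 1);

	printf("pages_min=%lu pages_low=%lu pages_high=%lu\n",
	       zone_min, zone_low, zone_high);	/* 880, 1100, 1320 */
	return 0;
}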
4154 | 4166 | ||
4155 | /* | 4167 | /* |
4156 | * Initialise min_free_kbytes. | 4168 | * Initialise min_free_kbytes. |
4157 | * | 4169 | * |
4158 | * For small machines we want it small (128k min). For large machines | 4170 | * For small machines we want it small (128k min). For large machines |
4159 | * we want it large (64MB max). But it is not linear, because network | 4171 | * we want it large (64MB max). But it is not linear, because network |
4160 | * bandwidth does not increase linearly with machine size. We use | 4172 | * bandwidth does not increase linearly with machine size. We use |
4161 | * | 4173 | * |
4162 | * min_free_kbytes = 4 * sqrt(lowmem_kbytes), for better accuracy: | 4174 | * min_free_kbytes = 4 * sqrt(lowmem_kbytes), for better accuracy: |
4163 | * min_free_kbytes = sqrt(lowmem_kbytes * 16) | 4175 | * min_free_kbytes = sqrt(lowmem_kbytes * 16) |
4164 | * | 4176 | * |
4165 | * which yields | 4177 | * which yields |
4166 | * | 4178 | * |
4167 | * 16MB: 512k | 4179 | * 16MB: 512k |
4168 | * 32MB: 724k | 4180 | * 32MB: 724k |
4169 | * 64MB: 1024k | 4181 | * 64MB: 1024k |
4170 | * 128MB: 1448k | 4182 | * 128MB: 1448k |
4171 | * 256MB: 2048k | 4183 | * 256MB: 2048k |
4172 | * 512MB: 2896k | 4184 | * 512MB: 2896k |
4173 | * 1024MB: 4096k | 4185 | * 1024MB: 4096k |
4174 | * 2048MB: 5792k | 4186 | * 2048MB: 5792k |
4175 | * 4096MB: 8192k | 4187 | * 4096MB: 8192k |
4176 | * 8192MB: 11584k | 4188 | * 8192MB: 11584k |
4177 | * 16384MB: 16384k | 4189 | * 16384MB: 16384k |
4178 | */ | 4190 | */ |
4179 | static int __init init_per_zone_pages_min(void) | 4191 | static int __init init_per_zone_pages_min(void) |
4180 | { | 4192 | { |
4181 | unsigned long lowmem_kbytes; | 4193 | unsigned long lowmem_kbytes; |
4182 | 4194 | ||
4183 | lowmem_kbytes = nr_free_buffer_pages() * (PAGE_SIZE >> 10); | 4195 | lowmem_kbytes = nr_free_buffer_pages() * (PAGE_SIZE >> 10); |
4184 | 4196 | ||
4185 | min_free_kbytes = int_sqrt(lowmem_kbytes * 16); | 4197 | min_free_kbytes = int_sqrt(lowmem_kbytes * 16); |
4186 | if (min_free_kbytes < 128) | 4198 | if (min_free_kbytes < 128) |
4187 | min_free_kbytes = 128; | 4199 | min_free_kbytes = 128; |
4188 | if (min_free_kbytes > 65536) | 4200 | if (min_free_kbytes > 65536) |
4189 | min_free_kbytes = 65536; | 4201 | min_free_kbytes = 65536; |
4190 | setup_per_zone_pages_min(); | 4202 | setup_per_zone_pages_min(); |
4191 | setup_per_zone_lowmem_reserve(); | 4203 | setup_per_zone_lowmem_reserve(); |
4192 | return 0; | 4204 | return 0; |
4193 | } | 4205 | } |
4194 | module_init(init_per_zone_pages_min) | 4206 | module_init(init_per_zone_pages_min) |
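The table in the comment above follows directly from the formula. The sketch below evaluates it in userspace; int_sqrt() is kernel-only, so an integer Newton iteration stands in for it, and the printed figures can differ from the table by a kbyte or so because the kernel feeds the formula nr_free_buffer_pages() rather than total RAM.

/* Userspace reproduction of the min_free_kbytes table. */
#include <stdio.h>

static unsigned long int_sqrt(unsigned long x)	/* stand-in for the kernel helper */
{
	unsigned long r = x, y = (x + 1) / 2;

	while (y < r) {		/* integer Newton's method, converges to floor(sqrt) */
		r = y;
		y = (r + x / r) / 2;
	}
	return x ? r : 0;
}

int main(void)
{
	unsigned long mb;

	for (mb = 16; mb <= 16384; mb <<= 1) {
		unsigned long lowmem_kbytes = mb << 10;
		unsigned long min_free = int_sqrt(lowmem_kbytes * 16);

		if (min_free < 128)		/* same clamps as the code above */
			min_free = 128;
		if (min_free > 65536)
			min_free = 65536;
		printf("%6luMB: %luk\n", mb, min_free);
	}
	return 0;
}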
4195 | 4207 | ||
4196 | /* | 4208 | /* |
4197 | * min_free_kbytes_sysctl_handler - just a wrapper around proc_dointvec() so | 4209 | * min_free_kbytes_sysctl_handler - just a wrapper around proc_dointvec() so |
4198 | * that we can call two helper functions whenever min_free_kbytes | 4210 | * that we can call two helper functions whenever min_free_kbytes |
4199 | * changes. | 4211 | * changes. |
4200 | */ | 4212 | */ |
4201 | int min_free_kbytes_sysctl_handler(ctl_table *table, int write, | 4213 | int min_free_kbytes_sysctl_handler(ctl_table *table, int write, |
4202 | struct file *file, void __user *buffer, size_t *length, loff_t *ppos) | 4214 | struct file *file, void __user *buffer, size_t *length, loff_t *ppos) |
4203 | { | 4215 | { |
4204 | proc_dointvec(table, write, file, buffer, length, ppos); | 4216 | proc_dointvec(table, write, file, buffer, length, ppos); |
4205 | if (write) | 4217 | if (write) |
4206 | setup_per_zone_pages_min(); | 4218 | setup_per_zone_pages_min(); |
4207 | return 0; | 4219 | return 0; |
4208 | } | 4220 | } |
4209 | 4221 | ||
4210 | #ifdef CONFIG_NUMA | 4222 | #ifdef CONFIG_NUMA |
4211 | int sysctl_min_unmapped_ratio_sysctl_handler(ctl_table *table, int write, | 4223 | int sysctl_min_unmapped_ratio_sysctl_handler(ctl_table *table, int write, |
4212 | struct file *file, void __user *buffer, size_t *length, loff_t *ppos) | 4224 | struct file *file, void __user *buffer, size_t *length, loff_t *ppos) |
4213 | { | 4225 | { |
4214 | struct zone *zone; | 4226 | struct zone *zone; |
4215 | int rc; | 4227 | int rc; |
4216 | 4228 | ||
4217 | rc = proc_dointvec_minmax(table, write, file, buffer, length, ppos); | 4229 | rc = proc_dointvec_minmax(table, write, file, buffer, length, ppos); |
4218 | if (rc) | 4230 | if (rc) |
4219 | return rc; | 4231 | return rc; |
4220 | 4232 | ||
4221 | for_each_zone(zone) | 4233 | for_each_zone(zone) |
4222 | zone->min_unmapped_pages = (zone->present_pages * | 4234 | zone->min_unmapped_pages = (zone->present_pages * |
4223 | sysctl_min_unmapped_ratio) / 100; | 4235 | sysctl_min_unmapped_ratio) / 100; |
4224 | return 0; | 4236 | return 0; |
4225 | } | 4237 | } |
4226 | 4238 | ||
4227 | int sysctl_min_slab_ratio_sysctl_handler(ctl_table *table, int write, | 4239 | int sysctl_min_slab_ratio_sysctl_handler(ctl_table *table, int write, |
4228 | struct file *file, void __user *buffer, size_t *length, loff_t *ppos) | 4240 | struct file *file, void __user *buffer, size_t *length, loff_t *ppos) |
4229 | { | 4241 | { |
4230 | struct zone *zone; | 4242 | struct zone *zone; |
4231 | int rc; | 4243 | int rc; |
4232 | 4244 | ||
4233 | rc = proc_dointvec_minmax(table, write, file, buffer, length, ppos); | 4245 | rc = proc_dointvec_minmax(table, write, file, buffer, length, ppos); |
4234 | if (rc) | 4246 | if (rc) |
4235 | return rc; | 4247 | return rc; |
4236 | 4248 | ||
4237 | for_each_zone(zone) | 4249 | for_each_zone(zone) |
4238 | zone->min_slab_pages = (zone->present_pages * | 4250 | zone->min_slab_pages = (zone->present_pages * |
4239 | sysctl_min_slab_ratio) / 100; | 4251 | sysctl_min_slab_ratio) / 100; |
4240 | return 0; | 4252 | return 0; |
4241 | } | 4253 | } |
4242 | #endif | 4254 | #endif |
4243 | 4255 | ||
4244 | /* | 4256 | /* |
4245 | * lowmem_reserve_ratio_sysctl_handler - just a wrapper around | 4257 | * lowmem_reserve_ratio_sysctl_handler - just a wrapper around |
4246 | * proc_dointvec() so that we can call setup_per_zone_lowmem_reserve() | 4258 | * proc_dointvec() so that we can call setup_per_zone_lowmem_reserve() |
4247 | * whenever sysctl_lowmem_reserve_ratio changes. | 4259 | * whenever sysctl_lowmem_reserve_ratio changes. |
4248 | * | 4260 | * |
4249 | * The reserve ratio obviously has absolutely no relation with the | 4261 | * The reserve ratio obviously has absolutely no relation with the |
4250 | * pages_min watermarks. The lowmem reserve ratio can only make sense | 4262 | * pages_min watermarks. The lowmem reserve ratio can only make sense |
4251 | * as a function of the boot-time zone sizes. | 4263 | * as a function of the boot-time zone sizes. |
4252 | */ | 4264 | */ |
4253 | int lowmem_reserve_ratio_sysctl_handler(ctl_table *table, int write, | 4265 | int lowmem_reserve_ratio_sysctl_handler(ctl_table *table, int write, |
4254 | struct file *file, void __user *buffer, size_t *length, loff_t *ppos) | 4266 | struct file *file, void __user *buffer, size_t *length, loff_t *ppos) |
4255 | { | 4267 | { |
4256 | proc_dointvec_minmax(table, write, file, buffer, length, ppos); | 4268 | proc_dointvec_minmax(table, write, file, buffer, length, ppos); |
4257 | setup_per_zone_lowmem_reserve(); | 4269 | setup_per_zone_lowmem_reserve(); |
4258 | return 0; | 4270 | return 0; |
4259 | } | 4271 | } |
4260 | 4272 | ||
4261 | /* | 4273 | /* |
4262 | * percpu_pagelist_fraction - changes the pcp->high for each zone on each | 4274 | * percpu_pagelist_fraction - changes the pcp->high for each zone on each |
4263 | * cpu. It is the fraction of total pages in each zone that a hot per cpu pagelist | 4275 | * cpu. It is the fraction of total pages in each zone that a hot per cpu pagelist |
4264 | * can have before it gets flushed back to the buddy allocator. | 4276 | * can have before it gets flushed back to the buddy allocator. |
4265 | */ | 4277 | */ |
4266 | 4278 | ||
4267 | int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write, | 4279 | int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write, |
4268 | struct file *file, void __user *buffer, size_t *length, loff_t *ppos) | 4280 | struct file *file, void __user *buffer, size_t *length, loff_t *ppos) |
4269 | { | 4281 | { |
4270 | struct zone *zone; | 4282 | struct zone *zone; |
4271 | unsigned int cpu; | 4283 | unsigned int cpu; |
4272 | int ret; | 4284 | int ret; |
4273 | 4285 | ||
4274 | ret = proc_dointvec_minmax(table, write, file, buffer, length, ppos); | 4286 | ret = proc_dointvec_minmax(table, write, file, buffer, length, ppos); |
4275 | if (!write || (ret == -EINVAL)) | 4287 | if (!write || (ret == -EINVAL)) |
4276 | return ret; | 4288 | return ret; |
4277 | for_each_zone(zone) { | 4289 | for_each_zone(zone) { |
4278 | for_each_online_cpu(cpu) { | 4290 | for_each_online_cpu(cpu) { |
4279 | unsigned long high; | 4291 | unsigned long high; |
4280 | high = zone->present_pages / percpu_pagelist_fraction; | 4292 | high = zone->present_pages / percpu_pagelist_fraction; |
4281 | setup_pagelist_highmark(zone_pcp(zone, cpu), high); | 4293 | setup_pagelist_highmark(zone_pcp(zone, cpu), high); |
4282 | } | 4294 | } |
4283 | } | 4295 | } |
4284 | return 0; | 4296 | return 0; |
4285 | } | 4297 | } |
4286 | 4298 | ||
4287 | int hashdist = HASHDIST_DEFAULT; | 4299 | int hashdist = HASHDIST_DEFAULT; |
4288 | 4300 | ||
4289 | #ifdef CONFIG_NUMA | 4301 | #ifdef CONFIG_NUMA |
4290 | static int __init set_hashdist(char *str) | 4302 | static int __init set_hashdist(char *str) |
4291 | { | 4303 | { |
4292 | if (!str) | 4304 | if (!str) |
4293 | return 0; | 4305 | return 0; |
4294 | hashdist = simple_strtoul(str, &str, 0); | 4306 | hashdist = simple_strtoul(str, &str, 0); |
4295 | return 1; | 4307 | return 1; |
4296 | } | 4308 | } |
4297 | __setup("hashdist=", set_hashdist); | 4309 | __setup("hashdist=", set_hashdist); |
4298 | #endif | 4310 | #endif |
4299 | 4311 | ||
4300 | /* | 4312 | /* |
4301 | * allocate a large system hash table from bootmem | 4313 | * allocate a large system hash table from bootmem |
4302 | * - it is assumed that the hash table must contain an exact power-of-2 | 4314 | * - it is assumed that the hash table must contain an exact power-of-2 |
4303 | * quantity of entries | 4315 | * quantity of entries |
4304 | * - limit is the number of hash buckets, not the total allocation size | 4316 | * - limit is the number of hash buckets, not the total allocation size |
4305 | */ | 4317 | */ |
4306 | void *__init alloc_large_system_hash(const char *tablename, | 4318 | void *__init alloc_large_system_hash(const char *tablename, |
4307 | unsigned long bucketsize, | 4319 | unsigned long bucketsize, |
4308 | unsigned long numentries, | 4320 | unsigned long numentries, |
4309 | int scale, | 4321 | int scale, |
4310 | int flags, | 4322 | int flags, |
4311 | unsigned int *_hash_shift, | 4323 | unsigned int *_hash_shift, |
4312 | unsigned int *_hash_mask, | 4324 | unsigned int *_hash_mask, |
4313 | unsigned long limit) | 4325 | unsigned long limit) |
4314 | { | 4326 | { |
4315 | unsigned long long max = limit; | 4327 | unsigned long long max = limit; |
4316 | unsigned long log2qty, size; | 4328 | unsigned long log2qty, size; |
4317 | void *table = NULL; | 4329 | void *table = NULL; |
4318 | 4330 | ||
4319 | /* allow the kernel cmdline to have a say */ | 4331 | /* allow the kernel cmdline to have a say */ |
4320 | if (!numentries) { | 4332 | if (!numentries) { |
4321 | /* round applicable memory size up to nearest megabyte */ | 4333 | /* round applicable memory size up to nearest megabyte */ |
4322 | numentries = nr_kernel_pages; | 4334 | numentries = nr_kernel_pages; |
4323 | numentries += (1UL << (20 - PAGE_SHIFT)) - 1; | 4335 | numentries += (1UL << (20 - PAGE_SHIFT)) - 1; |
4324 | numentries >>= 20 - PAGE_SHIFT; | 4336 | numentries >>= 20 - PAGE_SHIFT; |
4325 | numentries <<= 20 - PAGE_SHIFT; | 4337 | numentries <<= 20 - PAGE_SHIFT; |
4326 | 4338 | ||
4327 | /* limit to 1 bucket per 2^scale bytes of low memory */ | 4339 | /* limit to 1 bucket per 2^scale bytes of low memory */ |
4328 | if (scale > PAGE_SHIFT) | 4340 | if (scale > PAGE_SHIFT) |
4329 | numentries >>= (scale - PAGE_SHIFT); | 4341 | numentries >>= (scale - PAGE_SHIFT); |
4330 | else | 4342 | else |
4331 | numentries <<= (PAGE_SHIFT - scale); | 4343 | numentries <<= (PAGE_SHIFT - scale); |
4332 | 4344 | ||
4333 | /* Make sure we've got at least a 0-order allocation. */ | 4345 | /* Make sure we've got at least a 0-order allocation. */ |
4334 | if (unlikely((numentries * bucketsize) < PAGE_SIZE)) | 4346 | if (unlikely((numentries * bucketsize) < PAGE_SIZE)) |
4335 | numentries = PAGE_SIZE / bucketsize; | 4347 | numentries = PAGE_SIZE / bucketsize; |
4336 | } | 4348 | } |
4337 | numentries = roundup_pow_of_two(numentries); | 4349 | numentries = roundup_pow_of_two(numentries); |
4338 | 4350 | ||
4339 | /* limit allocation size to 1/16 total memory by default */ | 4351 | /* limit allocation size to 1/16 total memory by default */ |
4340 | if (max == 0) { | 4352 | if (max == 0) { |
4341 | max = ((unsigned long long)nr_all_pages << PAGE_SHIFT) >> 4; | 4353 | max = ((unsigned long long)nr_all_pages << PAGE_SHIFT) >> 4; |
4342 | do_div(max, bucketsize); | 4354 | do_div(max, bucketsize); |
4343 | } | 4355 | } |
4344 | 4356 | ||
4345 | if (numentries > max) | 4357 | if (numentries > max) |
4346 | numentries = max; | 4358 | numentries = max; |
4347 | 4359 | ||
4348 | log2qty = ilog2(numentries); | 4360 | log2qty = ilog2(numentries); |
4349 | 4361 | ||
4350 | do { | 4362 | do { |
4351 | size = bucketsize << log2qty; | 4363 | size = bucketsize << log2qty; |
4352 | if (flags & HASH_EARLY) | 4364 | if (flags & HASH_EARLY) |
4353 | table = alloc_bootmem(size); | 4365 | table = alloc_bootmem(size); |
4354 | else if (hashdist) | 4366 | else if (hashdist) |
4355 | table = __vmalloc(size, GFP_ATOMIC, PAGE_KERNEL); | 4367 | table = __vmalloc(size, GFP_ATOMIC, PAGE_KERNEL); |
4356 | else { | 4368 | else { |
4357 | unsigned long order = get_order(size); | 4369 | unsigned long order = get_order(size); |
4358 | table = (void*) __get_free_pages(GFP_ATOMIC, order); | 4370 | table = (void*) __get_free_pages(GFP_ATOMIC, order); |
4359 | /* | 4371 | /* |
4360 | * If bucketsize is not a power-of-two, we may free | 4372 | * If bucketsize is not a power-of-two, we may free |
4361 | * some pages at the end of hash table. | 4373 | * some pages at the end of hash table. |
4362 | */ | 4374 | */ |
4363 | if (table) { | 4375 | if (table) { |
4364 | unsigned long alloc_end = (unsigned long)table + | 4376 | unsigned long alloc_end = (unsigned long)table + |
4365 | (PAGE_SIZE << order); | 4377 | (PAGE_SIZE << order); |
4366 | unsigned long used = (unsigned long)table + | 4378 | unsigned long used = (unsigned long)table + |
4367 | PAGE_ALIGN(size); | 4379 | PAGE_ALIGN(size); |
4368 | split_page(virt_to_page(table), order); | 4380 | split_page(virt_to_page(table), order); |
4369 | while (used < alloc_end) { | 4381 | while (used < alloc_end) { |
4370 | free_page(used); | 4382 | free_page(used); |
4371 | used += PAGE_SIZE; | 4383 | used += PAGE_SIZE; |
4372 | } | 4384 | } |
4373 | } | 4385 | } |
4374 | } | 4386 | } |
4375 | } while (!table && size > PAGE_SIZE && --log2qty); | 4387 | } while (!table && size > PAGE_SIZE && --log2qty); |
4376 | 4388 | ||
4377 | if (!table) | 4389 | if (!table) |
4378 | panic("Failed to allocate %s hash table\n", tablename); | 4390 | panic("Failed to allocate %s hash table\n", tablename); |
4379 | 4391 | ||
4380 | printk(KERN_INFO "%s hash table entries: %d (order: %d, %lu bytes)\n", | 4392 | printk(KERN_INFO "%s hash table entries: %d (order: %d, %lu bytes)\n", |
4381 | tablename, | 4393 | tablename, |
4382 | (1U << log2qty), | 4394 | (1U << log2qty), |
4383 | ilog2(size) - PAGE_SHIFT, | 4395 | ilog2(size) - PAGE_SHIFT, |
4384 | size); | 4396 | size); |
4385 | 4397 | ||
4386 | if (_hash_shift) | 4398 | if (_hash_shift) |
4387 | *_hash_shift = log2qty; | 4399 | *_hash_shift = log2qty; |
4388 | if (_hash_mask) | 4400 | if (_hash_mask) |
4389 | *_hash_mask = (1 << log2qty) - 1; | 4401 | *_hash_mask = (1 << log2qty) - 1; |
4390 | 4402 | ||
4391 | return table; | 4403 | return table; |
4392 | } | 4404 | } |
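Stripped of its allocation fallbacks, the sizing logic above derives a power-of-two bucket count from low memory, one bucket per 2^scale bytes, capped at 1/16 of total memory. A userspace sketch with assumed memory and bucket sizes (roundup_pow_of_two() and ilog2() are kernel helpers, so small stand-ins are defined):

/* Userspace sketch of the hash-table sizing; all figures are assumptions. */
#include <stdio.h>

#define PAGE_SHIFT 12				/* assumed 4K pages */

static unsigned long roundup_pow_of_two(unsigned long x)
{
	unsigned long r = 1;

	while (r < x)
		r <<= 1;
	return r;
}

static unsigned int ilog2(unsigned long x)
{
	unsigned int l = 0;

	while (x >>= 1)
		l++;
	return l;
}

int main(void)
{
	unsigned long nr_kernel_pages = 1UL << 18;	/* assume 1GB of lowmem */
	unsigned long nr_all_pages    = 1UL << 18;
	unsigned long bucketsize = 16;			/* assumed bytes per bucket */
	int scale = 14;					/* 1 bucket per 16KB of lowmem */

	/* one entry per 2^scale bytes of low memory, rounded up to a power of 2 */
	unsigned long numentries = (nr_kernel_pages << PAGE_SHIFT) >> scale;
	numentries = roundup_pow_of_two(numentries);

	/* cap the table at 1/16 of total memory */
	unsigned long max = ((nr_all_pages << PAGE_SHIFT) >> 4) / bucketsize;
	if (numentries > max)
		numentries = max;

	unsigned long log2qty = ilog2(numentries);
	unsigned long size = bucketsize << log2qty;

	printf("entries=%lu (order: %u, %lu bytes)\n",
	       numentries, ilog2(size) - PAGE_SHIFT, size);
	return 0;
}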
4393 | 4405 | ||
4394 | #ifdef CONFIG_OUT_OF_LINE_PFN_TO_PAGE | 4406 | #ifdef CONFIG_OUT_OF_LINE_PFN_TO_PAGE |
4395 | struct page *pfn_to_page(unsigned long pfn) | 4407 | struct page *pfn_to_page(unsigned long pfn) |
4396 | { | 4408 | { |
4397 | return __pfn_to_page(pfn); | 4409 | return __pfn_to_page(pfn); |
4398 | } | 4410 | } |
4399 | unsigned long page_to_pfn(struct page *page) | 4411 | unsigned long page_to_pfn(struct page *page) |
4400 | { | 4412 | { |
4401 | return __page_to_pfn(page); | 4413 | return __page_to_pfn(page); |
4402 | } | 4414 | } |
4403 | EXPORT_SYMBOL(pfn_to_page); | 4415 | EXPORT_SYMBOL(pfn_to_page); |
4404 | EXPORT_SYMBOL(page_to_pfn); | 4416 | EXPORT_SYMBOL(page_to_pfn); |
4405 | #endif /* CONFIG_OUT_OF_LINE_PFN_TO_PAGE */ | 4417 | #endif /* CONFIG_OUT_OF_LINE_PFN_TO_PAGE */ |
4406 | 4418 | ||
4407 | /* Return a pointer to the bitmap storing bits affecting a block of pages */ | 4419 | /* Return a pointer to the bitmap storing bits affecting a block of pages */ |
4408 | static inline unsigned long *get_pageblock_bitmap(struct zone *zone, | 4420 | static inline unsigned long *get_pageblock_bitmap(struct zone *zone, |
4409 | unsigned long pfn) | 4421 | unsigned long pfn) |
4410 | { | 4422 | { |
4411 | #ifdef CONFIG_SPARSEMEM | 4423 | #ifdef CONFIG_SPARSEMEM |
4412 | return __pfn_to_section(pfn)->pageblock_flags; | 4424 | return __pfn_to_section(pfn)->pageblock_flags; |
4413 | #else | 4425 | #else |
4414 | return zone->pageblock_flags; | 4426 | return zone->pageblock_flags; |
4415 | #endif /* CONFIG_SPARSEMEM */ | 4427 | #endif /* CONFIG_SPARSEMEM */ |
4416 | } | 4428 | } |
4417 | 4429 | ||
4418 | static inline int pfn_to_bitidx(struct zone *zone, unsigned long pfn) | 4430 | static inline int pfn_to_bitidx(struct zone *zone, unsigned long pfn) |
4419 | { | 4431 | { |
4420 | #ifdef CONFIG_SPARSEMEM | 4432 | #ifdef CONFIG_SPARSEMEM |
4421 | pfn &= (PAGES_PER_SECTION-1); | 4433 | pfn &= (PAGES_PER_SECTION-1); |
4422 | return (pfn >> pageblock_order) * NR_PAGEBLOCK_BITS; | 4434 | return (pfn >> pageblock_order) * NR_PAGEBLOCK_BITS; |
4423 | #else | 4435 | #else |
4424 | pfn = pfn - zone->zone_start_pfn; | 4436 | pfn = pfn - zone->zone_start_pfn; |
4425 | return (pfn >> pageblock_order) * NR_PAGEBLOCK_BITS; | 4437 | return (pfn >> pageblock_order) * NR_PAGEBLOCK_BITS; |
4426 | #endif /* CONFIG_SPARSEMEM */ | 4438 | #endif /* CONFIG_SPARSEMEM */ |
4427 | } | 4439 | } |
4428 | 4440 | ||
4429 | /** | 4441 | /** |
4430 | * get_pageblock_flags_group - Return the requested group of flags for the pageblock_nr_pages block of pages | 4442 | * get_pageblock_flags_group - Return the requested group of flags for the pageblock_nr_pages block of pages |
4431 | * @page: The page within the block of interest | 4443 | * @page: The page within the block of interest |
4432 | * @start_bitidx: The first bit of interest to retrieve | 4444 | * @start_bitidx: The first bit of interest to retrieve |
4433 | * @end_bitidx: The last bit of interest | 4445 | * @end_bitidx: The last bit of interest |
4434 | * returns pageblock_bits flags | 4446 | * returns pageblock_bits flags |
4435 | */ | 4447 | */ |
4436 | unsigned long get_pageblock_flags_group(struct page *page, | 4448 | unsigned long get_pageblock_flags_group(struct page *page, |
4437 | int start_bitidx, int end_bitidx) | 4449 | int start_bitidx, int end_bitidx) |
4438 | { | 4450 | { |
4439 | struct zone *zone; | 4451 | struct zone *zone; |
4440 | unsigned long *bitmap; | 4452 | unsigned long *bitmap; |
4441 | unsigned long pfn, bitidx; | 4453 | unsigned long pfn, bitidx; |
4442 | unsigned long flags = 0; | 4454 | unsigned long flags = 0; |
4443 | unsigned long value = 1; | 4455 | unsigned long value = 1; |
4444 | 4456 | ||
4445 | zone = page_zone(page); | 4457 | zone = page_zone(page); |
4446 | pfn = page_to_pfn(page); | 4458 | pfn = page_to_pfn(page); |
4447 | bitmap = get_pageblock_bitmap(zone, pfn); | 4459 | bitmap = get_pageblock_bitmap(zone, pfn); |
4448 | bitidx = pfn_to_bitidx(zone, pfn); | 4460 | bitidx = pfn_to_bitidx(zone, pfn); |
4449 | 4461 | ||
4450 | for (; start_bitidx <= end_bitidx; start_bitidx++, value <<= 1) | 4462 | for (; start_bitidx <= end_bitidx; start_bitidx++, value <<= 1) |
4451 | if (test_bit(bitidx + start_bitidx, bitmap)) | 4463 | if (test_bit(bitidx + start_bitidx, bitmap)) |
4452 | flags |= value; | 4464 | flags |= value; |
4453 | 4465 | ||
4454 | return flags; | 4466 | return flags; |
4455 | } | 4467 | } |
4456 | 4468 | ||
4457 | /** | 4469 | /** |
4458 | * set_pageblock_flags_group - Set the requested group of flags for a pageblock_nr_pages block of pages | 4470 | * set_pageblock_flags_group - Set the requested group of flags for a pageblock_nr_pages block of pages |
4459 | * @page: The page within the block of interest | 4471 | * @page: The page within the block of interest |
4460 | * @start_bitidx: The first bit of interest | 4472 | * @start_bitidx: The first bit of interest |
4461 | * @end_bitidx: The last bit of interest | 4473 | * @end_bitidx: The last bit of interest |
4462 | * @flags: The flags to set | 4474 | * @flags: The flags to set |
4463 | */ | 4475 | */ |
4464 | void set_pageblock_flags_group(struct page *page, unsigned long flags, | 4476 | void set_pageblock_flags_group(struct page *page, unsigned long flags, |
4465 | int start_bitidx, int end_bitidx) | 4477 | int start_bitidx, int end_bitidx) |
4466 | { | 4478 | { |
4467 | struct zone *zone; | 4479 | struct zone *zone; |
4468 | unsigned long *bitmap; | 4480 | unsigned long *bitmap; |
4469 | unsigned long pfn, bitidx; | 4481 | unsigned long pfn, bitidx; |
4470 | unsigned long value = 1; | 4482 | unsigned long value = 1; |
4471 | 4483 | ||
4472 | zone = page_zone(page); | 4484 | zone = page_zone(page); |
4473 | pfn = page_to_pfn(page); | 4485 | pfn = page_to_pfn(page); |
4474 | bitmap = get_pageblock_bitmap(zone, pfn); | 4486 | bitmap = get_pageblock_bitmap(zone, pfn); |
4475 | bitidx = pfn_to_bitidx(zone, pfn); | 4487 | bitidx = pfn_to_bitidx(zone, pfn); |
4476 | VM_BUG_ON(pfn < zone->zone_start_pfn); | 4488 | VM_BUG_ON(pfn < zone->zone_start_pfn); |
4477 | VM_BUG_ON(pfn >= zone->zone_start_pfn + zone->spanned_pages); | 4489 | VM_BUG_ON(pfn >= zone->zone_start_pfn + zone->spanned_pages); |
4478 | 4490 | ||
4479 | for (; start_bitidx <= end_bitidx; start_bitidx++, value <<= 1) | 4491 | for (; start_bitidx <= end_bitidx; start_bitidx++, value <<= 1) |
4480 | if (flags & value) | 4492 | if (flags & value) |
4481 | __set_bit(bitidx + start_bitidx, bitmap); | 4493 | __set_bit(bitidx + start_bitidx, bitmap); |
4482 | else | 4494 | else |
4483 | __clear_bit(bitidx + start_bitidx, bitmap); | 4495 | __clear_bit(bitidx + start_bitidx, bitmap); |
4484 | } | 4496 | } |
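The two helpers above keep a small group of bits per 2^pageblock_order block of pfns in a flat bitmap; get reads the group out bit by bit and set writes it the same way. The userspace sketch below mirrors that indexing; pageblock_order and NR_PAGEBLOCK_BITS are configuration-dependent in the kernel, so the values here are assumptions.

/* Userspace sketch of the pageblock-bitmap indexing; sizes are assumed. */
#include <stdio.h>
#include <string.h>

#define PAGEBLOCK_ORDER    9	/* assumed: one pageblock = 512 pages */
#define NR_PAGEBLOCK_BITS  3	/* assumed: 3 bits hold the migratetype */
#define BITS_PER_LONG      (8 * sizeof(unsigned long))

static void set_bit(unsigned long *map, unsigned long nr)
{
	map[nr / BITS_PER_LONG] |= 1UL << (nr % BITS_PER_LONG);
}

static int test_bit(const unsigned long *map, unsigned long nr)
{
	return (map[nr / BITS_PER_LONG] >> (nr % BITS_PER_LONG)) & 1;
}

static unsigned long pfn_to_bitidx(unsigned long pfn)
{
	return (pfn >> PAGEBLOCK_ORDER) * NR_PAGEBLOCK_BITS;
}

int main(void)
{
	unsigned long bitmap[16];
	unsigned long pfn = 0x2345;	/* any pfn inside the block gives the same group */
	unsigned long bitidx = pfn_to_bitidx(pfn);
	unsigned long migratetype = 2;	/* e.g. a movable pageblock */
	unsigned long flags = 0, value;
	int bit;

	memset(bitmap, 0, sizeof(bitmap));

	/* set: write one bit of the migratetype per pageblock bit */
	for (bit = 0, value = 1; bit < NR_PAGEBLOCK_BITS; bit++, value <<= 1)
		if (migratetype & value)
			set_bit(bitmap, bitidx + bit);

	/* get: read the same group back */
	for (bit = 0, value = 1; bit < NR_PAGEBLOCK_BITS; bit++, value <<= 1)
		if (test_bit(bitmap, bitidx + bit))
			flags |= value;

	printf("pfn 0x%lx -> bitidx %lu, migratetype %lu\n", pfn, bitidx, flags);
	return 0;
}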
4485 | 4497 | ||
4486 | /* | 4498 | /* |
4487 | * This is designed as a sub-function; please see page_isolation.c also. | 4499 | * This is designed as a sub-function; please see page_isolation.c also. |
4488 | * Set/clear a pageblock's type to ISOLATE. | 4500 | * Set/clear a pageblock's type to ISOLATE. |
4489 | * The page allocator never allocates memory from an ISOLATE block. | 4501 | * The page allocator never allocates memory from an ISOLATE block. |
4490 | */ | 4502 | */ |
4491 | 4503 | ||
4492 | int set_migratetype_isolate(struct page *page) | 4504 | int set_migratetype_isolate(struct page *page) |
4493 | { | 4505 | { |
4494 | struct zone *zone; | 4506 | struct zone *zone; |
4495 | unsigned long flags; | 4507 | unsigned long flags; |
4496 | int ret = -EBUSY; | 4508 | int ret = -EBUSY; |
4497 | 4509 | ||
4498 | zone = page_zone(page); | 4510 | zone = page_zone(page); |
4499 | spin_lock_irqsave(&zone->lock, flags); | 4511 | spin_lock_irqsave(&zone->lock, flags); |
4500 | /* | 4512 | /* |
4501 | * In future, more migrate types will be able to be isolation target. | 4513 | * In future, more migrate types will be able to be isolation target. |
4502 | */ | 4514 | */ |
4503 | if (get_pageblock_migratetype(page) != MIGRATE_MOVABLE) | 4515 | if (get_pageblock_migratetype(page) != MIGRATE_MOVABLE) |
4504 | goto out; | 4516 | goto out; |
4505 | set_pageblock_migratetype(page, MIGRATE_ISOLATE); | 4517 | set_pageblock_migratetype(page, MIGRATE_ISOLATE); |
4506 | move_freepages_block(zone, page, MIGRATE_ISOLATE); | 4518 | move_freepages_block(zone, page, MIGRATE_ISOLATE); |
4507 | ret = 0; | 4519 | ret = 0; |
4508 | out: | 4520 | out: |
4509 | spin_unlock_irqrestore(&zone->lock, flags); | 4521 | spin_unlock_irqrestore(&zone->lock, flags); |
4510 | if (!ret) | 4522 | if (!ret) |
4511 | drain_all_pages(); | 4523 | drain_all_pages(); |
4512 | return ret; | 4524 | return ret; |
4513 | } | 4525 | } |
4514 | 4526 | ||
4515 | void unset_migratetype_isolate(struct page *page) | 4527 | void unset_migratetype_isolate(struct page *page) |
4516 | { | 4528 | { |
4517 | struct zone *zone; | 4529 | struct zone *zone; |
4518 | unsigned long flags; | 4530 | unsigned long flags; |
4519 | zone = page_zone(page); | 4531 | zone = page_zone(page); |
4520 | spin_lock_irqsave(&zone->lock, flags); | 4532 | spin_lock_irqsave(&zone->lock, flags); |
4521 | if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE) | 4533 | if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE) |
4522 | goto out; | 4534 | goto out; |
4523 | set_pageblock_migratetype(page, MIGRATE_MOVABLE); | 4535 | set_pageblock_migratetype(page, MIGRATE_MOVABLE); |
4524 | move_freepages_block(zone, page, MIGRATE_MOVABLE); | 4536 | move_freepages_block(zone, page, MIGRATE_MOVABLE); |
4525 | out: | 4537 | out: |
4526 | spin_unlock_irqrestore(&zone->lock, flags); | 4538 | spin_unlock_irqrestore(&zone->lock, flags); |
4527 | } | 4539 | } |
4528 | 4540 | ||
4529 | #ifdef CONFIG_MEMORY_HOTREMOVE | 4541 | #ifdef CONFIG_MEMORY_HOTREMOVE |
4530 | /* | 4542 | /* |
4531 | * All pages in the range must be isolated before calling this. | 4543 | * All pages in the range must be isolated before calling this. |
4532 | */ | 4544 | */ |
4533 | void | 4545 | void |
4534 | __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn) | 4546 | __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn) |
4535 | { | 4547 | { |
4536 | struct page *page; | 4548 | struct page *page; |
4537 | struct zone *zone; | 4549 | struct zone *zone; |
4538 | int order, i; | 4550 | int order, i; |
4539 | unsigned long pfn; | 4551 | unsigned long pfn; |
4540 | unsigned long flags; | 4552 | unsigned long flags; |
4541 | /* find the first valid pfn */ | 4553 | /* find the first valid pfn */ |
4542 | for (pfn = start_pfn; pfn < end_pfn; pfn++) | 4554 | for (pfn = start_pfn; pfn < end_pfn; pfn++) |
4543 | if (pfn_valid(pfn)) | 4555 | if (pfn_valid(pfn)) |
4544 | break; | 4556 | break; |
4545 | if (pfn == end_pfn) | 4557 | if (pfn == end_pfn) |
4546 | return; | 4558 | return; |
4547 | zone = page_zone(pfn_to_page(pfn)); | 4559 | zone = page_zone(pfn_to_page(pfn)); |
4548 | spin_lock_irqsave(&zone->lock, flags); | 4560 | spin_lock_irqsave(&zone->lock, flags); |
4549 | pfn = start_pfn; | 4561 | pfn = start_pfn; |
4550 | while (pfn < end_pfn) { | 4562 | while (pfn < end_pfn) { |
4551 | if (!pfn_valid(pfn)) { | 4563 | if (!pfn_valid(pfn)) { |
4552 | pfn++; | 4564 | pfn++; |
4553 | continue; | 4565 | continue; |
4554 | } | 4566 | } |
4555 | page = pfn_to_page(pfn); | 4567 | page = pfn_to_page(pfn); |
4556 | BUG_ON(page_count(page)); | 4568 | BUG_ON(page_count(page)); |
4557 | BUG_ON(!PageBuddy(page)); | 4569 | BUG_ON(!PageBuddy(page)); |
4558 | order = page_order(page); | 4570 | order = page_order(page); |
4559 | #ifdef CONFIG_DEBUG_VM | 4571 | #ifdef CONFIG_DEBUG_VM |
4560 | printk(KERN_INFO "remove from free list %lx %d %lx\n", | 4572 | printk(KERN_INFO "remove from free list %lx %d %lx\n", |
4561 | pfn, 1 << order, end_pfn); | 4573 | pfn, 1 << order, end_pfn); |
4562 | #endif | 4574 | #endif |
4563 | list_del(&page->lru); | 4575 | list_del(&page->lru); |
4564 | rmv_page_order(page); | 4576 | rmv_page_order(page); |
4565 | zone->free_area[order].nr_free--; | 4577 | zone->free_area[order].nr_free--; |
4566 | __mod_zone_page_state(zone, NR_FREE_PAGES, | 4578 | __mod_zone_page_state(zone, NR_FREE_PAGES, |
4567 | - (1UL << order)); | 4579 | - (1UL << order)); |
4568 | for (i = 0; i < (1 << order); i++) | 4580 | for (i = 0; i < (1 << order); i++) |
4569 | SetPageReserved((page+i)); | 4581 | SetPageReserved((page+i)); |
4570 | pfn += (1 << order); | 4582 | pfn += (1 << order); |
4571 | } | 4583 | } |
4572 | spin_unlock_irqrestore(&zone->lock, flags); | 4584 | spin_unlock_irqrestore(&zone->lock, flags); |
4573 | } | 4585 | } |
4574 | #endif | 4586 | #endif |
mm/vmscan.c
1 | /* | 1 | /* |
2 | * linux/mm/vmscan.c | 2 | * linux/mm/vmscan.c |
3 | * | 3 | * |
4 | * Copyright (C) 1991, 1992, 1993, 1994 Linus Torvalds | 4 | * Copyright (C) 1991, 1992, 1993, 1994 Linus Torvalds |
5 | * | 5 | * |
6 | * Swap reorganised 29.12.95, Stephen Tweedie. | 6 | * Swap reorganised 29.12.95, Stephen Tweedie. |
7 | * kswapd added: 7.1.96 sct | 7 | * kswapd added: 7.1.96 sct |
8 | * Removed kswapd_ctl limits, and swap out as many pages as needed | 8 | * Removed kswapd_ctl limits, and swap out as many pages as needed |
9 | * to bring the system back to freepages.high: 2.4.97, Rik van Riel. | 9 | * to bring the system back to freepages.high: 2.4.97, Rik van Riel. |
10 | * Zone aware kswapd started 02/00, Kanoj Sarcar (kanoj@sgi.com). | 10 | * Zone aware kswapd started 02/00, Kanoj Sarcar (kanoj@sgi.com). |
11 | * Multiqueue VM started 5.8.00, Rik van Riel. | 11 | * Multiqueue VM started 5.8.00, Rik van Riel. |
12 | */ | 12 | */ |
13 | 13 | ||
14 | #include <linux/mm.h> | 14 | #include <linux/mm.h> |
15 | #include <linux/module.h> | 15 | #include <linux/module.h> |
16 | #include <linux/slab.h> | 16 | #include <linux/slab.h> |
17 | #include <linux/kernel_stat.h> | 17 | #include <linux/kernel_stat.h> |
18 | #include <linux/swap.h> | 18 | #include <linux/swap.h> |
19 | #include <linux/pagemap.h> | 19 | #include <linux/pagemap.h> |
20 | #include <linux/init.h> | 20 | #include <linux/init.h> |
21 | #include <linux/highmem.h> | 21 | #include <linux/highmem.h> |
22 | #include <linux/vmstat.h> | 22 | #include <linux/vmstat.h> |
23 | #include <linux/file.h> | 23 | #include <linux/file.h> |
24 | #include <linux/writeback.h> | 24 | #include <linux/writeback.h> |
25 | #include <linux/blkdev.h> | 25 | #include <linux/blkdev.h> |
26 | #include <linux/buffer_head.h> /* for try_to_release_page(), | 26 | #include <linux/buffer_head.h> /* for try_to_release_page(), |
27 | buffer_heads_over_limit */ | 27 | buffer_heads_over_limit */ |
28 | #include <linux/mm_inline.h> | 28 | #include <linux/mm_inline.h> |
29 | #include <linux/pagevec.h> | 29 | #include <linux/pagevec.h> |
30 | #include <linux/backing-dev.h> | 30 | #include <linux/backing-dev.h> |
31 | #include <linux/rmap.h> | 31 | #include <linux/rmap.h> |
32 | #include <linux/topology.h> | 32 | #include <linux/topology.h> |
33 | #include <linux/cpu.h> | 33 | #include <linux/cpu.h> |
34 | #include <linux/cpuset.h> | 34 | #include <linux/cpuset.h> |
35 | #include <linux/notifier.h> | 35 | #include <linux/notifier.h> |
36 | #include <linux/rwsem.h> | 36 | #include <linux/rwsem.h> |
37 | #include <linux/delay.h> | 37 | #include <linux/delay.h> |
38 | #include <linux/kthread.h> | 38 | #include <linux/kthread.h> |
39 | #include <linux/freezer.h> | 39 | #include <linux/freezer.h> |
40 | #include <linux/memcontrol.h> | 40 | #include <linux/memcontrol.h> |
41 | 41 | ||
42 | #include <asm/tlbflush.h> | 42 | #include <asm/tlbflush.h> |
43 | #include <asm/div64.h> | 43 | #include <asm/div64.h> |
44 | 44 | ||
45 | #include <linux/swapops.h> | 45 | #include <linux/swapops.h> |
46 | 46 | ||
47 | #include "internal.h" | 47 | #include "internal.h" |
48 | 48 | ||
49 | struct scan_control { | 49 | struct scan_control { |
50 | /* Incremented by the number of inactive pages that were scanned */ | 50 | /* Incremented by the number of inactive pages that were scanned */ |
51 | unsigned long nr_scanned; | 51 | unsigned long nr_scanned; |
52 | 52 | ||
53 | /* This context's GFP mask */ | 53 | /* This context's GFP mask */ |
54 | gfp_t gfp_mask; | 54 | gfp_t gfp_mask; |
55 | 55 | ||
56 | int may_writepage; | 56 | int may_writepage; |
57 | 57 | ||
58 | /* Can pages be swapped as part of reclaim? */ | 58 | /* Can pages be swapped as part of reclaim? */ |
59 | int may_swap; | 59 | int may_swap; |
60 | 60 | ||
61 | /* This context's SWAP_CLUSTER_MAX. If freeing memory for | 61 | /* This context's SWAP_CLUSTER_MAX. If freeing memory for |
62 | * suspend, we effectively ignore SWAP_CLUSTER_MAX. | 62 | * suspend, we effectively ignore SWAP_CLUSTER_MAX. |
63 | * In this context, it doesn't matter that we scan the | 63 | * In this context, it doesn't matter that we scan the |
64 | * whole list at once. */ | 64 | * whole list at once. */ |
65 | int swap_cluster_max; | 65 | int swap_cluster_max; |
66 | 66 | ||
67 | int swappiness; | 67 | int swappiness; |
68 | 68 | ||
69 | int all_unreclaimable; | 69 | int all_unreclaimable; |
70 | 70 | ||
71 | int order; | 71 | int order; |
72 | 72 | ||
73 | /* Which cgroup do we reclaim from */ | 73 | /* Which cgroup do we reclaim from */ |
74 | struct mem_cgroup *mem_cgroup; | 74 | struct mem_cgroup *mem_cgroup; |
75 | 75 | ||
76 | /* Pluggable isolate pages callback */ | 76 | /* Pluggable isolate pages callback */ |
77 | unsigned long (*isolate_pages)(unsigned long nr, struct list_head *dst, | 77 | unsigned long (*isolate_pages)(unsigned long nr, struct list_head *dst, |
78 | unsigned long *scanned, int order, int mode, | 78 | unsigned long *scanned, int order, int mode, |
79 | struct zone *z, struct mem_cgroup *mem_cont, | 79 | struct zone *z, struct mem_cgroup *mem_cont, |
80 | int active); | 80 | int active); |
81 | }; | 81 | }; |
82 | 82 | ||
83 | #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru)) | 83 | #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru)) |
84 | 84 | ||
85 | #ifdef ARCH_HAS_PREFETCH | 85 | #ifdef ARCH_HAS_PREFETCH |
86 | #define prefetch_prev_lru_page(_page, _base, _field) \ | 86 | #define prefetch_prev_lru_page(_page, _base, _field) \ |
87 | do { \ | 87 | do { \ |
88 | if ((_page)->lru.prev != _base) { \ | 88 | if ((_page)->lru.prev != _base) { \ |
89 | struct page *prev; \ | 89 | struct page *prev; \ |
90 | \ | 90 | \ |
91 | prev = lru_to_page(&(_page->lru)); \ | 91 | prev = lru_to_page(&(_page->lru)); \ |
92 | prefetch(&prev->_field); \ | 92 | prefetch(&prev->_field); \ |
93 | } \ | 93 | } \ |
94 | } while (0) | 94 | } while (0) |
95 | #else | 95 | #else |
96 | #define prefetch_prev_lru_page(_page, _base, _field) do { } while (0) | 96 | #define prefetch_prev_lru_page(_page, _base, _field) do { } while (0) |
97 | #endif | 97 | #endif |
98 | 98 | ||
99 | #ifdef ARCH_HAS_PREFETCHW | 99 | #ifdef ARCH_HAS_PREFETCHW |
100 | #define prefetchw_prev_lru_page(_page, _base, _field) \ | 100 | #define prefetchw_prev_lru_page(_page, _base, _field) \ |
101 | do { \ | 101 | do { \ |
102 | if ((_page)->lru.prev != _base) { \ | 102 | if ((_page)->lru.prev != _base) { \ |
103 | struct page *prev; \ | 103 | struct page *prev; \ |
104 | \ | 104 | \ |
105 | prev = lru_to_page(&(_page->lru)); \ | 105 | prev = lru_to_page(&(_page->lru)); \ |
106 | prefetchw(&prev->_field); \ | 106 | prefetchw(&prev->_field); \ |
107 | } \ | 107 | } \ |
108 | } while (0) | 108 | } while (0) |
109 | #else | 109 | #else |
110 | #define prefetchw_prev_lru_page(_page, _base, _field) do { } while (0) | 110 | #define prefetchw_prev_lru_page(_page, _base, _field) do { } while (0) |
111 | #endif | 111 | #endif |
112 | 112 | ||
113 | /* | 113 | /* |
114 | * From 0 .. 100. Higher means more swappy. | 114 | * From 0 .. 100. Higher means more swappy. |
115 | */ | 115 | */ |
116 | int vm_swappiness = 60; | 116 | int vm_swappiness = 60; |
117 | long vm_total_pages; /* The total number of pages which the VM controls */ | 117 | long vm_total_pages; /* The total number of pages which the VM controls */ |
118 | 118 | ||
119 | static LIST_HEAD(shrinker_list); | 119 | static LIST_HEAD(shrinker_list); |
120 | static DECLARE_RWSEM(shrinker_rwsem); | 120 | static DECLARE_RWSEM(shrinker_rwsem); |
121 | 121 | ||
122 | #ifdef CONFIG_CGROUP_MEM_RES_CTLR | 122 | #ifdef CONFIG_CGROUP_MEM_RES_CTLR |
123 | #define scan_global_lru(sc) (!(sc)->mem_cgroup) | 123 | #define scan_global_lru(sc) (!(sc)->mem_cgroup) |
124 | #else | 124 | #else |
125 | #define scan_global_lru(sc) (1) | 125 | #define scan_global_lru(sc) (1) |
126 | #endif | 126 | #endif |
127 | 127 | ||
128 | /* | 128 | /* |
129 | * Add a shrinker callback to be called from the vm | 129 | * Add a shrinker callback to be called from the vm |
130 | */ | 130 | */ |
131 | void register_shrinker(struct shrinker *shrinker) | 131 | void register_shrinker(struct shrinker *shrinker) |
132 | { | 132 | { |
133 | shrinker->nr = 0; | 133 | shrinker->nr = 0; |
134 | down_write(&shrinker_rwsem); | 134 | down_write(&shrinker_rwsem); |
135 | list_add_tail(&shrinker->list, &shrinker_list); | 135 | list_add_tail(&shrinker->list, &shrinker_list); |
136 | up_write(&shrinker_rwsem); | 136 | up_write(&shrinker_rwsem); |
137 | } | 137 | } |
138 | EXPORT_SYMBOL(register_shrinker); | 138 | EXPORT_SYMBOL(register_shrinker); |
139 | 139 | ||
140 | /* | 140 | /* |
141 | * Remove one | 141 | * Remove one |
142 | */ | 142 | */ |
143 | void unregister_shrinker(struct shrinker *shrinker) | 143 | void unregister_shrinker(struct shrinker *shrinker) |
144 | { | 144 | { |
145 | down_write(&shrinker_rwsem); | 145 | down_write(&shrinker_rwsem); |
146 | list_del(&shrinker->list); | 146 | list_del(&shrinker->list); |
147 | up_write(&shrinker_rwsem); | 147 | up_write(&shrinker_rwsem); |
148 | } | 148 | } |
149 | EXPORT_SYMBOL(unregister_shrinker); | 149 | EXPORT_SYMBOL(unregister_shrinker); |
150 | 150 | ||
151 | #define SHRINK_BATCH 128 | 151 | #define SHRINK_BATCH 128 |
152 | /* | 152 | /* |
153 | * Call the shrink functions to age shrinkable caches | 153 | * Call the shrink functions to age shrinkable caches |
154 | * | 154 | * |
155 | * Here we assume it costs one seek to replace a lru page and that it also | 155 | * Here we assume it costs one seek to replace a lru page and that it also |
156 | * takes a seek to recreate a cache object. With this in mind we age equal | 156 | * takes a seek to recreate a cache object. With this in mind we age equal |
157 | * percentages of the lru and ageable caches. This should balance the seeks | 157 | * percentages of the lru and ageable caches. This should balance the seeks |
158 | * generated by these structures. | 158 | * generated by these structures. |
159 | * | 159 | * |
160 | * If the VM encountered mapped pages on the LRU, it increases the pressure on | 160 | * If the VM encountered mapped pages on the LRU, it increases the pressure on |
161 | * slab to avoid swapping. | 161 | * slab to avoid swapping. |
162 | * | 162 | * |
163 | * We do weird things to avoid (scanned*seeks*entries) overflowing 32 bits. | 163 | * We do weird things to avoid (scanned*seeks*entries) overflowing 32 bits. |
164 | * | 164 | * |
165 | * `lru_pages' represents the number of on-LRU pages in all the zones which | 165 | * `lru_pages' represents the number of on-LRU pages in all the zones which |
166 | * are eligible for the caller's allocation attempt. It is used for balancing | 166 | * are eligible for the caller's allocation attempt. It is used for balancing |
167 | * slab reclaim versus page reclaim. | 167 | * slab reclaim versus page reclaim. |
168 | * | 168 | * |
169 | * Returns the number of slab objects which we shrunk. | 169 | * Returns the number of slab objects which we shrunk. |
170 | */ | 170 | */ |
171 | unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask, | 171 | unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask, |
172 | unsigned long lru_pages) | 172 | unsigned long lru_pages) |
173 | { | 173 | { |
174 | struct shrinker *shrinker; | 174 | struct shrinker *shrinker; |
175 | unsigned long ret = 0; | 175 | unsigned long ret = 0; |
176 | 176 | ||
177 | if (scanned == 0) | 177 | if (scanned == 0) |
178 | scanned = SWAP_CLUSTER_MAX; | 178 | scanned = SWAP_CLUSTER_MAX; |
179 | 179 | ||
180 | if (!down_read_trylock(&shrinker_rwsem)) | 180 | if (!down_read_trylock(&shrinker_rwsem)) |
181 | return 1; /* Assume we'll be able to shrink next time */ | 181 | return 1; /* Assume we'll be able to shrink next time */ |
182 | 182 | ||
183 | list_for_each_entry(shrinker, &shrinker_list, list) { | 183 | list_for_each_entry(shrinker, &shrinker_list, list) { |
184 | unsigned long long delta; | 184 | unsigned long long delta; |
185 | unsigned long total_scan; | 185 | unsigned long total_scan; |
186 | unsigned long max_pass = (*shrinker->shrink)(0, gfp_mask); | 186 | unsigned long max_pass = (*shrinker->shrink)(0, gfp_mask); |
187 | 187 | ||
188 | delta = (4 * scanned) / shrinker->seeks; | 188 | delta = (4 * scanned) / shrinker->seeks; |
189 | delta *= max_pass; | 189 | delta *= max_pass; |
190 | do_div(delta, lru_pages + 1); | 190 | do_div(delta, lru_pages + 1); |
191 | shrinker->nr += delta; | 191 | shrinker->nr += delta; |
192 | if (shrinker->nr < 0) { | 192 | if (shrinker->nr < 0) { |
193 | printk(KERN_ERR "%s: nr=%ld\n", | 193 | printk(KERN_ERR "%s: nr=%ld\n", |
194 | __FUNCTION__, shrinker->nr); | 194 | __FUNCTION__, shrinker->nr); |
195 | shrinker->nr = max_pass; | 195 | shrinker->nr = max_pass; |
196 | } | 196 | } |
197 | 197 | ||
198 | /* | 198 | /* |
199 | * Avoid risking looping forever due to a too-large nr value: | 199 | * Avoid risking looping forever due to a too-large nr value: |
200 | * never try to free more than twice the estimated number of | 200 | * never try to free more than twice the estimated number of |
201 | * freeable entries. | 201 | * freeable entries. |
202 | */ | 202 | */ |
203 | if (shrinker->nr > max_pass * 2) | 203 | if (shrinker->nr > max_pass * 2) |
204 | shrinker->nr = max_pass * 2; | 204 | shrinker->nr = max_pass * 2; |
205 | 205 | ||
206 | total_scan = shrinker->nr; | 206 | total_scan = shrinker->nr; |
207 | shrinker->nr = 0; | 207 | shrinker->nr = 0; |
208 | 208 | ||
209 | while (total_scan >= SHRINK_BATCH) { | 209 | while (total_scan >= SHRINK_BATCH) { |
210 | long this_scan = SHRINK_BATCH; | 210 | long this_scan = SHRINK_BATCH; |
211 | int shrink_ret; | 211 | int shrink_ret; |
212 | int nr_before; | 212 | int nr_before; |
213 | 213 | ||
214 | nr_before = (*shrinker->shrink)(0, gfp_mask); | 214 | nr_before = (*shrinker->shrink)(0, gfp_mask); |
215 | shrink_ret = (*shrinker->shrink)(this_scan, gfp_mask); | 215 | shrink_ret = (*shrinker->shrink)(this_scan, gfp_mask); |
216 | if (shrink_ret == -1) | 216 | if (shrink_ret == -1) |
217 | break; | 217 | break; |
218 | if (shrink_ret < nr_before) | 218 | if (shrink_ret < nr_before) |
219 | ret += nr_before - shrink_ret; | 219 | ret += nr_before - shrink_ret; |
220 | count_vm_events(SLABS_SCANNED, this_scan); | 220 | count_vm_events(SLABS_SCANNED, this_scan); |
221 | total_scan -= this_scan; | 221 | total_scan -= this_scan; |
222 | 222 | ||
223 | cond_resched(); | 223 | cond_resched(); |
224 | } | 224 | } |
225 | 225 | ||
226 | shrinker->nr += total_scan; | 226 | shrinker->nr += total_scan; |
227 | } | 227 | } |
228 | up_read(&shrinker_rwsem); | 228 | up_read(&shrinker_rwsem); |
229 | return ret; | 229 | return ret; |
230 | } | 230 | } |
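The pressure arithmetic in shrink_slab() asks each shrinker to scan a number of objects proportional to the fraction of eligible LRU pages just scanned, divided by the shrinker's seek cost, and then works through that total in SHRINK_BATCH chunks. A userspace sketch with made-up figures:

/* Userspace sketch of the shrink_slab() pressure formula; figures are assumptions. */
#include <stdio.h>

#define SHRINK_BATCH 128

int main(void)
{
	unsigned long scanned   = 4096;		/* LRU pages scanned this pass */
	unsigned long lru_pages = 1UL << 20;	/* eligible LRU pages */
	unsigned long max_pass  = 50000;	/* freeable objects in the cache */
	int seeks = 2;				/* typical shrinker->seeks */

	unsigned long long delta = (4ULL * scanned) / seeks;
	delta = delta * max_pass / (lru_pages + 1);

	printf("ask the shrinker for ~%llu objects, in %llu batches of %d\n",
	       delta, delta / SHRINK_BATCH, SHRINK_BATCH);
	return 0;
}

Doubling scanned (or halving lru_pages) doubles the request, which is how slab pressure is kept in step with LRU pressure.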
231 | 231 | ||
232 | /* Called without lock on whether page is mapped, so answer is unstable */ | 232 | /* Called without lock on whether page is mapped, so answer is unstable */ |
233 | static inline int page_mapping_inuse(struct page *page) | 233 | static inline int page_mapping_inuse(struct page *page) |
234 | { | 234 | { |
235 | struct address_space *mapping; | 235 | struct address_space *mapping; |
236 | 236 | ||
237 | /* Page is in somebody's page tables. */ | 237 | /* Page is in somebody's page tables. */ |
238 | if (page_mapped(page)) | 238 | if (page_mapped(page)) |
239 | return 1; | 239 | return 1; |
240 | 240 | ||
241 | /* Be more reluctant to reclaim swapcache than pagecache */ | 241 | /* Be more reluctant to reclaim swapcache than pagecache */ |
242 | if (PageSwapCache(page)) | 242 | if (PageSwapCache(page)) |
243 | return 1; | 243 | return 1; |
244 | 244 | ||
245 | mapping = page_mapping(page); | 245 | mapping = page_mapping(page); |
246 | if (!mapping) | 246 | if (!mapping) |
247 | return 0; | 247 | return 0; |
248 | 248 | ||
249 | /* File is mmap'd by somebody? */ | 249 | /* File is mmap'd by somebody? */ |
250 | return mapping_mapped(mapping); | 250 | return mapping_mapped(mapping); |
251 | } | 251 | } |
252 | 252 | ||
253 | static inline int is_page_cache_freeable(struct page *page) | 253 | static inline int is_page_cache_freeable(struct page *page) |
254 | { | 254 | { |
255 | return page_count(page) - !!PagePrivate(page) == 2; | 255 | return page_count(page) - !!PagePrivate(page) == 2; |
256 | } | 256 | } |
257 | 257 | ||
258 | static int may_write_to_queue(struct backing_dev_info *bdi) | 258 | static int may_write_to_queue(struct backing_dev_info *bdi) |
259 | { | 259 | { |
260 | if (current->flags & PF_SWAPWRITE) | 260 | if (current->flags & PF_SWAPWRITE) |
261 | return 1; | 261 | return 1; |
262 | if (!bdi_write_congested(bdi)) | 262 | if (!bdi_write_congested(bdi)) |
263 | return 1; | 263 | return 1; |
264 | if (bdi == current->backing_dev_info) | 264 | if (bdi == current->backing_dev_info) |
265 | return 1; | 265 | return 1; |
266 | return 0; | 266 | return 0; |
267 | } | 267 | } |
268 | 268 | ||
269 | /* | 269 | /* |
270 | * We detected a synchronous write error writing a page out. Probably | 270 | * We detected a synchronous write error writing a page out. Probably |
271 | * -ENOSPC. We need to propagate that into the address_space for a subsequent | 271 | * -ENOSPC. We need to propagate that into the address_space for a subsequent |
272 | * fsync(), msync() or close(). | 272 | * fsync(), msync() or close(). |
273 | * | 273 | * |
274 | * The tricky part is that after writepage we cannot touch the mapping: nothing | 274 | * The tricky part is that after writepage we cannot touch the mapping: nothing |
275 | * prevents it from being freed up. But we have a ref on the page and once | 275 | * prevents it from being freed up. But we have a ref on the page and once |
276 | * that page is locked, the mapping is pinned. | 276 | * that page is locked, the mapping is pinned. |
277 | * | 277 | * |
278 | * We're allowed to run sleeping lock_page() here because we know the caller has | 278 | * We're allowed to run sleeping lock_page() here because we know the caller has |
279 | * __GFP_FS. | 279 | * __GFP_FS. |
280 | */ | 280 | */ |
281 | static void handle_write_error(struct address_space *mapping, | 281 | static void handle_write_error(struct address_space *mapping, |
282 | struct page *page, int error) | 282 | struct page *page, int error) |
283 | { | 283 | { |
284 | lock_page(page); | 284 | lock_page(page); |
285 | if (page_mapping(page) == mapping) | 285 | if (page_mapping(page) == mapping) |
286 | mapping_set_error(mapping, error); | 286 | mapping_set_error(mapping, error); |
287 | unlock_page(page); | 287 | unlock_page(page); |
288 | } | 288 | } |
289 | 289 | ||
290 | /* Request for sync pageout. */ | 290 | /* Request for sync pageout. */ |
291 | enum pageout_io { | 291 | enum pageout_io { |
292 | PAGEOUT_IO_ASYNC, | 292 | PAGEOUT_IO_ASYNC, |
293 | PAGEOUT_IO_SYNC, | 293 | PAGEOUT_IO_SYNC, |
294 | }; | 294 | }; |
295 | 295 | ||
296 | /* possible outcome of pageout() */ | 296 | /* possible outcome of pageout() */ |
297 | typedef enum { | 297 | typedef enum { |
298 | /* failed to write page out, page is locked */ | 298 | /* failed to write page out, page is locked */ |
299 | PAGE_KEEP, | 299 | PAGE_KEEP, |
300 | /* move page to the active list, page is locked */ | 300 | /* move page to the active list, page is locked */ |
301 | PAGE_ACTIVATE, | 301 | PAGE_ACTIVATE, |
302 | /* page has been sent to the disk successfully, page is unlocked */ | 302 | /* page has been sent to the disk successfully, page is unlocked */ |
303 | PAGE_SUCCESS, | 303 | PAGE_SUCCESS, |
304 | /* page is clean and locked */ | 304 | /* page is clean and locked */ |
305 | PAGE_CLEAN, | 305 | PAGE_CLEAN, |
306 | } pageout_t; | 306 | } pageout_t; |
307 | 307 | ||
308 | /* | 308 | /* |
309 | * pageout is called by shrink_page_list() for each dirty page. | 309 | * pageout is called by shrink_page_list() for each dirty page. |
310 | * Calls ->writepage(). | 310 | * Calls ->writepage(). |
311 | */ | 311 | */ |
312 | static pageout_t pageout(struct page *page, struct address_space *mapping, | 312 | static pageout_t pageout(struct page *page, struct address_space *mapping, |
313 | enum pageout_io sync_writeback) | 313 | enum pageout_io sync_writeback) |
314 | { | 314 | { |
315 | /* | 315 | /* |
316 | * If the page is dirty, only perform writeback if that write | 316 | * If the page is dirty, only perform writeback if that write |
317 | * will be non-blocking. To prevent this allocation from being | 317 | * will be non-blocking. To prevent this allocation from being |
318 | * stalled by pagecache activity. But note that there may be | 318 | * stalled by pagecache activity. But note that there may be |
319 | * stalls if we need to run get_block(). We could test | 319 | * stalls if we need to run get_block(). We could test |
320 | * PagePrivate for that. | 320 | * PagePrivate for that. |
321 | * | 321 | * |
322 | * If this process is currently in generic_file_write() against | 322 | * If this process is currently in generic_file_write() against |
323 | * this page's queue, we can perform writeback even if that | 323 | * this page's queue, we can perform writeback even if that |
324 | * will block. | 324 | * will block. |
325 | * | 325 | * |
326 | * If the page is swapcache, write it back even if that would | 326 | * If the page is swapcache, write it back even if that would |
327 | * block, for some throttling. This happens by accident, because | 327 | * block, for some throttling. This happens by accident, because |
328 | * swap_backing_dev_info is bust: it doesn't reflect the | 328 | * swap_backing_dev_info is bust: it doesn't reflect the |
329 | * congestion state of the swapdevs. Easy to fix, if needed. | 329 | * congestion state of the swapdevs. Easy to fix, if needed. |
330 | * See swapfile.c:page_queue_congested(). | 330 | * See swapfile.c:page_queue_congested(). |
331 | */ | 331 | */ |
332 | if (!is_page_cache_freeable(page)) | 332 | if (!is_page_cache_freeable(page)) |
333 | return PAGE_KEEP; | 333 | return PAGE_KEEP; |
334 | if (!mapping) { | 334 | if (!mapping) { |
335 | /* | 335 | /* |
336 | * Some data journaling orphaned pages can have | 336 | * Some data journaling orphaned pages can have |
337 | * page->mapping == NULL while being dirty with clean buffers. | 337 | * page->mapping == NULL while being dirty with clean buffers. |
338 | */ | 338 | */ |
339 | if (PagePrivate(page)) { | 339 | if (PagePrivate(page)) { |
340 | if (try_to_free_buffers(page)) { | 340 | if (try_to_free_buffers(page)) { |
341 | ClearPageDirty(page); | 341 | ClearPageDirty(page); |
342 | printk("%s: orphaned page\n", __FUNCTION__); | 342 | printk("%s: orphaned page\n", __FUNCTION__); |
343 | return PAGE_CLEAN; | 343 | return PAGE_CLEAN; |
344 | } | 344 | } |
345 | } | 345 | } |
346 | return PAGE_KEEP; | 346 | return PAGE_KEEP; |
347 | } | 347 | } |
348 | if (mapping->a_ops->writepage == NULL) | 348 | if (mapping->a_ops->writepage == NULL) |
349 | return PAGE_ACTIVATE; | 349 | return PAGE_ACTIVATE; |
350 | if (!may_write_to_queue(mapping->backing_dev_info)) | 350 | if (!may_write_to_queue(mapping->backing_dev_info)) |
351 | return PAGE_KEEP; | 351 | return PAGE_KEEP; |
352 | 352 | ||
353 | if (clear_page_dirty_for_io(page)) { | 353 | if (clear_page_dirty_for_io(page)) { |
354 | int res; | 354 | int res; |
355 | struct writeback_control wbc = { | 355 | struct writeback_control wbc = { |
356 | .sync_mode = WB_SYNC_NONE, | 356 | .sync_mode = WB_SYNC_NONE, |
357 | .nr_to_write = SWAP_CLUSTER_MAX, | 357 | .nr_to_write = SWAP_CLUSTER_MAX, |
358 | .range_start = 0, | 358 | .range_start = 0, |
359 | .range_end = LLONG_MAX, | 359 | .range_end = LLONG_MAX, |
360 | .nonblocking = 1, | 360 | .nonblocking = 1, |
361 | .for_reclaim = 1, | 361 | .for_reclaim = 1, |
362 | }; | 362 | }; |
363 | 363 | ||
364 | SetPageReclaim(page); | 364 | SetPageReclaim(page); |
365 | res = mapping->a_ops->writepage(page, &wbc); | 365 | res = mapping->a_ops->writepage(page, &wbc); |
366 | if (res < 0) | 366 | if (res < 0) |
367 | handle_write_error(mapping, page, res); | 367 | handle_write_error(mapping, page, res); |
368 | if (res == AOP_WRITEPAGE_ACTIVATE) { | 368 | if (res == AOP_WRITEPAGE_ACTIVATE) { |
369 | ClearPageReclaim(page); | 369 | ClearPageReclaim(page); |
370 | return PAGE_ACTIVATE; | 370 | return PAGE_ACTIVATE; |
371 | } | 371 | } |
372 | 372 | ||
373 | /* | 373 | /* |
374 | * Wait on writeback if requested to. This happens when | 374 | * Wait on writeback if requested to. This happens when |
375 | * direct reclaiming a large contiguous area and the | 375 | * direct reclaiming a large contiguous area and the |
376 | * first attempt to free a range of pages fails. | 376 | * first attempt to free a range of pages fails. |
377 | */ | 377 | */ |
378 | if (PageWriteback(page) && sync_writeback == PAGEOUT_IO_SYNC) | 378 | if (PageWriteback(page) && sync_writeback == PAGEOUT_IO_SYNC) |
379 | wait_on_page_writeback(page); | 379 | wait_on_page_writeback(page); |
380 | 380 | ||
381 | if (!PageWriteback(page)) { | 381 | if (!PageWriteback(page)) { |
382 | /* synchronous write or broken a_ops? */ | 382 | /* synchronous write or broken a_ops? */ |
383 | ClearPageReclaim(page); | 383 | ClearPageReclaim(page); |
384 | } | 384 | } |
385 | inc_zone_page_state(page, NR_VMSCAN_WRITE); | 385 | inc_zone_page_state(page, NR_VMSCAN_WRITE); |
386 | return PAGE_SUCCESS; | 386 | return PAGE_SUCCESS; |
387 | } | 387 | } |
388 | 388 | ||
389 | return PAGE_CLEAN; | 389 | return PAGE_CLEAN; |
390 | } | 390 | } |
391 | 391 | ||
392 | /* | 392 | /* |
393 | * Attempt to detach a locked page from its ->mapping. If it is dirty or if | 393 | * Attempt to detach a locked page from its ->mapping. If it is dirty or if |
394 | * someone else has a ref on the page, abort and return 0. If it was | 394 | * someone else has a ref on the page, abort and return 0. If it was |
395 | * successfully detached, return 1. Assumes the caller has a single ref on | 395 | * successfully detached, return 1. Assumes the caller has a single ref on |
396 | * this page. | 396 | * this page. |
397 | */ | 397 | */ |
398 | int remove_mapping(struct address_space *mapping, struct page *page) | 398 | int remove_mapping(struct address_space *mapping, struct page *page) |
399 | { | 399 | { |
400 | BUG_ON(!PageLocked(page)); | 400 | BUG_ON(!PageLocked(page)); |
401 | BUG_ON(mapping != page_mapping(page)); | 401 | BUG_ON(mapping != page_mapping(page)); |
402 | 402 | ||
403 | write_lock_irq(&mapping->tree_lock); | 403 | write_lock_irq(&mapping->tree_lock); |
404 | /* | 404 | /* |
405 | * The non racy check for a busy page. | 405 | * The non racy check for a busy page. |
406 | * | 406 | * |
407 | * Must be careful with the order of the tests. When someone has | 407 | * Must be careful with the order of the tests. When someone has |
408 | * a ref to the page, it may be possible that they dirty it then | 408 | * a ref to the page, it may be possible that they dirty it then |
409 | * drop the reference. So if PageDirty is tested before page_count | 409 | * drop the reference. So if PageDirty is tested before page_count |
410 | * here, then the following race may occur: | 410 | * here, then the following race may occur: |
411 | * | 411 | * |
412 | * get_user_pages(&page); | 412 | * get_user_pages(&page); |
413 | * [user mapping goes away] | 413 | * [user mapping goes away] |
414 | * write_to(page); | 414 | * write_to(page); |
415 | * !PageDirty(page) [good] | 415 | * !PageDirty(page) [good] |
416 | * SetPageDirty(page); | 416 | * SetPageDirty(page); |
417 | * put_page(page); | 417 | * put_page(page); |
418 | * !page_count(page) [good, discard it] | 418 | * !page_count(page) [good, discard it] |
419 | * | 419 | * |
420 | * [oops, our write_to data is lost] | 420 | * [oops, our write_to data is lost] |
421 | * | 421 | * |
422 | * Reversing the order of the tests ensures such a situation cannot | 422 | * Reversing the order of the tests ensures such a situation cannot |
423 | * escape unnoticed. The smp_rmb is needed to ensure the page->flags | 423 | * escape unnoticed. The smp_rmb is needed to ensure the page->flags |
424 | * load is not satisfied before that of page->_count. | 424 | * load is not satisfied before that of page->_count. |
425 | * | 425 | * |
426 | * Note that if SetPageDirty is always performed via set_page_dirty, | 426 | * Note that if SetPageDirty is always performed via set_page_dirty, |
427 | * and thus under tree_lock, then this ordering is not required. | 427 | * and thus under tree_lock, then this ordering is not required. |
428 | */ | 428 | */ |
429 | if (unlikely(page_count(page) != 2)) | 429 | if (unlikely(page_count(page) != 2)) |
430 | goto cannot_free; | 430 | goto cannot_free; |
431 | smp_rmb(); | 431 | smp_rmb(); |
432 | if (unlikely(PageDirty(page))) | 432 | if (unlikely(PageDirty(page))) |
433 | goto cannot_free; | 433 | goto cannot_free; |
434 | 434 | ||
435 | if (PageSwapCache(page)) { | 435 | if (PageSwapCache(page)) { |
436 | swp_entry_t swap = { .val = page_private(page) }; | 436 | swp_entry_t swap = { .val = page_private(page) }; |
437 | __delete_from_swap_cache(page); | 437 | __delete_from_swap_cache(page); |
438 | write_unlock_irq(&mapping->tree_lock); | 438 | write_unlock_irq(&mapping->tree_lock); |
439 | swap_free(swap); | 439 | swap_free(swap); |
440 | __put_page(page); /* The pagecache ref */ | 440 | __put_page(page); /* The pagecache ref */ |
441 | return 1; | 441 | return 1; |
442 | } | 442 | } |
443 | 443 | ||
444 | __remove_from_page_cache(page); | 444 | __remove_from_page_cache(page); |
445 | write_unlock_irq(&mapping->tree_lock); | 445 | write_unlock_irq(&mapping->tree_lock); |
446 | __put_page(page); | 446 | __put_page(page); |
447 | return 1; | 447 | return 1; |
448 | 448 | ||
449 | cannot_free: | 449 | cannot_free: |
450 | write_unlock_irq(&mapping->tree_lock); | 450 | write_unlock_irq(&mapping->tree_lock); |
451 | return 0; | 451 | return 0; |
452 | } | 452 | } |
453 | 453 | ||
454 | /* | 454 | /* |
455 | * shrink_page_list() returns the number of reclaimed pages | 455 | * shrink_page_list() returns the number of reclaimed pages |
456 | */ | 456 | */ |
457 | static unsigned long shrink_page_list(struct list_head *page_list, | 457 | static unsigned long shrink_page_list(struct list_head *page_list, |
458 | struct scan_control *sc, | 458 | struct scan_control *sc, |
459 | enum pageout_io sync_writeback) | 459 | enum pageout_io sync_writeback) |
460 | { | 460 | { |
461 | LIST_HEAD(ret_pages); | 461 | LIST_HEAD(ret_pages); |
462 | struct pagevec freed_pvec; | 462 | struct pagevec freed_pvec; |
463 | int pgactivate = 0; | 463 | int pgactivate = 0; |
464 | unsigned long nr_reclaimed = 0; | 464 | unsigned long nr_reclaimed = 0; |
465 | 465 | ||
466 | cond_resched(); | 466 | cond_resched(); |
467 | 467 | ||
468 | pagevec_init(&freed_pvec, 1); | 468 | pagevec_init(&freed_pvec, 1); |
469 | while (!list_empty(page_list)) { | 469 | while (!list_empty(page_list)) { |
470 | struct address_space *mapping; | 470 | struct address_space *mapping; |
471 | struct page *page; | 471 | struct page *page; |
472 | int may_enter_fs; | 472 | int may_enter_fs; |
473 | int referenced; | 473 | int referenced; |
474 | 474 | ||
475 | cond_resched(); | 475 | cond_resched(); |
476 | 476 | ||
477 | page = lru_to_page(page_list); | 477 | page = lru_to_page(page_list); |
478 | list_del(&page->lru); | 478 | list_del(&page->lru); |
479 | 479 | ||
480 | if (TestSetPageLocked(page)) | 480 | if (TestSetPageLocked(page)) |
481 | goto keep; | 481 | goto keep; |
482 | 482 | ||
483 | VM_BUG_ON(PageActive(page)); | 483 | VM_BUG_ON(PageActive(page)); |
484 | 484 | ||
485 | sc->nr_scanned++; | 485 | sc->nr_scanned++; |
486 | 486 | ||
487 | if (!sc->may_swap && page_mapped(page)) | 487 | if (!sc->may_swap && page_mapped(page)) |
488 | goto keep_locked; | 488 | goto keep_locked; |
489 | 489 | ||
490 | /* Double the slab pressure for mapped and swapcache pages */ | 490 | /* Double the slab pressure for mapped and swapcache pages */ |
491 | if (page_mapped(page) || PageSwapCache(page)) | 491 | if (page_mapped(page) || PageSwapCache(page)) |
492 | sc->nr_scanned++; | 492 | sc->nr_scanned++; |
493 | 493 | ||
494 | may_enter_fs = (sc->gfp_mask & __GFP_FS) || | 494 | may_enter_fs = (sc->gfp_mask & __GFP_FS) || |
495 | (PageSwapCache(page) && (sc->gfp_mask & __GFP_IO)); | 495 | (PageSwapCache(page) && (sc->gfp_mask & __GFP_IO)); |
496 | 496 | ||
497 | if (PageWriteback(page)) { | 497 | if (PageWriteback(page)) { |
498 | /* | 498 | /* |
499 | * Synchronous reclaim is performed in two passes, | 499 | * Synchronous reclaim is performed in two passes, |
500 | * first an asynchronous pass over the list to | 500 | * first an asynchronous pass over the list to |
501 | * start parallel writeback, and a second synchronous | 501 | * start parallel writeback, and a second synchronous |
502 | * pass to wait for the IO to complete. Wait here | 502 | * pass to wait for the IO to complete. Wait here |
503 | * for any page for which writeback has already | 503 | * for any page for which writeback has already |
504 | * started. | 504 | * started. |
505 | */ | 505 | */ |
506 | if (sync_writeback == PAGEOUT_IO_SYNC && may_enter_fs) | 506 | if (sync_writeback == PAGEOUT_IO_SYNC && may_enter_fs) |
507 | wait_on_page_writeback(page); | 507 | wait_on_page_writeback(page); |
508 | else | 508 | else |
509 | goto keep_locked; | 509 | goto keep_locked; |
510 | } | 510 | } |
511 | 511 | ||
512 | referenced = page_referenced(page, 1, sc->mem_cgroup); | 512 | referenced = page_referenced(page, 1, sc->mem_cgroup); |
513 | /* In active use or really unfreeable? Activate it. */ | 513 | /* In active use or really unfreeable? Activate it. */ |
514 | if (sc->order <= PAGE_ALLOC_COSTLY_ORDER && | 514 | if (sc->order <= PAGE_ALLOC_COSTLY_ORDER && |
515 | referenced && page_mapping_inuse(page)) | 515 | referenced && page_mapping_inuse(page)) |
516 | goto activate_locked; | 516 | goto activate_locked; |
517 | 517 | ||
518 | #ifdef CONFIG_SWAP | 518 | #ifdef CONFIG_SWAP |
519 | /* | 519 | /* |
520 | * Anonymous process memory has backing store? | 520 | * Anonymous process memory has backing store? |
521 | * Try to allocate it some swap space here. | 521 | * Try to allocate it some swap space here. |
522 | */ | 522 | */ |
523 | if (PageAnon(page) && !PageSwapCache(page)) | 523 | if (PageAnon(page) && !PageSwapCache(page)) |
524 | if (!add_to_swap(page, GFP_ATOMIC)) | 524 | if (!add_to_swap(page, GFP_ATOMIC)) |
525 | goto activate_locked; | 525 | goto activate_locked; |
526 | #endif /* CONFIG_SWAP */ | 526 | #endif /* CONFIG_SWAP */ |
527 | 527 | ||
528 | mapping = page_mapping(page); | 528 | mapping = page_mapping(page); |
529 | 529 | ||
530 | /* | 530 | /* |
531 | * The page is mapped into the page tables of one or more | 531 | * The page is mapped into the page tables of one or more |
532 | * processes. Try to unmap it here. | 532 | * processes. Try to unmap it here. |
533 | */ | 533 | */ |
534 | if (page_mapped(page) && mapping) { | 534 | if (page_mapped(page) && mapping) { |
535 | switch (try_to_unmap(page, 0)) { | 535 | switch (try_to_unmap(page, 0)) { |
536 | case SWAP_FAIL: | 536 | case SWAP_FAIL: |
537 | goto activate_locked; | 537 | goto activate_locked; |
538 | case SWAP_AGAIN: | 538 | case SWAP_AGAIN: |
539 | goto keep_locked; | 539 | goto keep_locked; |
540 | case SWAP_SUCCESS: | 540 | case SWAP_SUCCESS: |
541 | ; /* try to free the page below */ | 541 | ; /* try to free the page below */ |
542 | } | 542 | } |
543 | } | 543 | } |
544 | 544 | ||
545 | if (PageDirty(page)) { | 545 | if (PageDirty(page)) { |
546 | if (sc->order <= PAGE_ALLOC_COSTLY_ORDER && referenced) | 546 | if (sc->order <= PAGE_ALLOC_COSTLY_ORDER && referenced) |
547 | goto keep_locked; | 547 | goto keep_locked; |
548 | if (!may_enter_fs) | 548 | if (!may_enter_fs) |
549 | goto keep_locked; | 549 | goto keep_locked; |
550 | if (!sc->may_writepage) | 550 | if (!sc->may_writepage) |
551 | goto keep_locked; | 551 | goto keep_locked; |
552 | 552 | ||
553 | /* Page is dirty, try to write it out here */ | 553 | /* Page is dirty, try to write it out here */ |
554 | switch (pageout(page, mapping, sync_writeback)) { | 554 | switch (pageout(page, mapping, sync_writeback)) { |
555 | case PAGE_KEEP: | 555 | case PAGE_KEEP: |
556 | goto keep_locked; | 556 | goto keep_locked; |
557 | case PAGE_ACTIVATE: | 557 | case PAGE_ACTIVATE: |
558 | goto activate_locked; | 558 | goto activate_locked; |
559 | case PAGE_SUCCESS: | 559 | case PAGE_SUCCESS: |
560 | if (PageWriteback(page) || PageDirty(page)) | 560 | if (PageWriteback(page) || PageDirty(page)) |
561 | goto keep; | 561 | goto keep; |
562 | /* | 562 | /* |
563 | * A synchronous write - probably a ramdisk. Go | 563 | * A synchronous write - probably a ramdisk. Go |
564 | * ahead and try to reclaim the page. | 564 | * ahead and try to reclaim the page. |
565 | */ | 565 | */ |
566 | if (TestSetPageLocked(page)) | 566 | if (TestSetPageLocked(page)) |
567 | goto keep; | 567 | goto keep; |
568 | if (PageDirty(page) || PageWriteback(page)) | 568 | if (PageDirty(page) || PageWriteback(page)) |
569 | goto keep_locked; | 569 | goto keep_locked; |
570 | mapping = page_mapping(page); | 570 | mapping = page_mapping(page); |
571 | case PAGE_CLEAN: | 571 | case PAGE_CLEAN: |
572 | ; /* try to free the page below */ | 572 | ; /* try to free the page below */ |
573 | } | 573 | } |
574 | } | 574 | } |
575 | 575 | ||
576 | /* | 576 | /* |
577 | * If the page has buffers, try to free the buffer mappings | 577 | * If the page has buffers, try to free the buffer mappings |
578 | * associated with this page. If we succeed we try to free | 578 | * associated with this page. If we succeed we try to free |
579 | * the page as well. | 579 | * the page as well. |
580 | * | 580 | * |
581 | * We do this even if the page is PageDirty(). | 581 | * We do this even if the page is PageDirty(). |
582 | * try_to_release_page() does not perform I/O, but it is | 582 | * try_to_release_page() does not perform I/O, but it is |
583 | * possible for a page to have PageDirty set, but it is actually | 583 | * possible for a page to have PageDirty set, but it is actually |
584 | * clean (all its buffers are clean). This happens if the | 584 | * clean (all its buffers are clean). This happens if the |
585 | * buffers were written out directly, with submit_bh(). ext3 | 585 | * buffers were written out directly, with submit_bh(). ext3 |
586 | * will do this, as well as the blockdev mapping. | 586 | * will do this, as well as the blockdev mapping. |
587 | * try_to_release_page() will discover that cleanness and will | 587 | * try_to_release_page() will discover that cleanness and will |
588 | * drop the buffers and mark the page clean - it can be freed. | 588 | * drop the buffers and mark the page clean - it can be freed. |
589 | * | 589 | * |
590 | * Rarely, pages can have buffers and no ->mapping. These are | 590 | * Rarely, pages can have buffers and no ->mapping. These are |
591 | * the pages which were not successfully invalidated in | 591 | * the pages which were not successfully invalidated in |
592 | * truncate_complete_page(). We try to drop those buffers here | 592 | * truncate_complete_page(). We try to drop those buffers here |
593 | * and if that worked, and the page is no longer mapped into | 593 | * and if that worked, and the page is no longer mapped into |
594 | * process address space (page_count == 1) it can be freed. | 594 | * process address space (page_count == 1) it can be freed. |
595 | * Otherwise, leave the page on the LRU so it is swappable. | 595 | * Otherwise, leave the page on the LRU so it is swappable. |
596 | */ | 596 | */ |
597 | if (PagePrivate(page)) { | 597 | if (PagePrivate(page)) { |
598 | if (!try_to_release_page(page, sc->gfp_mask)) | 598 | if (!try_to_release_page(page, sc->gfp_mask)) |
599 | goto activate_locked; | 599 | goto activate_locked; |
600 | if (!mapping && page_count(page) == 1) | 600 | if (!mapping && page_count(page) == 1) |
601 | goto free_it; | 601 | goto free_it; |
602 | } | 602 | } |
603 | 603 | ||
604 | if (!mapping || !remove_mapping(mapping, page)) | 604 | if (!mapping || !remove_mapping(mapping, page)) |
605 | goto keep_locked; | 605 | goto keep_locked; |
606 | 606 | ||
607 | free_it: | 607 | free_it: |
608 | unlock_page(page); | 608 | unlock_page(page); |
609 | nr_reclaimed++; | 609 | nr_reclaimed++; |
610 | if (!pagevec_add(&freed_pvec, page)) | 610 | if (!pagevec_add(&freed_pvec, page)) |
611 | __pagevec_release_nonlru(&freed_pvec); | 611 | __pagevec_release_nonlru(&freed_pvec); |
612 | continue; | 612 | continue; |
613 | 613 | ||
614 | activate_locked: | 614 | activate_locked: |
615 | SetPageActive(page); | 615 | SetPageActive(page); |
616 | pgactivate++; | 616 | pgactivate++; |
617 | keep_locked: | 617 | keep_locked: |
618 | unlock_page(page); | 618 | unlock_page(page); |
619 | keep: | 619 | keep: |
620 | list_add(&page->lru, &ret_pages); | 620 | list_add(&page->lru, &ret_pages); |
621 | VM_BUG_ON(PageLRU(page)); | 621 | VM_BUG_ON(PageLRU(page)); |
622 | } | 622 | } |
623 | list_splice(&ret_pages, page_list); | 623 | list_splice(&ret_pages, page_list); |
624 | if (pagevec_count(&freed_pvec)) | 624 | if (pagevec_count(&freed_pvec)) |
625 | __pagevec_release_nonlru(&freed_pvec); | 625 | __pagevec_release_nonlru(&freed_pvec); |
626 | count_vm_events(PGACTIVATE, pgactivate); | 626 | count_vm_events(PGACTIVATE, pgactivate); |
627 | return nr_reclaimed; | 627 | return nr_reclaimed; |
628 | } | 628 | } |
629 | 629 | ||
630 | /* LRU Isolation modes. */ | 630 | /* LRU Isolation modes. */ |
631 | #define ISOLATE_INACTIVE 0 /* Isolate inactive pages. */ | 631 | #define ISOLATE_INACTIVE 0 /* Isolate inactive pages. */ |
632 | #define ISOLATE_ACTIVE 1 /* Isolate active pages. */ | 632 | #define ISOLATE_ACTIVE 1 /* Isolate active pages. */ |
633 | #define ISOLATE_BOTH 2 /* Isolate both active and inactive pages. */ | 633 | #define ISOLATE_BOTH 2 /* Isolate both active and inactive pages. */ |
634 | 634 | ||
635 | /* | 635 | /* |
636 | * Attempt to remove the specified page from its LRU. Only take this page | 636 | * Attempt to remove the specified page from its LRU. Only take this page |
637 | * if it is of the appropriate PageActive status. Pages which are being | 637 | * if it is of the appropriate PageActive status. Pages which are being |
638 | * freed elsewhere are also ignored. | 638 | * freed elsewhere are also ignored. |
639 | * | 639 | * |
640 | * page: page to consider | 640 | * page: page to consider |
641 | * mode: one of the LRU isolation modes defined above | 641 | * mode: one of the LRU isolation modes defined above |
642 | * | 642 | * |
643 | * returns 0 on success, -ve errno on failure. | 643 | * returns 0 on success, -ve errno on failure. |
644 | */ | 644 | */ |
645 | int __isolate_lru_page(struct page *page, int mode) | 645 | int __isolate_lru_page(struct page *page, int mode) |
646 | { | 646 | { |
647 | int ret = -EINVAL; | 647 | int ret = -EINVAL; |
648 | 648 | ||
649 | /* Only take pages on the LRU. */ | 649 | /* Only take pages on the LRU. */ |
650 | if (!PageLRU(page)) | 650 | if (!PageLRU(page)) |
651 | return ret; | 651 | return ret; |
652 | 652 | ||
653 | /* | 653 | /* |
654 | * When checking the active state, we need to be sure we are | 654 | * When checking the active state, we need to be sure we are |
655 | * dealing with comparable boolean values. Take the logical not | 655 | * dealing with comparable boolean values. Take the logical not |
656 | * of each. | 656 | * of each. |
657 | */ | 657 | */ |
658 | if (mode != ISOLATE_BOTH && (!PageActive(page) != !mode)) | 658 | if (mode != ISOLATE_BOTH && (!PageActive(page) != !mode)) |
659 | return ret; | 659 | return ret; |
660 | 660 | ||
661 | ret = -EBUSY; | 661 | ret = -EBUSY; |
662 | if (likely(get_page_unless_zero(page))) { | 662 | if (likely(get_page_unless_zero(page))) { |
663 | /* | 663 | /* |
664 | * Be careful not to clear PageLRU until after we're | 664 | * Be careful not to clear PageLRU until after we're |
665 | * sure the page is not being freed elsewhere -- the | 665 | * sure the page is not being freed elsewhere -- the |
666 | * page release code relies on it. | 666 | * page release code relies on it. |
667 | */ | 667 | */ |
668 | ClearPageLRU(page); | 668 | ClearPageLRU(page); |
669 | ret = 0; | 669 | ret = 0; |
670 | } | 670 | } |
671 | 671 | ||
672 | return ret; | 672 | return ret; |
673 | } | 673 | } |
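
As a side note on the mode check above: PageActive() may return any nonzero bit value, so both sides are normalized with logical not before comparing against the 0/1 isolation mode. A minimal userspace sketch of that idiom follows (editorial illustration only; the stand-in flag bit value is hypothetical).

    /*
     * Userspace sketch of the "!flag != !mode" normalization.
     * Build: cc -o isolate_mode isolate_mode.c
     */
    #include <stdio.h>

    #define ISOLATE_INACTIVE 0
    #define ISOLATE_ACTIVE   1
    #define ISOLATE_BOTH     2

    /* stand-in for PageActive(): returns the raw flag bit, not 0/1 */
    static int page_active_raw(unsigned long flags)
    {
            return flags & 0x40;	/* hypothetical PG_active bit */
    }

    static int skip_page(unsigned long flags, int mode)
    {
            /* 0 = mode matches this page, 1 = skip it */
            if (mode == ISOLATE_BOTH)
                    return 0;
            return !page_active_raw(flags) != !mode;
    }

    int main(void)
    {
            unsigned long active_page = 0x40, inactive_page = 0x0;

            printf("active page,   ISOLATE_ACTIVE   -> skip=%d\n",
                   skip_page(active_page, ISOLATE_ACTIVE));	/* 0: take it */
            printf("active page,   ISOLATE_INACTIVE -> skip=%d\n",
                   skip_page(active_page, ISOLATE_INACTIVE));	/* 1: skip */
            printf("inactive page, ISOLATE_BOTH     -> skip=%d\n",
                   skip_page(inactive_page, ISOLATE_BOTH));	/* 0: take it */
            return 0;
    }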
674 | 674 | ||
675 | /* | 675 | /* |
676 | * zone->lru_lock is heavily contended. Some of the functions that | 676 | * zone->lru_lock is heavily contended. Some of the functions that |
677 | * shrink the lists perform better by taking out a batch of pages | 677 | * shrink the lists perform better by taking out a batch of pages |
678 | * and working on them outside the LRU lock. | 678 | * and working on them outside the LRU lock. |
679 | * | 679 | * |
680 | * For pagecache intensive workloads, this function is the hottest | 680 | * For pagecache intensive workloads, this function is the hottest |
681 | * spot in the kernel (apart from copy_*_user functions). | 681 | * spot in the kernel (apart from copy_*_user functions). |
682 | * | 682 | * |
683 | * Appropriate locks must be held before calling this function. | 683 | * Appropriate locks must be held before calling this function. |
684 | * | 684 | * |
685 | * @nr_to_scan: The number of pages to look through on the list. | 685 | * @nr_to_scan: The number of pages to look through on the list. |
686 | * @src: The LRU list to pull pages off. | 686 | * @src: The LRU list to pull pages off. |
687 | * @dst: The temp list to put pages on to. | 687 | * @dst: The temp list to put pages on to. |
688 | * @scanned: The number of pages that were scanned. | 688 | * @scanned: The number of pages that were scanned. |
689 | * @order: The caller's attempted allocation order | 689 | * @order: The caller's attempted allocation order |
690 | * @mode: One of the LRU isolation modes | 690 | * @mode: One of the LRU isolation modes |
691 | * | 691 | * |
692 | * returns how many pages were moved onto *@dst. | 692 | * returns how many pages were moved onto *@dst. |
693 | */ | 693 | */ |
694 | static unsigned long isolate_lru_pages(unsigned long nr_to_scan, | 694 | static unsigned long isolate_lru_pages(unsigned long nr_to_scan, |
695 | struct list_head *src, struct list_head *dst, | 695 | struct list_head *src, struct list_head *dst, |
696 | unsigned long *scanned, int order, int mode) | 696 | unsigned long *scanned, int order, int mode) |
697 | { | 697 | { |
698 | unsigned long nr_taken = 0; | 698 | unsigned long nr_taken = 0; |
699 | unsigned long scan; | 699 | unsigned long scan; |
700 | 700 | ||
701 | for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) { | 701 | for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) { |
702 | struct page *page; | 702 | struct page *page; |
703 | unsigned long pfn; | 703 | unsigned long pfn; |
704 | unsigned long end_pfn; | 704 | unsigned long end_pfn; |
705 | unsigned long page_pfn; | 705 | unsigned long page_pfn; |
706 | int zone_id; | 706 | int zone_id; |
707 | 707 | ||
708 | page = lru_to_page(src); | 708 | page = lru_to_page(src); |
709 | prefetchw_prev_lru_page(page, src, flags); | 709 | prefetchw_prev_lru_page(page, src, flags); |
710 | 710 | ||
711 | VM_BUG_ON(!PageLRU(page)); | 711 | VM_BUG_ON(!PageLRU(page)); |
712 | 712 | ||
713 | switch (__isolate_lru_page(page, mode)) { | 713 | switch (__isolate_lru_page(page, mode)) { |
714 | case 0: | 714 | case 0: |
715 | list_move(&page->lru, dst); | 715 | list_move(&page->lru, dst); |
716 | nr_taken++; | 716 | nr_taken++; |
717 | break; | 717 | break; |
718 | 718 | ||
719 | case -EBUSY: | 719 | case -EBUSY: |
720 | /* else it is being freed elsewhere */ | 720 | /* else it is being freed elsewhere */ |
721 | list_move(&page->lru, src); | 721 | list_move(&page->lru, src); |
722 | continue; | 722 | continue; |
723 | 723 | ||
724 | default: | 724 | default: |
725 | BUG(); | 725 | BUG(); |
726 | } | 726 | } |
727 | 727 | ||
728 | if (!order) | 728 | if (!order) |
729 | continue; | 729 | continue; |
730 | 730 | ||
731 | /* | 731 | /* |
732 | * Attempt to take all pages in the order aligned region | 732 | * Attempt to take all pages in the order aligned region |
733 | * surrounding the tag page. Only take those pages of | 733 | * surrounding the tag page. Only take those pages of |
734 | * the same active state as that tag page. We may safely | 734 | * the same active state as that tag page. We may safely |
735 | * round the target page pfn down to the requested order | 735 | * round the target page pfn down to the requested order |
736 | * as the mem_map is guaranteed valid out to MAX_ORDER, | 736 | * as the mem_map is guaranteed valid out to MAX_ORDER, |
737 | * where that page is in a different zone we will detect | 737 | * where that page is in a different zone we will detect |
738 | * it from its zone id and abort this block scan. | 738 | * it from its zone id and abort this block scan. |
739 | */ | 739 | */ |
740 | zone_id = page_zone_id(page); | 740 | zone_id = page_zone_id(page); |
741 | page_pfn = page_to_pfn(page); | 741 | page_pfn = page_to_pfn(page); |
742 | pfn = page_pfn & ~((1 << order) - 1); | 742 | pfn = page_pfn & ~((1 << order) - 1); |
743 | end_pfn = pfn + (1 << order); | 743 | end_pfn = pfn + (1 << order); |
744 | for (; pfn < end_pfn; pfn++) { | 744 | for (; pfn < end_pfn; pfn++) { |
745 | struct page *cursor_page; | 745 | struct page *cursor_page; |
746 | 746 | ||
747 | /* The target page is in the block, ignore it. */ | 747 | /* The target page is in the block, ignore it. */ |
748 | if (unlikely(pfn == page_pfn)) | 748 | if (unlikely(pfn == page_pfn)) |
749 | continue; | 749 | continue; |
750 | 750 | ||
751 | /* Avoid holes within the zone. */ | 751 | /* Avoid holes within the zone. */ |
752 | if (unlikely(!pfn_valid_within(pfn))) | 752 | if (unlikely(!pfn_valid_within(pfn))) |
753 | break; | 753 | break; |
754 | 754 | ||
755 | cursor_page = pfn_to_page(pfn); | 755 | cursor_page = pfn_to_page(pfn); |
756 | /* Check that we have not crossed a zone boundary. */ | 756 | /* Check that we have not crossed a zone boundary. */ |
757 | if (unlikely(page_zone_id(cursor_page) != zone_id)) | 757 | if (unlikely(page_zone_id(cursor_page) != zone_id)) |
758 | continue; | 758 | continue; |
759 | switch (__isolate_lru_page(cursor_page, mode)) { | 759 | switch (__isolate_lru_page(cursor_page, mode)) { |
760 | case 0: | 760 | case 0: |
761 | list_move(&cursor_page->lru, dst); | 761 | list_move(&cursor_page->lru, dst); |
762 | nr_taken++; | 762 | nr_taken++; |
763 | scan++; | 763 | scan++; |
764 | break; | 764 | break; |
765 | 765 | ||
766 | case -EBUSY: | 766 | case -EBUSY: |
767 | /* else it is being freed elsewhere */ | 767 | /* else it is being freed elsewhere */ |
768 | list_move(&cursor_page->lru, src); | 768 | list_move(&cursor_page->lru, src); |
769 | default: | 769 | default: |
770 | break; | 770 | break; |
771 | } | 771 | } |
772 | } | 772 | } |
773 | } | 773 | } |
774 | 774 | ||
775 | *scanned = scan; | 775 | *scanned = scan; |
776 | return nr_taken; | 776 | return nr_taken; |
777 | } | 777 | } |
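
The lumpy block scan above rounds the tag page's pfn down to the caller's allocation order and then walks the naturally aligned 1 << order pfn range around it. A minimal userspace sketch of that alignment arithmetic (editorial illustration, not part of the patch; the example pfn and order are arbitrary):

    /*
     * Userspace sketch of the order-aligned pfn block used by lumpy isolation.
     * Build: cc -o lumpy_block lumpy_block.c
     */
    #include <stdio.h>

    int main(void)
    {
            unsigned long page_pfn = 0x12345;	/* arbitrary example tag pfn */
            int order = 9;				/* e.g. a 2MB hugepage with 4K base pages */
            unsigned long pfn, start, end, scanned = 0;

            start = page_pfn & ~((1UL << order) - 1);
            end = start + (1UL << order);

            printf("tag pfn %#lx -> block [%#lx, %#lx), %lu pages\n",
                   page_pfn, start, end, end - start);

            /* every pfn in the block except the tag page itself gets looked at */
            for (pfn = start; pfn < end; pfn++)
                    if (pfn != page_pfn)
                            scanned++;	/* the kernel would try to isolate pfn_to_page(pfn) here */

            printf("%lu neighbour pfns considered around the tag page\n", scanned);
            return 0;
    }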
778 | 778 | ||
779 | static unsigned long isolate_pages_global(unsigned long nr, | 779 | static unsigned long isolate_pages_global(unsigned long nr, |
780 | struct list_head *dst, | 780 | struct list_head *dst, |
781 | unsigned long *scanned, int order, | 781 | unsigned long *scanned, int order, |
782 | int mode, struct zone *z, | 782 | int mode, struct zone *z, |
783 | struct mem_cgroup *mem_cont, | 783 | struct mem_cgroup *mem_cont, |
784 | int active) | 784 | int active) |
785 | { | 785 | { |
786 | if (active) | 786 | if (active) |
787 | return isolate_lru_pages(nr, &z->active_list, dst, | 787 | return isolate_lru_pages(nr, &z->active_list, dst, |
788 | scanned, order, mode); | 788 | scanned, order, mode); |
789 | else | 789 | else |
790 | return isolate_lru_pages(nr, &z->inactive_list, dst, | 790 | return isolate_lru_pages(nr, &z->inactive_list, dst, |
791 | scanned, order, mode); | 791 | scanned, order, mode); |
792 | } | 792 | } |
793 | 793 | ||
794 | /* | 794 | /* |
795 | * clear_active_flags() is a helper for shrink_active_list(), clearing | 795 | * clear_active_flags() is a helper for shrink_active_list(), clearing |
796 | * any active bits from the pages in the list. | 796 | * any active bits from the pages in the list. |
797 | */ | 797 | */ |
798 | static unsigned long clear_active_flags(struct list_head *page_list) | 798 | static unsigned long clear_active_flags(struct list_head *page_list) |
799 | { | 799 | { |
800 | int nr_active = 0; | 800 | int nr_active = 0; |
801 | struct page *page; | 801 | struct page *page; |
802 | 802 | ||
803 | list_for_each_entry(page, page_list, lru) | 803 | list_for_each_entry(page, page_list, lru) |
804 | if (PageActive(page)) { | 804 | if (PageActive(page)) { |
805 | ClearPageActive(page); | 805 | ClearPageActive(page); |
806 | nr_active++; | 806 | nr_active++; |
807 | } | 807 | } |
808 | 808 | ||
809 | return nr_active; | 809 | return nr_active; |
810 | } | 810 | } |
811 | 811 | ||
812 | /* | 812 | /* |
813 | * shrink_inactive_list() is a helper for shrink_zone(). It returns the number | 813 | * shrink_inactive_list() is a helper for shrink_zone(). It returns the number |
814 | * of reclaimed pages | 814 | * of reclaimed pages |
815 | */ | 815 | */ |
816 | static unsigned long shrink_inactive_list(unsigned long max_scan, | 816 | static unsigned long shrink_inactive_list(unsigned long max_scan, |
817 | struct zone *zone, struct scan_control *sc) | 817 | struct zone *zone, struct scan_control *sc) |
818 | { | 818 | { |
819 | LIST_HEAD(page_list); | 819 | LIST_HEAD(page_list); |
820 | struct pagevec pvec; | 820 | struct pagevec pvec; |
821 | unsigned long nr_scanned = 0; | 821 | unsigned long nr_scanned = 0; |
822 | unsigned long nr_reclaimed = 0; | 822 | unsigned long nr_reclaimed = 0; |
823 | 823 | ||
824 | pagevec_init(&pvec, 1); | 824 | pagevec_init(&pvec, 1); |
825 | 825 | ||
826 | lru_add_drain(); | 826 | lru_add_drain(); |
827 | spin_lock_irq(&zone->lru_lock); | 827 | spin_lock_irq(&zone->lru_lock); |
828 | do { | 828 | do { |
829 | struct page *page; | 829 | struct page *page; |
830 | unsigned long nr_taken; | 830 | unsigned long nr_taken; |
831 | unsigned long nr_scan; | 831 | unsigned long nr_scan; |
832 | unsigned long nr_freed; | 832 | unsigned long nr_freed; |
833 | unsigned long nr_active; | 833 | unsigned long nr_active; |
834 | 834 | ||
835 | nr_taken = sc->isolate_pages(sc->swap_cluster_max, | 835 | nr_taken = sc->isolate_pages(sc->swap_cluster_max, |
836 | &page_list, &nr_scan, sc->order, | 836 | &page_list, &nr_scan, sc->order, |
837 | (sc->order > PAGE_ALLOC_COSTLY_ORDER)? | 837 | (sc->order > PAGE_ALLOC_COSTLY_ORDER)? |
838 | ISOLATE_BOTH : ISOLATE_INACTIVE, | 838 | ISOLATE_BOTH : ISOLATE_INACTIVE, |
839 | zone, sc->mem_cgroup, 0); | 839 | zone, sc->mem_cgroup, 0); |
840 | nr_active = clear_active_flags(&page_list); | 840 | nr_active = clear_active_flags(&page_list); |
841 | __count_vm_events(PGDEACTIVATE, nr_active); | 841 | __count_vm_events(PGDEACTIVATE, nr_active); |
842 | 842 | ||
843 | __mod_zone_page_state(zone, NR_ACTIVE, -nr_active); | 843 | __mod_zone_page_state(zone, NR_ACTIVE, -nr_active); |
844 | __mod_zone_page_state(zone, NR_INACTIVE, | 844 | __mod_zone_page_state(zone, NR_INACTIVE, |
845 | -(nr_taken - nr_active)); | 845 | -(nr_taken - nr_active)); |
846 | if (scan_global_lru(sc)) | 846 | if (scan_global_lru(sc)) |
847 | zone->pages_scanned += nr_scan; | 847 | zone->pages_scanned += nr_scan; |
848 | spin_unlock_irq(&zone->lru_lock); | 848 | spin_unlock_irq(&zone->lru_lock); |
849 | 849 | ||
850 | nr_scanned += nr_scan; | 850 | nr_scanned += nr_scan; |
851 | nr_freed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC); | 851 | nr_freed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC); |
852 | 852 | ||
853 | /* | 853 | /* |
854 | * If we are direct reclaiming for contiguous pages and we do | 854 | * If we are direct reclaiming for contiguous pages and we do |
855 | * not reclaim everything in the list, try again and wait | 855 | * not reclaim everything in the list, try again and wait |
856 | * for IO to complete. This will stall high-order allocations | 856 | * for IO to complete. This will stall high-order allocations |
857 | * but that should be acceptable to the caller | 857 | * but that should be acceptable to the caller |
858 | */ | 858 | */ |
859 | if (nr_freed < nr_taken && !current_is_kswapd() && | 859 | if (nr_freed < nr_taken && !current_is_kswapd() && |
860 | sc->order > PAGE_ALLOC_COSTLY_ORDER) { | 860 | sc->order > PAGE_ALLOC_COSTLY_ORDER) { |
861 | congestion_wait(WRITE, HZ/10); | 861 | congestion_wait(WRITE, HZ/10); |
862 | 862 | ||
863 | /* | 863 | /* |
864 | * The attempt at page out may have made some | 864 | * The attempt at page out may have made some |
865 | * of the pages active, mark them inactive again. | 865 | * of the pages active, mark them inactive again. |
866 | */ | 866 | */ |
867 | nr_active = clear_active_flags(&page_list); | 867 | nr_active = clear_active_flags(&page_list); |
868 | count_vm_events(PGDEACTIVATE, nr_active); | 868 | count_vm_events(PGDEACTIVATE, nr_active); |
869 | 869 | ||
870 | nr_freed += shrink_page_list(&page_list, sc, | 870 | nr_freed += shrink_page_list(&page_list, sc, |
871 | PAGEOUT_IO_SYNC); | 871 | PAGEOUT_IO_SYNC); |
872 | } | 872 | } |
873 | 873 | ||
874 | nr_reclaimed += nr_freed; | 874 | nr_reclaimed += nr_freed; |
875 | local_irq_disable(); | 875 | local_irq_disable(); |
876 | if (current_is_kswapd()) { | 876 | if (current_is_kswapd()) { |
877 | __count_zone_vm_events(PGSCAN_KSWAPD, zone, nr_scan); | 877 | __count_zone_vm_events(PGSCAN_KSWAPD, zone, nr_scan); |
878 | __count_vm_events(KSWAPD_STEAL, nr_freed); | 878 | __count_vm_events(KSWAPD_STEAL, nr_freed); |
879 | } else if (scan_global_lru(sc)) | 879 | } else if (scan_global_lru(sc)) |
880 | __count_zone_vm_events(PGSCAN_DIRECT, zone, nr_scan); | 880 | __count_zone_vm_events(PGSCAN_DIRECT, zone, nr_scan); |
881 | 881 | ||
882 | __count_zone_vm_events(PGSTEAL, zone, nr_freed); | 882 | __count_zone_vm_events(PGSTEAL, zone, nr_freed); |
883 | 883 | ||
884 | if (nr_taken == 0) | 884 | if (nr_taken == 0) |
885 | goto done; | 885 | goto done; |
886 | 886 | ||
887 | spin_lock(&zone->lru_lock); | 887 | spin_lock(&zone->lru_lock); |
888 | /* | 888 | /* |
889 | * Put back any unfreeable pages. | 889 | * Put back any unfreeable pages. |
890 | */ | 890 | */ |
891 | while (!list_empty(&page_list)) { | 891 | while (!list_empty(&page_list)) { |
892 | page = lru_to_page(&page_list); | 892 | page = lru_to_page(&page_list); |
893 | VM_BUG_ON(PageLRU(page)); | 893 | VM_BUG_ON(PageLRU(page)); |
894 | SetPageLRU(page); | 894 | SetPageLRU(page); |
895 | list_del(&page->lru); | 895 | list_del(&page->lru); |
896 | if (PageActive(page)) | 896 | if (PageActive(page)) |
897 | add_page_to_active_list(zone, page); | 897 | add_page_to_active_list(zone, page); |
898 | else | 898 | else |
899 | add_page_to_inactive_list(zone, page); | 899 | add_page_to_inactive_list(zone, page); |
900 | if (!pagevec_add(&pvec, page)) { | 900 | if (!pagevec_add(&pvec, page)) { |
901 | spin_unlock_irq(&zone->lru_lock); | 901 | spin_unlock_irq(&zone->lru_lock); |
902 | __pagevec_release(&pvec); | 902 | __pagevec_release(&pvec); |
903 | spin_lock_irq(&zone->lru_lock); | 903 | spin_lock_irq(&zone->lru_lock); |
904 | } | 904 | } |
905 | } | 905 | } |
906 | } while (nr_scanned < max_scan); | 906 | } while (nr_scanned < max_scan); |
907 | spin_unlock(&zone->lru_lock); | 907 | spin_unlock(&zone->lru_lock); |
908 | done: | 908 | done: |
909 | local_irq_enable(); | 909 | local_irq_enable(); |
910 | pagevec_release(&pvec); | 910 | pagevec_release(&pvec); |
911 | return nr_reclaimed; | 911 | return nr_reclaimed; |
912 | } | 912 | } |
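
For direct reclaim of costly orders, a partial first pass over the isolated pages triggers a congestion wait and a second, synchronous shrink_page_list() pass; this is the stalling behaviour the changelog's __GFP_REPEAT retry logic relies on. A minimal userspace sketch of just that decision follows (editorial illustration only; PAGE_ALLOC_COSTLY_ORDER is assumed to be 3, as in this kernel).

    /*
     * Userspace sketch of the "retry synchronously for costly orders" check.
     * Build: cc -o sync_retry sync_retry.c
     */
    #include <stdio.h>

    #define PAGE_ALLOC_COSTLY_ORDER 3	/* assumed value */

    static int should_retry_sync(unsigned long nr_freed, unsigned long nr_taken,
                                 int is_kswapd, int order)
    {
            return nr_freed < nr_taken && !is_kswapd &&
                   order > PAGE_ALLOC_COSTLY_ORDER;
    }

    int main(void)
    {
            /* direct reclaim for a hugepage-sized request that only freed half */
            printf("order-9 direct reclaim, 16/32 freed -> sync retry: %d\n",
                   should_retry_sync(16, 32, 0, 9));
            /* kswapd never stalls on synchronous writeback here */
            printf("order-9 kswapd,         16/32 freed -> sync retry: %d\n",
                   should_retry_sync(16, 32, 1, 9));
            return 0;
    }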
913 | 913 | ||
914 | /* | 914 | /* |
915 | * We are about to scan this zone at a certain priority level. If that priority | 915 | * We are about to scan this zone at a certain priority level. If that priority |
916 | * level is smaller (ie: more urgent) than the previous priority, then note | 916 | * level is smaller (ie: more urgent) than the previous priority, then note |
917 | * that priority level within the zone. This is done so that when the next | 917 | * that priority level within the zone. This is done so that when the next |
918 | * process comes in to scan this zone, it will immediately start out at this | 918 | * process comes in to scan this zone, it will immediately start out at this |
919 | * priority level rather than having to build up its own scanning priority. | 919 | * priority level rather than having to build up its own scanning priority. |
920 | * Here, this priority affects only the reclaim-mapped threshold. | 920 | * Here, this priority affects only the reclaim-mapped threshold. |
921 | */ | 921 | */ |
922 | static inline void note_zone_scanning_priority(struct zone *zone, int priority) | 922 | static inline void note_zone_scanning_priority(struct zone *zone, int priority) |
923 | { | 923 | { |
924 | if (priority < zone->prev_priority) | 924 | if (priority < zone->prev_priority) |
925 | zone->prev_priority = priority; | 925 | zone->prev_priority = priority; |
926 | } | 926 | } |
927 | 927 | ||
928 | static inline int zone_is_near_oom(struct zone *zone) | 928 | static inline int zone_is_near_oom(struct zone *zone) |
929 | { | 929 | { |
930 | return zone->pages_scanned >= (zone_page_state(zone, NR_ACTIVE) | 930 | return zone->pages_scanned >= (zone_page_state(zone, NR_ACTIVE) |
931 | + zone_page_state(zone, NR_INACTIVE))*3; | 931 | + zone_page_state(zone, NR_INACTIVE))*3; |
932 | } | 932 | } |
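
In other words, a zone counts as near OOM once it has been scanned three times over relative to its total LRU size. A tiny sketch with example numbers (editorial illustration only):

    /*
     * Userspace sketch of the near-OOM heuristic above.
     * Build: cc -o near_oom near_oom.c
     */
    #include <stdio.h>

    static int zone_is_near_oom(unsigned long pages_scanned,
                                unsigned long nr_active, unsigned long nr_inactive)
    {
            return pages_scanned >= (nr_active + nr_inactive) * 3;
    }

    int main(void)
    {
            /* 100000 LRU pages, scanned 350000 times in total -> near oom */
            printf("near oom: %d\n", zone_is_near_oom(350000, 60000, 40000));
            return 0;
    }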
933 | 933 | ||
934 | /* | 934 | /* |
935 | * Determine whether we should try to reclaim mapped pages. | 935 | * Determine whether we should try to reclaim mapped pages. |
936 | * This is called only when sc->mem_cgroup is NULL. | 936 | * This is called only when sc->mem_cgroup is NULL. |
937 | */ | 937 | */ |
938 | static int calc_reclaim_mapped(struct scan_control *sc, struct zone *zone, | 938 | static int calc_reclaim_mapped(struct scan_control *sc, struct zone *zone, |
939 | int priority) | 939 | int priority) |
940 | { | 940 | { |
941 | long mapped_ratio; | 941 | long mapped_ratio; |
942 | long distress; | 942 | long distress; |
943 | long swap_tendency; | 943 | long swap_tendency; |
944 | long imbalance; | 944 | long imbalance; |
945 | int reclaim_mapped = 0; | 945 | int reclaim_mapped = 0; |
946 | int prev_priority; | 946 | int prev_priority; |
947 | 947 | ||
948 | if (scan_global_lru(sc) && zone_is_near_oom(zone)) | 948 | if (scan_global_lru(sc) && zone_is_near_oom(zone)) |
949 | return 1; | 949 | return 1; |
950 | /* | 950 | /* |
951 | * `distress' is a measure of how much trouble we're having | 951 | * `distress' is a measure of how much trouble we're having |
952 | * reclaiming pages. 0 -> no problems. 100 -> great trouble. | 952 | * reclaiming pages. 0 -> no problems. 100 -> great trouble. |
953 | */ | 953 | */ |
954 | if (scan_global_lru(sc)) | 954 | if (scan_global_lru(sc)) |
955 | prev_priority = zone->prev_priority; | 955 | prev_priority = zone->prev_priority; |
956 | else | 956 | else |
957 | prev_priority = mem_cgroup_get_reclaim_priority(sc->mem_cgroup); | 957 | prev_priority = mem_cgroup_get_reclaim_priority(sc->mem_cgroup); |
958 | 958 | ||
959 | distress = 100 >> min(prev_priority, priority); | 959 | distress = 100 >> min(prev_priority, priority); |
960 | 960 | ||
961 | /* | 961 | /* |
962 | * The point of this algorithm is to decide when to start | 962 | * The point of this algorithm is to decide when to start |
963 | * reclaiming mapped memory instead of just pagecache. Work out | 963 | * reclaiming mapped memory instead of just pagecache. Work out |
964 | * how much memory | 964 | * how much memory |
965 | * is mapped. | 965 | * is mapped. |
966 | */ | 966 | */ |
967 | if (scan_global_lru(sc)) | 967 | if (scan_global_lru(sc)) |
968 | mapped_ratio = ((global_page_state(NR_FILE_MAPPED) + | 968 | mapped_ratio = ((global_page_state(NR_FILE_MAPPED) + |
969 | global_page_state(NR_ANON_PAGES)) * 100) / | 969 | global_page_state(NR_ANON_PAGES)) * 100) / |
970 | vm_total_pages; | 970 | vm_total_pages; |
971 | else | 971 | else |
972 | mapped_ratio = mem_cgroup_calc_mapped_ratio(sc->mem_cgroup); | 972 | mapped_ratio = mem_cgroup_calc_mapped_ratio(sc->mem_cgroup); |
973 | 973 | ||
974 | /* | 974 | /* |
975 | * Now decide how much we really want to unmap some pages. The | 975 | * Now decide how much we really want to unmap some pages. The |
976 | * mapped ratio is downgraded - just because there's a lot of | 976 | * mapped ratio is downgraded - just because there's a lot of |
977 | * mapped memory doesn't necessarily mean that page reclaim | 977 | * mapped memory doesn't necessarily mean that page reclaim |
978 | * isn't succeeding. | 978 | * isn't succeeding. |
979 | * | 979 | * |
980 | * The distress ratio is important - we don't want to start | 980 | * The distress ratio is important - we don't want to start |
981 | * going oom. | 981 | * going oom. |
982 | * | 982 | * |
983 | * A 100% value of vm_swappiness overrides this algorithm | 983 | * A 100% value of vm_swappiness overrides this algorithm |
984 | * altogether. | 984 | * altogether. |
985 | */ | 985 | */ |
986 | swap_tendency = mapped_ratio / 2 + distress + sc->swappiness; | 986 | swap_tendency = mapped_ratio / 2 + distress + sc->swappiness; |
987 | 987 | ||
988 | /* | 988 | /* |
989 | * If there's huge imbalance between active and inactive | 989 | * If there's huge imbalance between active and inactive |
990 | * (think active 100 times larger than inactive) we should | 990 | * (think active 100 times larger than inactive) we should |
991 | * become more permissive, or the system will take too much | 991 | * become more permissive, or the system will take too much |
992 | * cpu before it starts swapping during memory pressure. | 992 | * cpu before it starts swapping during memory pressure. |
993 | * Distress is about avoiding early-oom, this is about | 993 | * Distress is about avoiding early-oom, this is about |
994 | * making swappiness graceful despite setting it to low | 994 | * making swappiness graceful despite setting it to low |
995 | * values. | 995 | * values. |
996 | * | 996 | * |
997 | * Avoid div by zero with nr_inactive+1, and max resulting | 997 | * Avoid div by zero with nr_inactive+1, and max resulting |
998 | * value is vm_total_pages. | 998 | * value is vm_total_pages. |
999 | */ | 999 | */ |
1000 | if (scan_global_lru(sc)) { | 1000 | if (scan_global_lru(sc)) { |
1001 | imbalance = zone_page_state(zone, NR_ACTIVE); | 1001 | imbalance = zone_page_state(zone, NR_ACTIVE); |
1002 | imbalance /= zone_page_state(zone, NR_INACTIVE) + 1; | 1002 | imbalance /= zone_page_state(zone, NR_INACTIVE) + 1; |
1003 | } else | 1003 | } else |
1004 | imbalance = mem_cgroup_reclaim_imbalance(sc->mem_cgroup); | 1004 | imbalance = mem_cgroup_reclaim_imbalance(sc->mem_cgroup); |
1005 | 1005 | ||
1006 | /* | 1006 | /* |
1007 | * Reduce the effect of imbalance if swappiness is low, | 1007 | * Reduce the effect of imbalance if swappiness is low, |
1008 | * this means for a swappiness very low, the imbalance | 1008 | * this means for a swappiness very low, the imbalance |
1009 | * must be much higher than 100 for this logic to make | 1009 | * must be much higher than 100 for this logic to make |
1010 | * the difference. | 1010 | * the difference. |
1011 | * | 1011 | * |
1012 | * Max temporary value is vm_total_pages*100. | 1012 | * Max temporary value is vm_total_pages*100. |
1013 | */ | 1013 | */ |
1014 | imbalance *= (vm_swappiness + 1); | 1014 | imbalance *= (vm_swappiness + 1); |
1015 | imbalance /= 100; | 1015 | imbalance /= 100; |
1016 | 1016 | ||
1017 | /* | 1017 | /* |
1018 | * If not much of the RAM is mapped, make the imbalance | 1018 | * If not much of the RAM is mapped, make the imbalance |
1019 | * less relevant; it's only a high priority to refill the inactive | 1019 | * less relevant; it's only a high priority to refill the inactive |
1020 | * list with mapped pages in the presence of a high ratio of | 1020 | * list with mapped pages in the presence of a high ratio of |
1021 | * mapped pages. | 1021 | * mapped pages. |
1022 | * | 1022 | * |
1023 | * Max temporary value is vm_total_pages*100. | 1023 | * Max temporary value is vm_total_pages*100. |
1024 | */ | 1024 | */ |
1025 | imbalance *= mapped_ratio; | 1025 | imbalance *= mapped_ratio; |
1026 | imbalance /= 100; | 1026 | imbalance /= 100; |
1027 | 1027 | ||
1028 | /* apply imbalance feedback to swap_tendency */ | 1028 | /* apply imbalance feedback to swap_tendency */ |
1029 | swap_tendency += imbalance; | 1029 | swap_tendency += imbalance; |
1030 | 1030 | ||
1031 | /* | 1031 | /* |
1032 | * Now use this metric to decide whether to start moving mapped | 1032 | * Now use this metric to decide whether to start moving mapped |
1033 | * memory onto the inactive list. | 1033 | * memory onto the inactive list. |
1034 | */ | 1034 | */ |
1035 | if (swap_tendency >= 100) | 1035 | if (swap_tendency >= 100) |
1036 | reclaim_mapped = 1; | 1036 | reclaim_mapped = 1; |
1037 | 1037 | ||
1038 | return reclaim_mapped; | 1038 | return reclaim_mapped; |
1039 | } | 1039 | } |
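
Putting the pieces above together: distress = 100 >> min(prev_priority, priority), swap_tendency = mapped_ratio / 2 + distress + swappiness, plus an imbalance term scaled down by swappiness and by the mapped ratio; mapped pages become reclaim candidates once swap_tendency reaches 100. The following userspace sketch reruns that calculation with example numbers (editorial illustration only; the kernel uses the global vm_swappiness for the imbalance scaling, which the sketch assumes equals sc->swappiness).

    /*
     * Userspace sketch of the swap_tendency calculation, with example numbers.
     * Build: cc -o swap_tendency swap_tendency.c
     */
    #include <stdio.h>

    static int min_int(int a, int b) { return a < b ? a : b; }

    static int calc_reclaim_mapped(int prev_priority, int priority,
                                   long mapped_ratio,	/* percent of RAM mapped */
                                   long active, long inactive,
                                   int swappiness)	/* assumed == vm_swappiness */
    {
            long distress, swap_tendency, imbalance;

            distress = 100 >> min_int(prev_priority, priority);
            swap_tendency = mapped_ratio / 2 + distress + swappiness;

            /* active/inactive imbalance, damped by swappiness and mapped ratio */
            imbalance = active / (inactive + 1);
            imbalance *= (swappiness + 1);
            imbalance /= 100;
            imbalance *= mapped_ratio;
            imbalance /= 100;
            swap_tendency += imbalance;

            return swap_tendency >= 100;
    }

    int main(void)
    {
            /* mild pressure: distress 0, tendency 20 + 0 + 60 = 80 -> no */
            printf("reclaim mapped: %d\n",
                   calc_reclaim_mapped(12, 10, 40, 50000, 50000, 60));
            /* heavy pressure: distress 25 pushes it to 105 -> yes */
            printf("reclaim mapped: %d\n",
                   calc_reclaim_mapped(2, 2, 40, 50000, 50000, 60));
            return 0;
    }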
1040 | 1040 | ||
1041 | /* | 1041 | /* |
1042 | * This moves pages from the active list to the inactive list. | 1042 | * This moves pages from the active list to the inactive list. |
1043 | * | 1043 | * |
1044 | * We move them the other way if the page is referenced by one or more | 1044 | * We move them the other way if the page is referenced by one or more |
1045 | * processes, from rmap. | 1045 | * processes, from rmap. |
1046 | * | 1046 | * |
1047 | * If the pages are mostly unmapped, the processing is fast and it is | 1047 | * If the pages are mostly unmapped, the processing is fast and it is |
1048 | * appropriate to hold zone->lru_lock across the whole operation. But if | 1048 | * appropriate to hold zone->lru_lock across the whole operation. But if |
1049 | * the pages are mapped, the processing is slow (page_referenced()) so we | 1049 | * the pages are mapped, the processing is slow (page_referenced()) so we |
1050 | * should drop zone->lru_lock around each page. It's impossible to balance | 1050 | * should drop zone->lru_lock around each page. It's impossible to balance |
1051 | * this, so instead we remove the pages from the LRU while processing them. | 1051 | * this, so instead we remove the pages from the LRU while processing them. |
1052 | * It is safe to rely on PG_active against the non-LRU pages in here because | 1052 | * It is safe to rely on PG_active against the non-LRU pages in here because |
1053 | * nobody will play with that bit on a non-LRU page. | 1053 | * nobody will play with that bit on a non-LRU page. |
1054 | * | 1054 | * |
1055 | * The downside is that we have to touch page->_count against each page. | 1055 | * The downside is that we have to touch page->_count against each page. |
1056 | * But we had to alter page->flags anyway. | 1056 | * But we had to alter page->flags anyway. |
1057 | */ | 1057 | */ |
1058 | 1058 | ||
1059 | 1059 | ||
1060 | static void shrink_active_list(unsigned long nr_pages, struct zone *zone, | 1060 | static void shrink_active_list(unsigned long nr_pages, struct zone *zone, |
1061 | struct scan_control *sc, int priority) | 1061 | struct scan_control *sc, int priority) |
1062 | { | 1062 | { |
1063 | unsigned long pgmoved; | 1063 | unsigned long pgmoved; |
1064 | int pgdeactivate = 0; | 1064 | int pgdeactivate = 0; |
1065 | unsigned long pgscanned; | 1065 | unsigned long pgscanned; |
1066 | LIST_HEAD(l_hold); /* The pages which were snipped off */ | 1066 | LIST_HEAD(l_hold); /* The pages which were snipped off */ |
1067 | LIST_HEAD(l_inactive); /* Pages to go onto the inactive_list */ | 1067 | LIST_HEAD(l_inactive); /* Pages to go onto the inactive_list */ |
1068 | LIST_HEAD(l_active); /* Pages to go onto the active_list */ | 1068 | LIST_HEAD(l_active); /* Pages to go onto the active_list */ |
1069 | struct page *page; | 1069 | struct page *page; |
1070 | struct pagevec pvec; | 1070 | struct pagevec pvec; |
1071 | int reclaim_mapped = 0; | 1071 | int reclaim_mapped = 0; |
1072 | 1072 | ||
1073 | if (sc->may_swap) | 1073 | if (sc->may_swap) |
1074 | reclaim_mapped = calc_reclaim_mapped(sc, zone, priority); | 1074 | reclaim_mapped = calc_reclaim_mapped(sc, zone, priority); |
1075 | 1075 | ||
1076 | lru_add_drain(); | 1076 | lru_add_drain(); |
1077 | spin_lock_irq(&zone->lru_lock); | 1077 | spin_lock_irq(&zone->lru_lock); |
1078 | pgmoved = sc->isolate_pages(nr_pages, &l_hold, &pgscanned, sc->order, | 1078 | pgmoved = sc->isolate_pages(nr_pages, &l_hold, &pgscanned, sc->order, |
1079 | ISOLATE_ACTIVE, zone, | 1079 | ISOLATE_ACTIVE, zone, |
1080 | sc->mem_cgroup, 1); | 1080 | sc->mem_cgroup, 1); |
1081 | /* | 1081 | /* |
1082 | * zone->pages_scanned is used to detect a zone's oom. | 1082 | * zone->pages_scanned is used to detect a zone's oom. |
1083 | * mem_cgroup remembers nr_scan by itself. | 1083 | * mem_cgroup remembers nr_scan by itself. |
1084 | */ | 1084 | */ |
1085 | if (scan_global_lru(sc)) | 1085 | if (scan_global_lru(sc)) |
1086 | zone->pages_scanned += pgscanned; | 1086 | zone->pages_scanned += pgscanned; |
1087 | 1087 | ||
1088 | __mod_zone_page_state(zone, NR_ACTIVE, -pgmoved); | 1088 | __mod_zone_page_state(zone, NR_ACTIVE, -pgmoved); |
1089 | spin_unlock_irq(&zone->lru_lock); | 1089 | spin_unlock_irq(&zone->lru_lock); |
1090 | 1090 | ||
1091 | while (!list_empty(&l_hold)) { | 1091 | while (!list_empty(&l_hold)) { |
1092 | cond_resched(); | 1092 | cond_resched(); |
1093 | page = lru_to_page(&l_hold); | 1093 | page = lru_to_page(&l_hold); |
1094 | list_del(&page->lru); | 1094 | list_del(&page->lru); |
1095 | if (page_mapped(page)) { | 1095 | if (page_mapped(page)) { |
1096 | if (!reclaim_mapped || | 1096 | if (!reclaim_mapped || |
1097 | (total_swap_pages == 0 && PageAnon(page)) || | 1097 | (total_swap_pages == 0 && PageAnon(page)) || |
1098 | page_referenced(page, 0, sc->mem_cgroup)) { | 1098 | page_referenced(page, 0, sc->mem_cgroup)) { |
1099 | list_add(&page->lru, &l_active); | 1099 | list_add(&page->lru, &l_active); |
1100 | continue; | 1100 | continue; |
1101 | } | 1101 | } |
1102 | } | 1102 | } |
1103 | list_add(&page->lru, &l_inactive); | 1103 | list_add(&page->lru, &l_inactive); |
1104 | } | 1104 | } |
1105 | 1105 | ||
1106 | pagevec_init(&pvec, 1); | 1106 | pagevec_init(&pvec, 1); |
1107 | pgmoved = 0; | 1107 | pgmoved = 0; |
1108 | spin_lock_irq(&zone->lru_lock); | 1108 | spin_lock_irq(&zone->lru_lock); |
1109 | while (!list_empty(&l_inactive)) { | 1109 | while (!list_empty(&l_inactive)) { |
1110 | page = lru_to_page(&l_inactive); | 1110 | page = lru_to_page(&l_inactive); |
1111 | prefetchw_prev_lru_page(page, &l_inactive, flags); | 1111 | prefetchw_prev_lru_page(page, &l_inactive, flags); |
1112 | VM_BUG_ON(PageLRU(page)); | 1112 | VM_BUG_ON(PageLRU(page)); |
1113 | SetPageLRU(page); | 1113 | SetPageLRU(page); |
1114 | VM_BUG_ON(!PageActive(page)); | 1114 | VM_BUG_ON(!PageActive(page)); |
1115 | ClearPageActive(page); | 1115 | ClearPageActive(page); |
1116 | 1116 | ||
1117 | list_move(&page->lru, &zone->inactive_list); | 1117 | list_move(&page->lru, &zone->inactive_list); |
1118 | mem_cgroup_move_lists(page, false); | 1118 | mem_cgroup_move_lists(page, false); |
1119 | pgmoved++; | 1119 | pgmoved++; |
1120 | if (!pagevec_add(&pvec, page)) { | 1120 | if (!pagevec_add(&pvec, page)) { |
1121 | __mod_zone_page_state(zone, NR_INACTIVE, pgmoved); | 1121 | __mod_zone_page_state(zone, NR_INACTIVE, pgmoved); |
1122 | spin_unlock_irq(&zone->lru_lock); | 1122 | spin_unlock_irq(&zone->lru_lock); |
1123 | pgdeactivate += pgmoved; | 1123 | pgdeactivate += pgmoved; |
1124 | pgmoved = 0; | 1124 | pgmoved = 0; |
1125 | if (buffer_heads_over_limit) | 1125 | if (buffer_heads_over_limit) |
1126 | pagevec_strip(&pvec); | 1126 | pagevec_strip(&pvec); |
1127 | __pagevec_release(&pvec); | 1127 | __pagevec_release(&pvec); |
1128 | spin_lock_irq(&zone->lru_lock); | 1128 | spin_lock_irq(&zone->lru_lock); |
1129 | } | 1129 | } |
1130 | } | 1130 | } |
1131 | __mod_zone_page_state(zone, NR_INACTIVE, pgmoved); | 1131 | __mod_zone_page_state(zone, NR_INACTIVE, pgmoved); |
1132 | pgdeactivate += pgmoved; | 1132 | pgdeactivate += pgmoved; |
1133 | if (buffer_heads_over_limit) { | 1133 | if (buffer_heads_over_limit) { |
1134 | spin_unlock_irq(&zone->lru_lock); | 1134 | spin_unlock_irq(&zone->lru_lock); |
1135 | pagevec_strip(&pvec); | 1135 | pagevec_strip(&pvec); |
1136 | spin_lock_irq(&zone->lru_lock); | 1136 | spin_lock_irq(&zone->lru_lock); |
1137 | } | 1137 | } |
1138 | 1138 | ||
1139 | pgmoved = 0; | 1139 | pgmoved = 0; |
1140 | while (!list_empty(&l_active)) { | 1140 | while (!list_empty(&l_active)) { |
1141 | page = lru_to_page(&l_active); | 1141 | page = lru_to_page(&l_active); |
1142 | prefetchw_prev_lru_page(page, &l_active, flags); | 1142 | prefetchw_prev_lru_page(page, &l_active, flags); |
1143 | VM_BUG_ON(PageLRU(page)); | 1143 | VM_BUG_ON(PageLRU(page)); |
1144 | SetPageLRU(page); | 1144 | SetPageLRU(page); |
1145 | VM_BUG_ON(!PageActive(page)); | 1145 | VM_BUG_ON(!PageActive(page)); |
1146 | 1146 | ||
1147 | list_move(&page->lru, &zone->active_list); | 1147 | list_move(&page->lru, &zone->active_list); |
1148 | mem_cgroup_move_lists(page, true); | 1148 | mem_cgroup_move_lists(page, true); |
1149 | pgmoved++; | 1149 | pgmoved++; |
1150 | if (!pagevec_add(&pvec, page)) { | 1150 | if (!pagevec_add(&pvec, page)) { |
1151 | __mod_zone_page_state(zone, NR_ACTIVE, pgmoved); | 1151 | __mod_zone_page_state(zone, NR_ACTIVE, pgmoved); |
1152 | pgmoved = 0; | 1152 | pgmoved = 0; |
1153 | spin_unlock_irq(&zone->lru_lock); | 1153 | spin_unlock_irq(&zone->lru_lock); |
1154 | __pagevec_release(&pvec); | 1154 | __pagevec_release(&pvec); |
1155 | spin_lock_irq(&zone->lru_lock); | 1155 | spin_lock_irq(&zone->lru_lock); |
1156 | } | 1156 | } |
1157 | } | 1157 | } |
1158 | __mod_zone_page_state(zone, NR_ACTIVE, pgmoved); | 1158 | __mod_zone_page_state(zone, NR_ACTIVE, pgmoved); |
1159 | 1159 | ||
1160 | __count_zone_vm_events(PGREFILL, zone, pgscanned); | 1160 | __count_zone_vm_events(PGREFILL, zone, pgscanned); |
1161 | __count_vm_events(PGDEACTIVATE, pgdeactivate); | 1161 | __count_vm_events(PGDEACTIVATE, pgdeactivate); |
1162 | spin_unlock_irq(&zone->lru_lock); | 1162 | spin_unlock_irq(&zone->lru_lock); |
1163 | 1163 | ||
1164 | pagevec_release(&pvec); | 1164 | pagevec_release(&pvec); |
1165 | } | 1165 | } |
1166 | 1166 | ||
1167 | /* | 1167 | /* |
1168 | * This is a basic per-zone page freer. Used by both kswapd and direct reclaim. | 1168 | * This is a basic per-zone page freer. Used by both kswapd and direct reclaim. |
1169 | */ | 1169 | */ |
1170 | static unsigned long shrink_zone(int priority, struct zone *zone, | 1170 | static unsigned long shrink_zone(int priority, struct zone *zone, |
1171 | struct scan_control *sc) | 1171 | struct scan_control *sc) |
1172 | { | 1172 | { |
1173 | unsigned long nr_active; | 1173 | unsigned long nr_active; |
1174 | unsigned long nr_inactive; | 1174 | unsigned long nr_inactive; |
1175 | unsigned long nr_to_scan; | 1175 | unsigned long nr_to_scan; |
1176 | unsigned long nr_reclaimed = 0; | 1176 | unsigned long nr_reclaimed = 0; |
1177 | 1177 | ||
1178 | if (scan_global_lru(sc)) { | 1178 | if (scan_global_lru(sc)) { |
1179 | /* | 1179 | /* |
1180 | * Add one to nr_to_scan just to make sure that the kernel | 1180 | * Add one to nr_to_scan just to make sure that the kernel |
1181 | * will slowly sift through the active list. | 1181 | * will slowly sift through the active list. |
1182 | */ | 1182 | */ |
1183 | zone->nr_scan_active += | 1183 | zone->nr_scan_active += |
1184 | (zone_page_state(zone, NR_ACTIVE) >> priority) + 1; | 1184 | (zone_page_state(zone, NR_ACTIVE) >> priority) + 1; |
1185 | nr_active = zone->nr_scan_active; | 1185 | nr_active = zone->nr_scan_active; |
1186 | zone->nr_scan_inactive += | 1186 | zone->nr_scan_inactive += |
1187 | (zone_page_state(zone, NR_INACTIVE) >> priority) + 1; | 1187 | (zone_page_state(zone, NR_INACTIVE) >> priority) + 1; |
1188 | nr_inactive = zone->nr_scan_inactive; | 1188 | nr_inactive = zone->nr_scan_inactive; |
1189 | if (nr_inactive >= sc->swap_cluster_max) | 1189 | if (nr_inactive >= sc->swap_cluster_max) |
1190 | zone->nr_scan_inactive = 0; | 1190 | zone->nr_scan_inactive = 0; |
1191 | else | 1191 | else |
1192 | nr_inactive = 0; | 1192 | nr_inactive = 0; |
1193 | 1193 | ||
1194 | if (nr_active >= sc->swap_cluster_max) | 1194 | if (nr_active >= sc->swap_cluster_max) |
1195 | zone->nr_scan_active = 0; | 1195 | zone->nr_scan_active = 0; |
1196 | else | 1196 | else |
1197 | nr_active = 0; | 1197 | nr_active = 0; |
1198 | } else { | 1198 | } else { |
1199 | /* | 1199 | /* |
1200 | * This reclaim occurs not because of a zone memory shortage but | 1200 | * This reclaim occurs not because of a zone memory shortage but |
1201 | * because the memory controller has hit its limit. | 1201 | * because the memory controller has hit its limit. |
1202 | * In that case, don't modify zone reclaim related data. | 1202 | * In that case, don't modify zone reclaim related data. |
1203 | */ | 1203 | */ |
1204 | nr_active = mem_cgroup_calc_reclaim_active(sc->mem_cgroup, | 1204 | nr_active = mem_cgroup_calc_reclaim_active(sc->mem_cgroup, |
1205 | zone, priority); | 1205 | zone, priority); |
1206 | 1206 | ||
1207 | nr_inactive = mem_cgroup_calc_reclaim_inactive(sc->mem_cgroup, | 1207 | nr_inactive = mem_cgroup_calc_reclaim_inactive(sc->mem_cgroup, |
1208 | zone, priority); | 1208 | zone, priority); |
1209 | } | 1209 | } |
1210 | 1210 | ||
1211 | 1211 | ||
1212 | while (nr_active || nr_inactive) { | 1212 | while (nr_active || nr_inactive) { |
1213 | if (nr_active) { | 1213 | if (nr_active) { |
1214 | nr_to_scan = min(nr_active, | 1214 | nr_to_scan = min(nr_active, |
1215 | (unsigned long)sc->swap_cluster_max); | 1215 | (unsigned long)sc->swap_cluster_max); |
1216 | nr_active -= nr_to_scan; | 1216 | nr_active -= nr_to_scan; |
1217 | shrink_active_list(nr_to_scan, zone, sc, priority); | 1217 | shrink_active_list(nr_to_scan, zone, sc, priority); |
1218 | } | 1218 | } |
1219 | 1219 | ||
1220 | if (nr_inactive) { | 1220 | if (nr_inactive) { |
1221 | nr_to_scan = min(nr_inactive, | 1221 | nr_to_scan = min(nr_inactive, |
1222 | (unsigned long)sc->swap_cluster_max); | 1222 | (unsigned long)sc->swap_cluster_max); |
1223 | nr_inactive -= nr_to_scan; | 1223 | nr_inactive -= nr_to_scan; |
1224 | nr_reclaimed += shrink_inactive_list(nr_to_scan, zone, | 1224 | nr_reclaimed += shrink_inactive_list(nr_to_scan, zone, |
1225 | sc); | 1225 | sc); |
1226 | } | 1226 | } |
1227 | } | 1227 | } |
1228 | 1228 | ||
1229 | throttle_vm_writeout(sc->gfp_mask); | 1229 | throttle_vm_writeout(sc->gfp_mask); |
1230 | return nr_reclaimed; | 1230 | return nr_reclaimed; |
1231 | } | 1231 | } |
1232 | 1232 | ||
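A quick worked example of the scan batching in shrink_zone() above (assuming the usual DEF_PRIORITY of 12 and SWAP_CLUSTER_MAX of 32; both are assumptions about the configuration rather than values shown in this hunk): with 100,000 pages on a zone's active list, one call at priority 12 adds (100000 >> 12) + 1 = 25 to zone->nr_scan_active. That is still below swap_cluster_max, so nothing is scanned on this call and the credit carries over; once the accumulator reaches 32, it is handed to shrink_active_list() in chunks of at most swap_cluster_max pages. At priority 0, the same zone would queue all 100,001 pages for scanning in a single pass.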
1233 | /* | 1233 | /* |
1234 | * This is the direct reclaim path, for page-allocating processes. We only | 1234 | * This is the direct reclaim path, for page-allocating processes. We only |
1235 | * try to reclaim pages from zones which will satisfy the caller's allocation | 1235 | * try to reclaim pages from zones which will satisfy the caller's allocation |
1236 | * request. | 1236 | * request. |
1237 | * | 1237 | * |
1238 | * We reclaim from a zone even if that zone is over pages_high. Because: | 1238 | * We reclaim from a zone even if that zone is over pages_high. Because: |
1239 | * a) The caller may be trying to free *extra* pages to satisfy a higher-order | 1239 | * a) The caller may be trying to free *extra* pages to satisfy a higher-order |
1240 | * allocation or | 1240 | * allocation or |
1241 | * b) The zones may be over pages_high but they must go *over* pages_high to | 1241 | * b) The zones may be over pages_high but they must go *over* pages_high to |
1242 | * satisfy the `incremental min' zone defense algorithm. | 1242 | * satisfy the `incremental min' zone defense algorithm. |
1243 | * | 1243 | * |
1244 | * Returns the number of reclaimed pages. | 1244 | * Returns the number of reclaimed pages. |
1245 | * | 1245 | * |
1246 | * If a zone is deemed to be full of pinned pages then just give it a light | 1246 | * If a zone is deemed to be full of pinned pages then just give it a light |
1247 | * scan then give up on it. | 1247 | * scan then give up on it. |
1248 | */ | 1248 | */ |
1249 | static unsigned long shrink_zones(int priority, struct zonelist *zonelist, | 1249 | static unsigned long shrink_zones(int priority, struct zonelist *zonelist, |
1250 | struct scan_control *sc) | 1250 | struct scan_control *sc) |
1251 | { | 1251 | { |
1252 | enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask); | 1252 | enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask); |
1253 | unsigned long nr_reclaimed = 0; | 1253 | unsigned long nr_reclaimed = 0; |
1254 | struct zoneref *z; | 1254 | struct zoneref *z; |
1255 | struct zone *zone; | 1255 | struct zone *zone; |
1256 | 1256 | ||
1257 | sc->all_unreclaimable = 1; | 1257 | sc->all_unreclaimable = 1; |
1258 | for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) { | 1258 | for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) { |
1259 | if (!populated_zone(zone)) | 1259 | if (!populated_zone(zone)) |
1260 | continue; | 1260 | continue; |
1261 | /* | 1261 | /* |
1262 | * Take care: memory controller reclaiming has little influence | 1262 | * Take care: memory controller reclaiming has little influence |
1263 | * on the global LRU. | 1263 | * on the global LRU. |
1264 | */ | 1264 | */ |
1265 | if (scan_global_lru(sc)) { | 1265 | if (scan_global_lru(sc)) { |
1266 | if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL)) | 1266 | if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL)) |
1267 | continue; | 1267 | continue; |
1268 | note_zone_scanning_priority(zone, priority); | 1268 | note_zone_scanning_priority(zone, priority); |
1269 | 1269 | ||
1270 | if (zone_is_all_unreclaimable(zone) && | 1270 | if (zone_is_all_unreclaimable(zone) && |
1271 | priority != DEF_PRIORITY) | 1271 | priority != DEF_PRIORITY) |
1272 | continue; /* Let kswapd poll it */ | 1272 | continue; /* Let kswapd poll it */ |
1273 | sc->all_unreclaimable = 0; | 1273 | sc->all_unreclaimable = 0; |
1274 | } else { | 1274 | } else { |
1275 | /* | 1275 | /* |
1276 | * Ignore the cpuset limitation here. We just want to reduce | 1276 | * Ignore the cpuset limitation here. We just want to reduce |
1277 | * the # of pages we use, regardless of memory shortage. | 1277 | * the # of pages we use, regardless of memory shortage. |
1278 | */ | 1278 | */ |
1279 | sc->all_unreclaimable = 0; | 1279 | sc->all_unreclaimable = 0; |
1280 | mem_cgroup_note_reclaim_priority(sc->mem_cgroup, | 1280 | mem_cgroup_note_reclaim_priority(sc->mem_cgroup, |
1281 | priority); | 1281 | priority); |
1282 | } | 1282 | } |
1283 | 1283 | ||
1284 | nr_reclaimed += shrink_zone(priority, zone, sc); | 1284 | nr_reclaimed += shrink_zone(priority, zone, sc); |
1285 | } | 1285 | } |
1286 | 1286 | ||
1287 | return nr_reclaimed; | 1287 | return nr_reclaimed; |
1288 | } | 1288 | } |
1289 | 1289 | ||
1290 | /* | 1290 | /* |
1291 | * This is the main entry point to direct page reclaim. | 1291 | * This is the main entry point to direct page reclaim. |
1292 | * | 1292 | * |
1293 | * If a full scan of the inactive list fails to free enough memory then we | 1293 | * If a full scan of the inactive list fails to free enough memory then we |
1294 | * are "out of memory" and something needs to be killed. | 1294 | * are "out of memory" and something needs to be killed. |
1295 | * | 1295 | * |
1296 | * If the caller is !__GFP_FS then the probability of a failure is reasonably | 1296 | * If the caller is !__GFP_FS then the probability of a failure is reasonably |
1297 | * high - the zone may be full of dirty or under-writeback pages, which this | 1297 | * high - the zone may be full of dirty or under-writeback pages, which this |
1298 | * caller can't do much about. We kick pdflush and take explicit naps in the | 1298 | * caller can't do much about. We kick pdflush and take explicit naps in the |
1299 | * hope that some of these pages can be written. But if the allocating task | 1299 | * hope that some of these pages can be written. But if the allocating task |
1300 | * holds filesystem locks which prevent writeout this might not work, and the | 1300 | * holds filesystem locks which prevent writeout this might not work, and the |
1301 | * allocation attempt will fail. | 1301 | * allocation attempt will fail. |
1302 | * | ||
1303 | * returns: 0, if no pages reclaimed | ||
1304 | * else, the number of pages reclaimed | ||
1302 | */ | 1305 | */ |
1303 | static unsigned long do_try_to_free_pages(struct zonelist *zonelist, | 1306 | static unsigned long do_try_to_free_pages(struct zonelist *zonelist, |
1304 | struct scan_control *sc) | 1307 | struct scan_control *sc) |
1305 | { | 1308 | { |
1306 | int priority; | 1309 | int priority; |
1307 | int ret = 0; | 1310 | int ret = 0; |
1308 | unsigned long total_scanned = 0; | 1311 | unsigned long total_scanned = 0; |
1309 | unsigned long nr_reclaimed = 0; | 1312 | unsigned long nr_reclaimed = 0; |
1310 | struct reclaim_state *reclaim_state = current->reclaim_state; | 1313 | struct reclaim_state *reclaim_state = current->reclaim_state; |
1311 | unsigned long lru_pages = 0; | 1314 | unsigned long lru_pages = 0; |
1312 | struct zoneref *z; | 1315 | struct zoneref *z; |
1313 | struct zone *zone; | 1316 | struct zone *zone; |
1314 | enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask); | 1317 | enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask); |
1315 | 1318 | ||
1316 | if (scan_global_lru(sc)) | 1319 | if (scan_global_lru(sc)) |
1317 | count_vm_event(ALLOCSTALL); | 1320 | count_vm_event(ALLOCSTALL); |
1318 | /* | 1321 | /* |
1319 | * mem_cgroup will not do shrink_slab. | 1322 | * mem_cgroup will not do shrink_slab. |
1320 | */ | 1323 | */ |
1321 | if (scan_global_lru(sc)) { | 1324 | if (scan_global_lru(sc)) { |
1322 | for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) { | 1325 | for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) { |
1323 | 1326 | ||
1324 | if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL)) | 1327 | if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL)) |
1325 | continue; | 1328 | continue; |
1326 | 1329 | ||
1327 | lru_pages += zone_page_state(zone, NR_ACTIVE) | 1330 | lru_pages += zone_page_state(zone, NR_ACTIVE) |
1328 | + zone_page_state(zone, NR_INACTIVE); | 1331 | + zone_page_state(zone, NR_INACTIVE); |
1329 | } | 1332 | } |
1330 | } | 1333 | } |
1331 | 1334 | ||
1332 | for (priority = DEF_PRIORITY; priority >= 0; priority--) { | 1335 | for (priority = DEF_PRIORITY; priority >= 0; priority--) { |
1333 | sc->nr_scanned = 0; | 1336 | sc->nr_scanned = 0; |
1334 | if (!priority) | 1337 | if (!priority) |
1335 | disable_swap_token(); | 1338 | disable_swap_token(); |
1336 | nr_reclaimed += shrink_zones(priority, zonelist, sc); | 1339 | nr_reclaimed += shrink_zones(priority, zonelist, sc); |
1337 | /* | 1340 | /* |
1338 | * Don't shrink slabs when reclaiming memory from | 1341 | * Don't shrink slabs when reclaiming memory from |
1339 | * over limit cgroups | 1342 | * over limit cgroups |
1340 | */ | 1343 | */ |
1341 | if (scan_global_lru(sc)) { | 1344 | if (scan_global_lru(sc)) { |
1342 | shrink_slab(sc->nr_scanned, sc->gfp_mask, lru_pages); | 1345 | shrink_slab(sc->nr_scanned, sc->gfp_mask, lru_pages); |
1343 | if (reclaim_state) { | 1346 | if (reclaim_state) { |
1344 | nr_reclaimed += reclaim_state->reclaimed_slab; | 1347 | nr_reclaimed += reclaim_state->reclaimed_slab; |
1345 | reclaim_state->reclaimed_slab = 0; | 1348 | reclaim_state->reclaimed_slab = 0; |
1346 | } | 1349 | } |
1347 | } | 1350 | } |
1348 | total_scanned += sc->nr_scanned; | 1351 | total_scanned += sc->nr_scanned; |
1349 | if (nr_reclaimed >= sc->swap_cluster_max) { | 1352 | if (nr_reclaimed >= sc->swap_cluster_max) { |
1350 | ret = 1; | 1353 | ret = nr_reclaimed; |
1351 | goto out; | 1354 | goto out; |
1352 | } | 1355 | } |
1353 | 1356 | ||
1354 | /* | 1357 | /* |
1355 | * Try to write back as many pages as we just scanned. This | 1358 | * Try to write back as many pages as we just scanned. This |
1356 | * tends to cause slow streaming writers to write data to the | 1359 | * tends to cause slow streaming writers to write data to the |
1357 | * disk smoothly, at the dirtying rate, which is nice. But | 1360 | * disk smoothly, at the dirtying rate, which is nice. But |
1358 | * that's undesirable in laptop mode, where we *want* lumpy | 1361 | * that's undesirable in laptop mode, where we *want* lumpy |
1359 | * writeout. So in laptop mode, write out the whole world. | 1362 | * writeout. So in laptop mode, write out the whole world. |
1360 | */ | 1363 | */ |
1361 | if (total_scanned > sc->swap_cluster_max + | 1364 | if (total_scanned > sc->swap_cluster_max + |
1362 | sc->swap_cluster_max / 2) { | 1365 | sc->swap_cluster_max / 2) { |
1363 | wakeup_pdflush(laptop_mode ? 0 : total_scanned); | 1366 | wakeup_pdflush(laptop_mode ? 0 : total_scanned); |
1364 | sc->may_writepage = 1; | 1367 | sc->may_writepage = 1; |
1365 | } | 1368 | } |
1366 | 1369 | ||
1367 | /* Take a nap, wait for some writeback to complete */ | 1370 | /* Take a nap, wait for some writeback to complete */ |
1368 | if (sc->nr_scanned && priority < DEF_PRIORITY - 2) | 1371 | if (sc->nr_scanned && priority < DEF_PRIORITY - 2) |
1369 | congestion_wait(WRITE, HZ/10); | 1372 | congestion_wait(WRITE, HZ/10); |
1370 | } | 1373 | } |
1371 | /* top priority shrink_caches still had more to do? don't OOM, then */ | 1374 | /* top priority shrink_caches still had more to do? don't OOM, then */ |
1372 | if (!sc->all_unreclaimable && scan_global_lru(sc)) | 1375 | if (!sc->all_unreclaimable && scan_global_lru(sc)) |
1373 | ret = 1; | 1376 | ret = nr_reclaimed; |
1374 | out: | 1377 | out: |
1375 | /* | 1378 | /* |
1376 | * Now that we've scanned all the zones at this priority level, note | 1379 | * Now that we've scanned all the zones at this priority level, note |
1377 | * that level within the zone so that the next thread which performs | 1380 | * that level within the zone so that the next thread which performs |
1378 | * scanning of this zone will immediately start out at this priority | 1381 | * scanning of this zone will immediately start out at this priority |
1379 | * level. This affects only the decision whether or not to bring | 1382 | * level. This affects only the decision whether or not to bring |
1380 | * mapped pages onto the inactive list. | 1383 | * mapped pages onto the inactive list. |
1381 | */ | 1384 | */ |
1382 | if (priority < 0) | 1385 | if (priority < 0) |
1383 | priority = 0; | 1386 | priority = 0; |
1384 | 1387 | ||
1385 | if (scan_global_lru(sc)) { | 1388 | if (scan_global_lru(sc)) { |
1386 | for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) { | 1389 | for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) { |
1387 | 1390 | ||
1388 | if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL)) | 1391 | if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL)) |
1389 | continue; | 1392 | continue; |
1390 | 1393 | ||
1391 | zone->prev_priority = priority; | 1394 | zone->prev_priority = priority; |
1392 | } | 1395 | } |
1393 | } else | 1396 | } else |
1394 | mem_cgroup_record_reclaim_priority(sc->mem_cgroup, priority); | 1397 | mem_cgroup_record_reclaim_priority(sc->mem_cgroup, priority); |
1395 | 1398 | ||
1396 | return ret; | 1399 | return ret; |
1397 | } | 1400 | } |
1398 | 1401 | ||
1399 | unsigned long try_to_free_pages(struct zonelist *zonelist, int order, | 1402 | unsigned long try_to_free_pages(struct zonelist *zonelist, int order, |
1400 | gfp_t gfp_mask) | 1403 | gfp_t gfp_mask) |
1401 | { | 1404 | { |
1402 | struct scan_control sc = { | 1405 | struct scan_control sc = { |
1403 | .gfp_mask = gfp_mask, | 1406 | .gfp_mask = gfp_mask, |
1404 | .may_writepage = !laptop_mode, | 1407 | .may_writepage = !laptop_mode, |
1405 | .swap_cluster_max = SWAP_CLUSTER_MAX, | 1408 | .swap_cluster_max = SWAP_CLUSTER_MAX, |
1406 | .may_swap = 1, | 1409 | .may_swap = 1, |
1407 | .swappiness = vm_swappiness, | 1410 | .swappiness = vm_swappiness, |
1408 | .order = order, | 1411 | .order = order, |
1409 | .mem_cgroup = NULL, | 1412 | .mem_cgroup = NULL, |
1410 | .isolate_pages = isolate_pages_global, | 1413 | .isolate_pages = isolate_pages_global, |
1411 | }; | 1414 | }; |
1412 | 1415 | ||
1413 | return do_try_to_free_pages(zonelist, &sc); | 1416 | return do_try_to_free_pages(zonelist, &sc); |
1414 | } | 1417 | } |
1415 | 1418 | ||
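Note the return-value change above: do_try_to_free_pages() now hands back the number of pages reclaimed (ret = nr_reclaimed) instead of a bare 0/1, and try_to_free_pages() passes that count through to its caller. The sketch below shows the kind of caller-side retry bound this enables; it is an illustration only, not the allocator hunk of this commit, and the local names (did_some_progress, pages_reclaimed, do_retry) are assumptions made for the example.

	/* Illustrative sketch: bounding costly-order retries by the
	 * amount actually reclaimed.  Names are hypothetical. */
	unsigned long pages_reclaimed = 0;
	unsigned long did_some_progress;
	int do_retry = 0;

	did_some_progress = try_to_free_pages(zonelist, order, gfp_mask);
	pages_reclaimed += did_some_progress;

	if (!(gfp_mask & __GFP_NORETRY)) {
		if (order <= PAGE_ALLOC_COSTLY_ORDER) {
			/* cheap orders: retrying is always worthwhile */
			do_retry = 1;
		} else if ((gfp_mask & __GFP_REPEAT) &&
			   pages_reclaimed < (1UL << order)) {
			/* costly orders: keep retrying only while total
			 * reclaim is still short of the request size */
			do_retry = 1;
		}
		if (gfp_mask & __GFP_NOFAIL)
			do_retry = 1;
	}

Comparing a running total against 1 << order keeps such a retry loop bounded while still allowing several passes when each pass frees only part of the request.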
1416 | #ifdef CONFIG_CGROUP_MEM_RES_CTLR | 1419 | #ifdef CONFIG_CGROUP_MEM_RES_CTLR |
1417 | 1420 | ||
1418 | unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont, | 1421 | unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont, |
1419 | gfp_t gfp_mask) | 1422 | gfp_t gfp_mask) |
1420 | { | 1423 | { |
1421 | struct scan_control sc = { | 1424 | struct scan_control sc = { |
1422 | .may_writepage = !laptop_mode, | 1425 | .may_writepage = !laptop_mode, |
1423 | .may_swap = 1, | 1426 | .may_swap = 1, |
1424 | .swap_cluster_max = SWAP_CLUSTER_MAX, | 1427 | .swap_cluster_max = SWAP_CLUSTER_MAX, |
1425 | .swappiness = vm_swappiness, | 1428 | .swappiness = vm_swappiness, |
1426 | .order = 0, | 1429 | .order = 0, |
1427 | .mem_cgroup = mem_cont, | 1430 | .mem_cgroup = mem_cont, |
1428 | .isolate_pages = mem_cgroup_isolate_pages, | 1431 | .isolate_pages = mem_cgroup_isolate_pages, |
1429 | }; | 1432 | }; |
1430 | struct zonelist *zonelist; | 1433 | struct zonelist *zonelist; |
1431 | 1434 | ||
1432 | sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) | | 1435 | sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) | |
1433 | (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK); | 1436 | (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK); |
1434 | zonelist = NODE_DATA(numa_node_id())->node_zonelists; | 1437 | zonelist = NODE_DATA(numa_node_id())->node_zonelists; |
1435 | return do_try_to_free_pages(zonelist, &sc); | 1438 | return do_try_to_free_pages(zonelist, &sc); |
1436 | } | 1439 | } |
1437 | #endif | 1440 | #endif |
1438 | 1441 | ||
1439 | /* | 1442 | /* |
1440 | * For kswapd, balance_pgdat() will work across all this node's zones until | 1443 | * For kswapd, balance_pgdat() will work across all this node's zones until |
1441 | * they are all at pages_high. | 1444 | * they are all at pages_high. |
1442 | * | 1445 | * |
1443 | * Returns the number of pages which were actually freed. | 1446 | * Returns the number of pages which were actually freed. |
1444 | * | 1447 | * |
1445 | * There is special handling here for zones which are full of pinned pages. | 1448 | * There is special handling here for zones which are full of pinned pages. |
1446 | * This can happen if the pages are all mlocked, or if they are all used by | 1449 | * This can happen if the pages are all mlocked, or if they are all used by |
1447 | * device drivers (say, ZONE_DMA). Or if they are all in use by hugetlb. | 1450 | * device drivers (say, ZONE_DMA). Or if they are all in use by hugetlb. |
1448 | * What we do is to detect the case where all pages in the zone have been | 1451 | * What we do is to detect the case where all pages in the zone have been |
1449 | * scanned twice and there has been zero successful reclaim. Mark the zone as | 1452 | * scanned twice and there has been zero successful reclaim. Mark the zone as |
1450 | * dead and from now on, only perform a short scan. Basically we're polling | 1453 | * dead and from now on, only perform a short scan. Basically we're polling |
1451 | * the zone for when the problem goes away. | 1454 | * the zone for when the problem goes away. |
1452 | * | 1455 | * |
1453 | * kswapd scans the zones in the highmem->normal->dma direction. It skips | 1456 | * kswapd scans the zones in the highmem->normal->dma direction. It skips |
1454 | * zones which have free_pages > pages_high, but once a zone is found to have | 1457 | * zones which have free_pages > pages_high, but once a zone is found to have |
1455 | * free_pages <= pages_high, we scan that zone and the lower zones regardless | 1458 | * free_pages <= pages_high, we scan that zone and the lower zones regardless |
1456 | * of the number of free pages in the lower zones. This interoperates with | 1459 | * of the number of free pages in the lower zones. This interoperates with |
1457 | * the page allocator fallback scheme to ensure that aging of pages is balanced | 1460 | * the page allocator fallback scheme to ensure that aging of pages is balanced |
1458 | * across the zones. | 1461 | * across the zones. |
1459 | */ | 1462 | */ |
1460 | static unsigned long balance_pgdat(pg_data_t *pgdat, int order) | 1463 | static unsigned long balance_pgdat(pg_data_t *pgdat, int order) |
1461 | { | 1464 | { |
1462 | int all_zones_ok; | 1465 | int all_zones_ok; |
1463 | int priority; | 1466 | int priority; |
1464 | int i; | 1467 | int i; |
1465 | unsigned long total_scanned; | 1468 | unsigned long total_scanned; |
1466 | unsigned long nr_reclaimed; | 1469 | unsigned long nr_reclaimed; |
1467 | struct reclaim_state *reclaim_state = current->reclaim_state; | 1470 | struct reclaim_state *reclaim_state = current->reclaim_state; |
1468 | struct scan_control sc = { | 1471 | struct scan_control sc = { |
1469 | .gfp_mask = GFP_KERNEL, | 1472 | .gfp_mask = GFP_KERNEL, |
1470 | .may_swap = 1, | 1473 | .may_swap = 1, |
1471 | .swap_cluster_max = SWAP_CLUSTER_MAX, | 1474 | .swap_cluster_max = SWAP_CLUSTER_MAX, |
1472 | .swappiness = vm_swappiness, | 1475 | .swappiness = vm_swappiness, |
1473 | .order = order, | 1476 | .order = order, |
1474 | .mem_cgroup = NULL, | 1477 | .mem_cgroup = NULL, |
1475 | .isolate_pages = isolate_pages_global, | 1478 | .isolate_pages = isolate_pages_global, |
1476 | }; | 1479 | }; |
1477 | /* | 1480 | /* |
1478 | * temp_priority is used to remember the scanning priority at which | 1481 | * temp_priority is used to remember the scanning priority at which |
1479 | * this zone was successfully refilled to free_pages == pages_high. | 1482 | * this zone was successfully refilled to free_pages == pages_high. |
1480 | */ | 1483 | */ |
1481 | int temp_priority[MAX_NR_ZONES]; | 1484 | int temp_priority[MAX_NR_ZONES]; |
1482 | 1485 | ||
1483 | loop_again: | 1486 | loop_again: |
1484 | total_scanned = 0; | 1487 | total_scanned = 0; |
1485 | nr_reclaimed = 0; | 1488 | nr_reclaimed = 0; |
1486 | sc.may_writepage = !laptop_mode; | 1489 | sc.may_writepage = !laptop_mode; |
1487 | count_vm_event(PAGEOUTRUN); | 1490 | count_vm_event(PAGEOUTRUN); |
1488 | 1491 | ||
1489 | for (i = 0; i < pgdat->nr_zones; i++) | 1492 | for (i = 0; i < pgdat->nr_zones; i++) |
1490 | temp_priority[i] = DEF_PRIORITY; | 1493 | temp_priority[i] = DEF_PRIORITY; |
1491 | 1494 | ||
1492 | for (priority = DEF_PRIORITY; priority >= 0; priority--) { | 1495 | for (priority = DEF_PRIORITY; priority >= 0; priority--) { |
1493 | int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */ | 1496 | int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */ |
1494 | unsigned long lru_pages = 0; | 1497 | unsigned long lru_pages = 0; |
1495 | 1498 | ||
1496 | /* The swap token gets in the way of swapout... */ | 1499 | /* The swap token gets in the way of swapout... */ |
1497 | if (!priority) | 1500 | if (!priority) |
1498 | disable_swap_token(); | 1501 | disable_swap_token(); |
1499 | 1502 | ||
1500 | all_zones_ok = 1; | 1503 | all_zones_ok = 1; |
1501 | 1504 | ||
1502 | /* | 1505 | /* |
1503 | * Scan in the highmem->dma direction for the highest | 1506 | * Scan in the highmem->dma direction for the highest |
1504 | * zone which needs scanning | 1507 | * zone which needs scanning |
1505 | */ | 1508 | */ |
1506 | for (i = pgdat->nr_zones - 1; i >= 0; i--) { | 1509 | for (i = pgdat->nr_zones - 1; i >= 0; i--) { |
1507 | struct zone *zone = pgdat->node_zones + i; | 1510 | struct zone *zone = pgdat->node_zones + i; |
1508 | 1511 | ||
1509 | if (!populated_zone(zone)) | 1512 | if (!populated_zone(zone)) |
1510 | continue; | 1513 | continue; |
1511 | 1514 | ||
1512 | if (zone_is_all_unreclaimable(zone) && | 1515 | if (zone_is_all_unreclaimable(zone) && |
1513 | priority != DEF_PRIORITY) | 1516 | priority != DEF_PRIORITY) |
1514 | continue; | 1517 | continue; |
1515 | 1518 | ||
1516 | if (!zone_watermark_ok(zone, order, zone->pages_high, | 1519 | if (!zone_watermark_ok(zone, order, zone->pages_high, |
1517 | 0, 0)) { | 1520 | 0, 0)) { |
1518 | end_zone = i; | 1521 | end_zone = i; |
1519 | break; | 1522 | break; |
1520 | } | 1523 | } |
1521 | } | 1524 | } |
1522 | if (i < 0) | 1525 | if (i < 0) |
1523 | goto out; | 1526 | goto out; |
1524 | 1527 | ||
1525 | for (i = 0; i <= end_zone; i++) { | 1528 | for (i = 0; i <= end_zone; i++) { |
1526 | struct zone *zone = pgdat->node_zones + i; | 1529 | struct zone *zone = pgdat->node_zones + i; |
1527 | 1530 | ||
1528 | lru_pages += zone_page_state(zone, NR_ACTIVE) | 1531 | lru_pages += zone_page_state(zone, NR_ACTIVE) |
1529 | + zone_page_state(zone, NR_INACTIVE); | 1532 | + zone_page_state(zone, NR_INACTIVE); |
1530 | } | 1533 | } |
1531 | 1534 | ||
1532 | /* | 1535 | /* |
1533 | * Now scan the zone in the dma->highmem direction, stopping | 1536 | * Now scan the zone in the dma->highmem direction, stopping |
1534 | * at the last zone which needs scanning. | 1537 | * at the last zone which needs scanning. |
1535 | * | 1538 | * |
1536 | * We do this because the page allocator works in the opposite | 1539 | * We do this because the page allocator works in the opposite |
1537 | * direction. This prevents the page allocator from allocating | 1540 | * direction. This prevents the page allocator from allocating |
1538 | * pages behind kswapd's direction of progress, which would | 1541 | * pages behind kswapd's direction of progress, which would |
1539 | * cause too much scanning of the lower zones. | 1542 | * cause too much scanning of the lower zones. |
1540 | */ | 1543 | */ |
1541 | for (i = 0; i <= end_zone; i++) { | 1544 | for (i = 0; i <= end_zone; i++) { |
1542 | struct zone *zone = pgdat->node_zones + i; | 1545 | struct zone *zone = pgdat->node_zones + i; |
1543 | int nr_slab; | 1546 | int nr_slab; |
1544 | 1547 | ||
1545 | if (!populated_zone(zone)) | 1548 | if (!populated_zone(zone)) |
1546 | continue; | 1549 | continue; |
1547 | 1550 | ||
1548 | if (zone_is_all_unreclaimable(zone) && | 1551 | if (zone_is_all_unreclaimable(zone) && |
1549 | priority != DEF_PRIORITY) | 1552 | priority != DEF_PRIORITY) |
1550 | continue; | 1553 | continue; |
1551 | 1554 | ||
1552 | if (!zone_watermark_ok(zone, order, zone->pages_high, | 1555 | if (!zone_watermark_ok(zone, order, zone->pages_high, |
1553 | end_zone, 0)) | 1556 | end_zone, 0)) |
1554 | all_zones_ok = 0; | 1557 | all_zones_ok = 0; |
1555 | temp_priority[i] = priority; | 1558 | temp_priority[i] = priority; |
1556 | sc.nr_scanned = 0; | 1559 | sc.nr_scanned = 0; |
1557 | note_zone_scanning_priority(zone, priority); | 1560 | note_zone_scanning_priority(zone, priority); |
1558 | /* | 1561 | /* |
1559 | * We put equal pressure on every zone, unless one | 1562 | * We put equal pressure on every zone, unless one |
1560 | * zone has way too many pages free already. | 1563 | * zone has way too many pages free already. |
1561 | */ | 1564 | */ |
1562 | if (!zone_watermark_ok(zone, order, 8*zone->pages_high, | 1565 | if (!zone_watermark_ok(zone, order, 8*zone->pages_high, |
1563 | end_zone, 0)) | 1566 | end_zone, 0)) |
1564 | nr_reclaimed += shrink_zone(priority, zone, &sc); | 1567 | nr_reclaimed += shrink_zone(priority, zone, &sc); |
1565 | reclaim_state->reclaimed_slab = 0; | 1568 | reclaim_state->reclaimed_slab = 0; |
1566 | nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL, | 1569 | nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL, |
1567 | lru_pages); | 1570 | lru_pages); |
1568 | nr_reclaimed += reclaim_state->reclaimed_slab; | 1571 | nr_reclaimed += reclaim_state->reclaimed_slab; |
1569 | total_scanned += sc.nr_scanned; | 1572 | total_scanned += sc.nr_scanned; |
1570 | if (zone_is_all_unreclaimable(zone)) | 1573 | if (zone_is_all_unreclaimable(zone)) |
1571 | continue; | 1574 | continue; |
1572 | if (nr_slab == 0 && zone->pages_scanned >= | 1575 | if (nr_slab == 0 && zone->pages_scanned >= |
1573 | (zone_page_state(zone, NR_ACTIVE) | 1576 | (zone_page_state(zone, NR_ACTIVE) |
1574 | + zone_page_state(zone, NR_INACTIVE)) * 6) | 1577 | + zone_page_state(zone, NR_INACTIVE)) * 6) |
1575 | zone_set_flag(zone, | 1578 | zone_set_flag(zone, |
1576 | ZONE_ALL_UNRECLAIMABLE); | 1579 | ZONE_ALL_UNRECLAIMABLE); |
1577 | /* | 1580 | /* |
1578 | * If we've done a decent amount of scanning and | 1581 | * If we've done a decent amount of scanning and |
1579 | * the reclaim ratio is low, start doing writepage | 1582 | * the reclaim ratio is low, start doing writepage |
1580 | * even in laptop mode | 1583 | * even in laptop mode |
1581 | */ | 1584 | */ |
1582 | if (total_scanned > SWAP_CLUSTER_MAX * 2 && | 1585 | if (total_scanned > SWAP_CLUSTER_MAX * 2 && |
1583 | total_scanned > nr_reclaimed + nr_reclaimed / 2) | 1586 | total_scanned > nr_reclaimed + nr_reclaimed / 2) |
1584 | sc.may_writepage = 1; | 1587 | sc.may_writepage = 1; |
1585 | } | 1588 | } |
1586 | if (all_zones_ok) | 1589 | if (all_zones_ok) |
1587 | break; /* kswapd: all done */ | 1590 | break; /* kswapd: all done */ |
1588 | /* | 1591 | /* |
1589 | * OK, kswapd is getting into trouble. Take a nap, then take | 1592 | * OK, kswapd is getting into trouble. Take a nap, then take |
1590 | * another pass across the zones. | 1593 | * another pass across the zones. |
1591 | */ | 1594 | */ |
1592 | if (total_scanned && priority < DEF_PRIORITY - 2) | 1595 | if (total_scanned && priority < DEF_PRIORITY - 2) |
1593 | congestion_wait(WRITE, HZ/10); | 1596 | congestion_wait(WRITE, HZ/10); |
1594 | 1597 | ||
1595 | /* | 1598 | /* |
1596 | * We do this so kswapd doesn't build up large priorities for | 1599 | * We do this so kswapd doesn't build up large priorities for |
1597 | * example when it is freeing in parallel with allocators. It | 1600 | * example when it is freeing in parallel with allocators. It |
1598 | * matches the direct reclaim path behaviour in terms of impact | 1601 | * matches the direct reclaim path behaviour in terms of impact |
1599 | * on zone->*_priority. | 1602 | * on zone->*_priority. |
1600 | */ | 1603 | */ |
1601 | if (nr_reclaimed >= SWAP_CLUSTER_MAX) | 1604 | if (nr_reclaimed >= SWAP_CLUSTER_MAX) |
1602 | break; | 1605 | break; |
1603 | } | 1606 | } |
1604 | out: | 1607 | out: |
1605 | /* | 1608 | /* |
1606 | * Note within each zone the priority level at which this zone was | 1609 | * Note within each zone the priority level at which this zone was |
1607 | * brought into a happy state, so that the next thread which scans this | 1610 | * brought into a happy state, so that the next thread which scans this |
1608 | * zone will start out at that priority level. | 1611 | * zone will start out at that priority level. |
1609 | */ | 1612 | */ |
1610 | for (i = 0; i < pgdat->nr_zones; i++) { | 1613 | for (i = 0; i < pgdat->nr_zones; i++) { |
1611 | struct zone *zone = pgdat->node_zones + i; | 1614 | struct zone *zone = pgdat->node_zones + i; |
1612 | 1615 | ||
1613 | zone->prev_priority = temp_priority[i]; | 1616 | zone->prev_priority = temp_priority[i]; |
1614 | } | 1617 | } |
1615 | if (!all_zones_ok) { | 1618 | if (!all_zones_ok) { |
1616 | cond_resched(); | 1619 | cond_resched(); |
1617 | 1620 | ||
1618 | try_to_freeze(); | 1621 | try_to_freeze(); |
1619 | 1622 | ||
1620 | goto loop_again; | 1623 | goto loop_again; |
1621 | } | 1624 | } |
1622 | 1625 | ||
1623 | return nr_reclaimed; | 1626 | return nr_reclaimed; |
1624 | } | 1627 | } |
1625 | 1628 | ||
1626 | /* | 1629 | /* |
1627 | * The background pageout daemon, started as a kernel thread | 1630 | * The background pageout daemon, started as a kernel thread |
1628 | * from the init process. | 1631 | * from the init process. |
1629 | * | 1632 | * |
1630 | * This basically trickles out pages so that we have _some_ | 1633 | * This basically trickles out pages so that we have _some_ |
1631 | * free memory available even if there is no other activity | 1634 | * free memory available even if there is no other activity |
1632 | * that frees anything up. This is needed for things like routing | 1635 | * that frees anything up. This is needed for things like routing |
1633 | * etc, where we otherwise might have all activity going on in | 1636 | * etc, where we otherwise might have all activity going on in |
1634 | * asynchronous contexts that cannot page things out. | 1637 | * asynchronous contexts that cannot page things out. |
1635 | * | 1638 | * |
1636 | * If there are applications that are active memory-allocators | 1639 | * If there are applications that are active memory-allocators |
1637 | * (most normal use), this basically shouldn't matter. | 1640 | * (most normal use), this basically shouldn't matter. |
1638 | */ | 1641 | */ |
1639 | static int kswapd(void *p) | 1642 | static int kswapd(void *p) |
1640 | { | 1643 | { |
1641 | unsigned long order; | 1644 | unsigned long order; |
1642 | pg_data_t *pgdat = (pg_data_t*)p; | 1645 | pg_data_t *pgdat = (pg_data_t*)p; |
1643 | struct task_struct *tsk = current; | 1646 | struct task_struct *tsk = current; |
1644 | DEFINE_WAIT(wait); | 1647 | DEFINE_WAIT(wait); |
1645 | struct reclaim_state reclaim_state = { | 1648 | struct reclaim_state reclaim_state = { |
1646 | .reclaimed_slab = 0, | 1649 | .reclaimed_slab = 0, |
1647 | }; | 1650 | }; |
1648 | node_to_cpumask_ptr(cpumask, pgdat->node_id); | 1651 | node_to_cpumask_ptr(cpumask, pgdat->node_id); |
1649 | 1652 | ||
1650 | if (!cpus_empty(*cpumask)) | 1653 | if (!cpus_empty(*cpumask)) |
1651 | set_cpus_allowed_ptr(tsk, cpumask); | 1654 | set_cpus_allowed_ptr(tsk, cpumask); |
1652 | current->reclaim_state = &reclaim_state; | 1655 | current->reclaim_state = &reclaim_state; |
1653 | 1656 | ||
1654 | /* | 1657 | /* |
1655 | * Tell the memory management that we're a "memory allocator", | 1658 | * Tell the memory management that we're a "memory allocator", |
1656 | * and that if we need more memory we should get access to it | 1659 | * and that if we need more memory we should get access to it |
1657 | * regardless (see "__alloc_pages()"). "kswapd" should | 1660 | * regardless (see "__alloc_pages()"). "kswapd" should |
1658 | * never get caught in the normal page freeing logic. | 1661 | * never get caught in the normal page freeing logic. |
1659 | * | 1662 | * |
1660 | * (Kswapd normally doesn't need memory anyway, but sometimes | 1663 | * (Kswapd normally doesn't need memory anyway, but sometimes |
1661 | * you need a small amount of memory in order to be able to | 1664 | * you need a small amount of memory in order to be able to |
1662 | * page out something else, and this flag essentially protects | 1665 | * page out something else, and this flag essentially protects |
1663 | * us from recursively trying to free more memory as we're | 1666 | * us from recursively trying to free more memory as we're |
1664 | * trying to free the first piece of memory in the first place). | 1667 | * trying to free the first piece of memory in the first place). |
1665 | */ | 1668 | */ |
1666 | tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD; | 1669 | tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD; |
1667 | set_freezable(); | 1670 | set_freezable(); |
1668 | 1671 | ||
1669 | order = 0; | 1672 | order = 0; |
1670 | for ( ; ; ) { | 1673 | for ( ; ; ) { |
1671 | unsigned long new_order; | 1674 | unsigned long new_order; |
1672 | 1675 | ||
1673 | prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE); | 1676 | prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE); |
1674 | new_order = pgdat->kswapd_max_order; | 1677 | new_order = pgdat->kswapd_max_order; |
1675 | pgdat->kswapd_max_order = 0; | 1678 | pgdat->kswapd_max_order = 0; |
1676 | if (order < new_order) { | 1679 | if (order < new_order) { |
1677 | /* | 1680 | /* |
1678 | * Don't sleep if someone wants a larger 'order' | 1681 | * Don't sleep if someone wants a larger 'order' |
1679 | * allocation | 1682 | * allocation |
1680 | */ | 1683 | */ |
1681 | order = new_order; | 1684 | order = new_order; |
1682 | } else { | 1685 | } else { |
1683 | if (!freezing(current)) | 1686 | if (!freezing(current)) |
1684 | schedule(); | 1687 | schedule(); |
1685 | 1688 | ||
1686 | order = pgdat->kswapd_max_order; | 1689 | order = pgdat->kswapd_max_order; |
1687 | } | 1690 | } |
1688 | finish_wait(&pgdat->kswapd_wait, &wait); | 1691 | finish_wait(&pgdat->kswapd_wait, &wait); |
1689 | 1692 | ||
1690 | if (!try_to_freeze()) { | 1693 | if (!try_to_freeze()) { |
1691 | /* We can speed up thawing tasks if we don't call | 1694 | /* We can speed up thawing tasks if we don't call |
1692 | * balance_pgdat after returning from the refrigerator | 1695 | * balance_pgdat after returning from the refrigerator |
1693 | */ | 1696 | */ |
1694 | balance_pgdat(pgdat, order); | 1697 | balance_pgdat(pgdat, order); |
1695 | } | 1698 | } |
1696 | } | 1699 | } |
1697 | return 0; | 1700 | return 0; |
1698 | } | 1701 | } |
1699 | 1702 | ||
1700 | /* | 1703 | /* |
1701 | * A zone is low on free memory, so wake its kswapd task to service it. | 1704 | * A zone is low on free memory, so wake its kswapd task to service it. |
1702 | */ | 1705 | */ |
1703 | void wakeup_kswapd(struct zone *zone, int order) | 1706 | void wakeup_kswapd(struct zone *zone, int order) |
1704 | { | 1707 | { |
1705 | pg_data_t *pgdat; | 1708 | pg_data_t *pgdat; |
1706 | 1709 | ||
1707 | if (!populated_zone(zone)) | 1710 | if (!populated_zone(zone)) |
1708 | return; | 1711 | return; |
1709 | 1712 | ||
1710 | pgdat = zone->zone_pgdat; | 1713 | pgdat = zone->zone_pgdat; |
1711 | if (zone_watermark_ok(zone, order, zone->pages_low, 0, 0)) | 1714 | if (zone_watermark_ok(zone, order, zone->pages_low, 0, 0)) |
1712 | return; | 1715 | return; |
1713 | if (pgdat->kswapd_max_order < order) | 1716 | if (pgdat->kswapd_max_order < order) |
1714 | pgdat->kswapd_max_order = order; | 1717 | pgdat->kswapd_max_order = order; |
1715 | if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL)) | 1718 | if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL)) |
1716 | return; | 1719 | return; |
1717 | if (!waitqueue_active(&pgdat->kswapd_wait)) | 1720 | if (!waitqueue_active(&pgdat->kswapd_wait)) |
1718 | return; | 1721 | return; |
1719 | wake_up_interruptible(&pgdat->kswapd_wait); | 1722 | wake_up_interruptible(&pgdat->kswapd_wait); |
1720 | } | 1723 | } |
1721 | 1724 | ||
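wakeup_kswapd() is a per-zone nudge. An allocator that is about to enter direct reclaim typically wakes kswapd for every candidate zone first; a minimal sketch of that pattern is below, reusing the zonelist iterator already used in this file. The helper name wake_all_kswapd() is an assumption for the example, not an interface added by this commit.

	/* Illustrative helper: wake the background reclaimer of every
	 * zone that could satisfy the allocation. */
	static void wake_all_kswapd(struct zonelist *zonelist,
				    enum zone_type high_zoneidx, int order)
	{
		struct zoneref *z;
		struct zone *zone;

		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
			wakeup_kswapd(zone, order);
	}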
1722 | #ifdef CONFIG_PM | 1725 | #ifdef CONFIG_PM |
1723 | /* | 1726 | /* |
1724 | * Helper function for shrink_all_memory(). Tries to reclaim 'nr_pages' pages | 1727 | * Helper function for shrink_all_memory(). Tries to reclaim 'nr_pages' pages |
1725 | * from LRU lists system-wide, for the given pass and priority, and returns | 1728 | * from LRU lists system-wide, for the given pass and priority, and returns |
1726 | * the number of reclaimed pages. | 1729 | * the number of reclaimed pages. |
1727 | * | 1730 | * |
1728 | * For pass > 3 we also try to shrink the LRU lists that contain a few pages | 1731 | * For pass > 3 we also try to shrink the LRU lists that contain a few pages |
1729 | */ | 1732 | */ |
1730 | static unsigned long shrink_all_zones(unsigned long nr_pages, int prio, | 1733 | static unsigned long shrink_all_zones(unsigned long nr_pages, int prio, |
1731 | int pass, struct scan_control *sc) | 1734 | int pass, struct scan_control *sc) |
1732 | { | 1735 | { |
1733 | struct zone *zone; | 1736 | struct zone *zone; |
1734 | unsigned long nr_to_scan, ret = 0; | 1737 | unsigned long nr_to_scan, ret = 0; |
1735 | 1738 | ||
1736 | for_each_zone(zone) { | 1739 | for_each_zone(zone) { |
1737 | 1740 | ||
1738 | if (!populated_zone(zone)) | 1741 | if (!populated_zone(zone)) |
1739 | continue; | 1742 | continue; |
1740 | 1743 | ||
1741 | if (zone_is_all_unreclaimable(zone) && prio != DEF_PRIORITY) | 1744 | if (zone_is_all_unreclaimable(zone) && prio != DEF_PRIORITY) |
1742 | continue; | 1745 | continue; |
1743 | 1746 | ||
1744 | /* For pass = 0 we don't shrink the active list */ | 1747 | /* For pass = 0 we don't shrink the active list */ |
1745 | if (pass > 0) { | 1748 | if (pass > 0) { |
1746 | zone->nr_scan_active += | 1749 | zone->nr_scan_active += |
1747 | (zone_page_state(zone, NR_ACTIVE) >> prio) + 1; | 1750 | (zone_page_state(zone, NR_ACTIVE) >> prio) + 1; |
1748 | if (zone->nr_scan_active >= nr_pages || pass > 3) { | 1751 | if (zone->nr_scan_active >= nr_pages || pass > 3) { |
1749 | zone->nr_scan_active = 0; | 1752 | zone->nr_scan_active = 0; |
1750 | nr_to_scan = min(nr_pages, | 1753 | nr_to_scan = min(nr_pages, |
1751 | zone_page_state(zone, NR_ACTIVE)); | 1754 | zone_page_state(zone, NR_ACTIVE)); |
1752 | shrink_active_list(nr_to_scan, zone, sc, prio); | 1755 | shrink_active_list(nr_to_scan, zone, sc, prio); |
1753 | } | 1756 | } |
1754 | } | 1757 | } |
1755 | 1758 | ||
1756 | zone->nr_scan_inactive += | 1759 | zone->nr_scan_inactive += |
1757 | (zone_page_state(zone, NR_INACTIVE) >> prio) + 1; | 1760 | (zone_page_state(zone, NR_INACTIVE) >> prio) + 1; |
1758 | if (zone->nr_scan_inactive >= nr_pages || pass > 3) { | 1761 | if (zone->nr_scan_inactive >= nr_pages || pass > 3) { |
1759 | zone->nr_scan_inactive = 0; | 1762 | zone->nr_scan_inactive = 0; |
1760 | nr_to_scan = min(nr_pages, | 1763 | nr_to_scan = min(nr_pages, |
1761 | zone_page_state(zone, NR_INACTIVE)); | 1764 | zone_page_state(zone, NR_INACTIVE)); |
1762 | ret += shrink_inactive_list(nr_to_scan, zone, sc); | 1765 | ret += shrink_inactive_list(nr_to_scan, zone, sc); |
1763 | if (ret >= nr_pages) | 1766 | if (ret >= nr_pages) |
1764 | return ret; | 1767 | return ret; |
1765 | } | 1768 | } |
1766 | } | 1769 | } |
1767 | 1770 | ||
1768 | return ret; | 1771 | return ret; |
1769 | } | 1772 | } |
1770 | 1773 | ||
1771 | static unsigned long count_lru_pages(void) | 1774 | static unsigned long count_lru_pages(void) |
1772 | { | 1775 | { |
1773 | return global_page_state(NR_ACTIVE) + global_page_state(NR_INACTIVE); | 1776 | return global_page_state(NR_ACTIVE) + global_page_state(NR_INACTIVE); |
1774 | } | 1777 | } |
1775 | 1778 | ||
1776 | /* | 1779 | /* |
1777 | * Try to free `nr_pages' of memory, system-wide, and return the number of | 1780 | * Try to free `nr_pages' of memory, system-wide, and return the number of |
1778 | * freed pages. | 1781 | * freed pages. |
1779 | * | 1782 | * |
1780 | * Rather than trying to age LRUs the aim is to preserve the overall | 1783 | * Rather than trying to age LRUs the aim is to preserve the overall |
1781 | * LRU order by reclaiming preferentially | 1784 | * LRU order by reclaiming preferentially |
1782 | * inactive > active > active referenced > active mapped | 1785 | * inactive > active > active referenced > active mapped |
1783 | */ | 1786 | */ |
1784 | unsigned long shrink_all_memory(unsigned long nr_pages) | 1787 | unsigned long shrink_all_memory(unsigned long nr_pages) |
1785 | { | 1788 | { |
1786 | unsigned long lru_pages, nr_slab; | 1789 | unsigned long lru_pages, nr_slab; |
1787 | unsigned long ret = 0; | 1790 | unsigned long ret = 0; |
1788 | int pass; | 1791 | int pass; |
1789 | struct reclaim_state reclaim_state; | 1792 | struct reclaim_state reclaim_state; |
1790 | struct scan_control sc = { | 1793 | struct scan_control sc = { |
1791 | .gfp_mask = GFP_KERNEL, | 1794 | .gfp_mask = GFP_KERNEL, |
1792 | .may_swap = 0, | 1795 | .may_swap = 0, |
1793 | .swap_cluster_max = nr_pages, | 1796 | .swap_cluster_max = nr_pages, |
1794 | .may_writepage = 1, | 1797 | .may_writepage = 1, |
1795 | .swappiness = vm_swappiness, | 1798 | .swappiness = vm_swappiness, |
1796 | .isolate_pages = isolate_pages_global, | 1799 | .isolate_pages = isolate_pages_global, |
1797 | }; | 1800 | }; |
1798 | 1801 | ||
1799 | current->reclaim_state = &reclaim_state; | 1802 | current->reclaim_state = &reclaim_state; |
1800 | 1803 | ||
1801 | lru_pages = count_lru_pages(); | 1804 | lru_pages = count_lru_pages(); |
1802 | nr_slab = global_page_state(NR_SLAB_RECLAIMABLE); | 1805 | nr_slab = global_page_state(NR_SLAB_RECLAIMABLE); |
1803 | /* If slab caches are huge, it's better to hit them first */ | 1806 | /* If slab caches are huge, it's better to hit them first */ |
1804 | while (nr_slab >= lru_pages) { | 1807 | while (nr_slab >= lru_pages) { |
1805 | reclaim_state.reclaimed_slab = 0; | 1808 | reclaim_state.reclaimed_slab = 0; |
1806 | shrink_slab(nr_pages, sc.gfp_mask, lru_pages); | 1809 | shrink_slab(nr_pages, sc.gfp_mask, lru_pages); |
1807 | if (!reclaim_state.reclaimed_slab) | 1810 | if (!reclaim_state.reclaimed_slab) |
1808 | break; | 1811 | break; |
1809 | 1812 | ||
1810 | ret += reclaim_state.reclaimed_slab; | 1813 | ret += reclaim_state.reclaimed_slab; |
1811 | if (ret >= nr_pages) | 1814 | if (ret >= nr_pages) |
1812 | goto out; | 1815 | goto out; |
1813 | 1816 | ||
1814 | nr_slab -= reclaim_state.reclaimed_slab; | 1817 | nr_slab -= reclaim_state.reclaimed_slab; |
1815 | } | 1818 | } |
1816 | 1819 | ||
1817 | /* | 1820 | /* |
1818 | * We try to shrink LRUs in 5 passes: | 1821 | * We try to shrink LRUs in 5 passes: |
1819 | * 0 = Reclaim from inactive_list only | 1822 | * 0 = Reclaim from inactive_list only |
1820 | * 1 = Reclaim from active list but don't reclaim mapped | 1823 | * 1 = Reclaim from active list but don't reclaim mapped |
1821 | * 2 = 2nd pass of type 1 | 1824 | * 2 = 2nd pass of type 1 |
1822 | * 3 = Reclaim mapped (normal reclaim) | 1825 | * 3 = Reclaim mapped (normal reclaim) |
1823 | * 4 = 2nd pass of type 3 | 1826 | * 4 = 2nd pass of type 3 |
1824 | */ | 1827 | */ |
1825 | for (pass = 0; pass < 5; pass++) { | 1828 | for (pass = 0; pass < 5; pass++) { |
1826 | int prio; | 1829 | int prio; |
1827 | 1830 | ||
1828 | /* Force reclaiming mapped pages in the passes #3 and #4 */ | 1831 | /* Force reclaiming mapped pages in the passes #3 and #4 */ |
1829 | if (pass > 2) { | 1832 | if (pass > 2) { |
1830 | sc.may_swap = 1; | 1833 | sc.may_swap = 1; |
1831 | sc.swappiness = 100; | 1834 | sc.swappiness = 100; |
1832 | } | 1835 | } |
1833 | 1836 | ||
1834 | for (prio = DEF_PRIORITY; prio >= 0; prio--) { | 1837 | for (prio = DEF_PRIORITY; prio >= 0; prio--) { |
1835 | unsigned long nr_to_scan = nr_pages - ret; | 1838 | unsigned long nr_to_scan = nr_pages - ret; |
1836 | 1839 | ||
1837 | sc.nr_scanned = 0; | 1840 | sc.nr_scanned = 0; |
1838 | ret += shrink_all_zones(nr_to_scan, prio, pass, &sc); | 1841 | ret += shrink_all_zones(nr_to_scan, prio, pass, &sc); |
1839 | if (ret >= nr_pages) | 1842 | if (ret >= nr_pages) |
1840 | goto out; | 1843 | goto out; |
1841 | 1844 | ||
1842 | reclaim_state.reclaimed_slab = 0; | 1845 | reclaim_state.reclaimed_slab = 0; |
1843 | shrink_slab(sc.nr_scanned, sc.gfp_mask, | 1846 | shrink_slab(sc.nr_scanned, sc.gfp_mask, |
1844 | count_lru_pages()); | 1847 | count_lru_pages()); |
1845 | ret += reclaim_state.reclaimed_slab; | 1848 | ret += reclaim_state.reclaimed_slab; |
1846 | if (ret >= nr_pages) | 1849 | if (ret >= nr_pages) |
1847 | goto out; | 1850 | goto out; |
1848 | 1851 | ||
1849 | if (sc.nr_scanned && prio < DEF_PRIORITY - 2) | 1852 | if (sc.nr_scanned && prio < DEF_PRIORITY - 2) |
1850 | congestion_wait(WRITE, HZ / 10); | 1853 | congestion_wait(WRITE, HZ / 10); |
1851 | } | 1854 | } |
1852 | } | 1855 | } |
1853 | 1856 | ||
1854 | /* | 1857 | /* |
1855 | * If ret = 0, we could not shrink LRUs, but there may be something | 1858 | * If ret = 0, we could not shrink LRUs, but there may be something |
1856 | * in slab caches | 1859 | * in slab caches |
1857 | */ | 1860 | */ |
1858 | if (!ret) { | 1861 | if (!ret) { |
1859 | do { | 1862 | do { |
1860 | reclaim_state.reclaimed_slab = 0; | 1863 | reclaim_state.reclaimed_slab = 0; |
1861 | shrink_slab(nr_pages, sc.gfp_mask, count_lru_pages()); | 1864 | shrink_slab(nr_pages, sc.gfp_mask, count_lru_pages()); |
1862 | ret += reclaim_state.reclaimed_slab; | 1865 | ret += reclaim_state.reclaimed_slab; |
1863 | } while (ret < nr_pages && reclaim_state.reclaimed_slab > 0); | 1866 | } while (ret < nr_pages && reclaim_state.reclaimed_slab > 0); |
1864 | } | 1867 | } |
1865 | 1868 | ||
1866 | out: | 1869 | out: |
1867 | current->reclaim_state = NULL; | 1870 | current->reclaim_state = NULL; |
1868 | 1871 | ||
1869 | return ret; | 1872 | return ret; |
1870 | } | 1873 | } |
1871 | #endif | 1874 | #endif |
1872 | 1875 | ||
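shrink_all_memory() takes a page target and reports how many pages it managed to free, so a CONFIG_PM caller can loop until the target is met or reclaim stops making progress. A minimal sketch under those assumptions follows; the target value and loop shape are illustrative rather than taken from any caller in this commit.

	/* Illustrative loop: keep shrinking until the (arbitrary)
	 * target is met or no further progress is made. */
	unsigned long to_free = 10000;	/* target, in pages */
	unsigned long freed;

	do {
		freed = shrink_all_memory(to_free);
		to_free -= min(freed, to_free);
	} while (freed && to_free);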
1873 | /* It's optimal to keep kswapds on the same CPUs as their memory, but | 1876 | /* It's optimal to keep kswapds on the same CPUs as their memory, but |
1874 | not required for correctness. So if the last cpu in a node goes | 1877 | not required for correctness. So if the last cpu in a node goes |
1875 | away, we get changed to run anywhere: as the first one comes back, | 1878 | away, we get changed to run anywhere: as the first one comes back, |
1876 | restore their cpu bindings. */ | 1879 | restore their cpu bindings. */ |
1877 | static int __devinit cpu_callback(struct notifier_block *nfb, | 1880 | static int __devinit cpu_callback(struct notifier_block *nfb, |
1878 | unsigned long action, void *hcpu) | 1881 | unsigned long action, void *hcpu) |
1879 | { | 1882 | { |
1880 | int nid; | 1883 | int nid; |
1881 | 1884 | ||
1882 | if (action == CPU_ONLINE || action == CPU_ONLINE_FROZEN) { | 1885 | if (action == CPU_ONLINE || action == CPU_ONLINE_FROZEN) { |
1883 | for_each_node_state(nid, N_HIGH_MEMORY) { | 1886 | for_each_node_state(nid, N_HIGH_MEMORY) { |
1884 | pg_data_t *pgdat = NODE_DATA(nid); | 1887 | pg_data_t *pgdat = NODE_DATA(nid); |
1885 | node_to_cpumask_ptr(mask, pgdat->node_id); | 1888 | node_to_cpumask_ptr(mask, pgdat->node_id); |
1886 | 1889 | ||
1887 | if (any_online_cpu(*mask) < nr_cpu_ids) | 1890 | if (any_online_cpu(*mask) < nr_cpu_ids) |
1888 | /* One of our CPUs online: restore mask */ | 1891 | /* One of our CPUs online: restore mask */ |
1889 | set_cpus_allowed_ptr(pgdat->kswapd, mask); | 1892 | set_cpus_allowed_ptr(pgdat->kswapd, mask); |
1890 | } | 1893 | } |
1891 | } | 1894 | } |
1892 | return NOTIFY_OK; | 1895 | return NOTIFY_OK; |
1893 | } | 1896 | } |
1894 | 1897 | ||
1895 | /* | 1898 | /* |
1896 | * This kswapd start function will be called by init and node-hot-add. | 1899 | * This kswapd start function will be called by init and node-hot-add. |
1897 | * On node-hot-add, kswapd will moved to proper cpus if cpus are hot-added. | 1900 | * On node-hot-add, kswapd will moved to proper cpus if cpus are hot-added. |
1898 | */ | 1901 | */ |
1899 | int kswapd_run(int nid) | 1902 | int kswapd_run(int nid) |
1900 | { | 1903 | { |
1901 | pg_data_t *pgdat = NODE_DATA(nid); | 1904 | pg_data_t *pgdat = NODE_DATA(nid); |
1902 | int ret = 0; | 1905 | int ret = 0; |
1903 | 1906 | ||
1904 | if (pgdat->kswapd) | 1907 | if (pgdat->kswapd) |
1905 | return 0; | 1908 | return 0; |
1906 | 1909 | ||
1907 | pgdat->kswapd = kthread_run(kswapd, pgdat, "kswapd%d", nid); | 1910 | pgdat->kswapd = kthread_run(kswapd, pgdat, "kswapd%d", nid); |
1908 | if (IS_ERR(pgdat->kswapd)) { | 1911 | if (IS_ERR(pgdat->kswapd)) { |
1909 | /* failure at boot is fatal */ | 1912 | /* failure at boot is fatal */ |
1910 | BUG_ON(system_state == SYSTEM_BOOTING); | 1913 | BUG_ON(system_state == SYSTEM_BOOTING); |
1911 | printk("Failed to start kswapd on node %d\n",nid); | 1914 | printk("Failed to start kswapd on node %d\n",nid); |
1912 | ret = -1; | 1915 | ret = -1; |
1913 | } | 1916 | } |
1914 | return ret; | 1917 | return ret; |
1915 | } | 1918 | } |
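The IS_ERR() test in kswapd_run() relies on pointer-encoded errors: failure codes are packed into the top of the address space, a range no valid pointer can occupy. The real definitions live in the kernel's err.h; the helpers below are a simplified standalone re-drawing of the idea, with some_object and start_worker() invented for the example.

#include <stdio.h>
#include <errno.h>

#define MAX_ERRNO 4095          /* errors occupy the top 4095 "addresses" */

static void *ERR_PTR(long error) { return (void *)error; }
static long PTR_ERR(const void *ptr) { return (long)ptr; }
static int IS_ERR(const void *ptr)
{
        return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}

static int some_object;                         /* stands in for a real task */

static void *start_worker(int should_fail)
{
        if (should_fail)
                return ERR_PTR(-ENOMEM);        /* failure, encoded in the pointer */
        return &some_object;                    /* success: an ordinary pointer */
}

int main(void)
{
        void *w = start_worker(1);

        if (IS_ERR(w))
                printf("start failed: errno %ld\n", -PTR_ERR(w));
        return 0;
}

This is why kswapd_run() can treat the return value of kthread_run() as either a task pointer or an error without a separate status argument.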
1916 | 1919 | ||
1917 | static int __init kswapd_init(void) | 1920 | static int __init kswapd_init(void) |
1918 | { | 1921 | { |
1919 | int nid; | 1922 | int nid; |
1920 | 1923 | ||
1921 | swap_setup(); | 1924 | swap_setup(); |
1922 | for_each_node_state(nid, N_HIGH_MEMORY) | 1925 | for_each_node_state(nid, N_HIGH_MEMORY) |
1923 | kswapd_run(nid); | 1926 | kswapd_run(nid); |
1924 | hotcpu_notifier(cpu_callback, 0); | 1927 | hotcpu_notifier(cpu_callback, 0); |
1925 | return 0; | 1928 | return 0; |
1926 | } | 1929 | } |
1927 | 1930 | ||
1928 | module_init(kswapd_init) | 1931 | module_init(kswapd_init) |
1929 | 1932 | ||
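kswapd_init() walks every node that has memory and starts one kswapd thread per node. A loose userspace analogy of that startup pattern, using POSIX threads instead of kthreads (NR_NODES and node_worker() are invented for the example):

#include <pthread.h>
#include <stdio.h>

#define NR_NODES 2

static void *node_worker(void *arg)
{
        int nid = *(int *)arg;

        printf("worker for node %d running\n", nid);
        return NULL;
}

int main(void)
{
        pthread_t workers[NR_NODES];
        int ids[NR_NODES];
        int started[NR_NODES] = { 0 };
        int nid;

        for (nid = 0; nid < NR_NODES; nid++) {
                ids[nid] = nid;
                if (pthread_create(&workers[nid], NULL, node_worker, &ids[nid]) == 0)
                        started[nid] = 1;
                else
                        fprintf(stderr, "failed to start worker for node %d\n", nid);
        }
        for (nid = 0; nid < NR_NODES; nid++)
                if (started[nid])
                        pthread_join(workers[nid], NULL);
        return 0;
}

Compile with -pthread; unlike the kernel, the sketch joins its workers instead of leaving them running as daemons.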
1930 | #ifdef CONFIG_NUMA | 1933 | #ifdef CONFIG_NUMA |
1931 | /* | 1934 | /* |
1932 | * Zone reclaim mode | 1935 | * Zone reclaim mode |
1933 | * | 1936 | * |
1934 | * If non-zero call zone_reclaim when the number of free pages falls below | 1937 | * If non-zero call zone_reclaim when the number of free pages falls below |
1935 | * the watermarks. | 1938 | * the watermarks. |
1936 | */ | 1939 | */ |
1937 | int zone_reclaim_mode __read_mostly; | 1940 | int zone_reclaim_mode __read_mostly; |
1938 | 1941 | ||
1939 | #define RECLAIM_OFF 0 | 1942 | #define RECLAIM_OFF 0 |
1940 | #define RECLAIM_ZONE (1<<0) /* Run shrink_cache on the zone */ | 1943 | #define RECLAIM_ZONE (1<<0) /* Run shrink_cache on the zone */ |
1941 | #define RECLAIM_WRITE (1<<1) /* Writeout pages during reclaim */ | 1944 | #define RECLAIM_WRITE (1<<1) /* Writeout pages during reclaim */ |
1942 | #define RECLAIM_SWAP (1<<2) /* Swap pages out during reclaim */ | 1945 | #define RECLAIM_SWAP (1<<2) /* Swap pages out during reclaim */ |
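How these bits drive behaviour can be seen in __zone_reclaim() further down, where may_writepage and may_swap are derived directly from the mode word. The standalone decode below uses an example value with all three bits set; that value is only an illustration, not a recommendation.

#include <stdio.h>

#define RECLAIM_ZONE  (1 << 0)  /* run page reclaim on the zone          */
#define RECLAIM_WRITE (1 << 1)  /* allow writing dirty pages out         */
#define RECLAIM_SWAP  (1 << 2)  /* allow swapping mapped pages out       */

int main(void)
{
        int zone_reclaim_mode = RECLAIM_ZONE | RECLAIM_WRITE | RECLAIM_SWAP;

        /* Same derivation as the scan_control setup in __zone_reclaim(). */
        int may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE);
        int may_swap      = !!(zone_reclaim_mode & RECLAIM_SWAP);

        printf("mode=%d writepage=%d swap=%d\n",
               zone_reclaim_mode, may_writepage, may_swap);
        return 0;
}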
1943 | 1946 | ||
1944 | /* | 1947 | /* |
1945 | * Priority for ZONE_RECLAIM. This determines the fraction of pages | 1948 | * Priority for ZONE_RECLAIM. This determines the fraction of pages |
1946 | * of a node considered for each zone_reclaim. 4 scans 1/16th of | 1949 | * of a node considered for each zone_reclaim. 4 scans 1/16th of |
1947 | * a zone. | 1950 | * a zone. |
1948 | */ | 1951 | */ |
1949 | #define ZONE_RECLAIM_PRIORITY 4 | 1952 | #define ZONE_RECLAIM_PRIORITY 4 |
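The "4 scans 1/16th of a zone" remark follows from the scanner looking at roughly zone_size >> priority pages per attempt. A quick worked example (the zone size below is made up; 262144 pages is a 1 GiB zone of 4 KiB pages):

#include <stdio.h>

int main(void)
{
        unsigned long zone_pages = 262144;      /* example: 1 GiB of 4 KiB pages */
        int priority = 4;                       /* ZONE_RECLAIM_PRIORITY */

        /* zone_pages >> 4 == zone_pages / 16 */
        printf("scan ~%lu of %lu pages (1/%u of the zone)\n",
               zone_pages >> priority, zone_pages, 1u << priority);
        return 0;
}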
1950 | 1953 | ||
1951 | /* | 1954 | /* |
1952 | * Percentage of pages in a zone that must be unmapped for zone_reclaim to | 1955 | * Percentage of pages in a zone that must be unmapped for zone_reclaim to |
1953 | * occur. | 1956 | * occur. |
1954 | */ | 1957 | */ |
1955 | int sysctl_min_unmapped_ratio = 1; | 1958 | int sysctl_min_unmapped_ratio = 1; |
1956 | 1959 | ||
1957 | /* | 1960 | /* |
1958 | * If the number of slab pages in a zone grows beyond this percentage then | 1961 | * If the number of slab pages in a zone grows beyond this percentage then |
1959 | * slab reclaim needs to occur. | 1962 | * slab reclaim needs to occur. |
1960 | */ | 1963 | */ |
1961 | int sysctl_min_slab_ratio = 5; | 1964 | int sysctl_min_slab_ratio = 5; |
1962 | 1965 | ||
1963 | /* | 1966 | /* |
1964 | * Try to free up some pages from this zone through reclaim. | 1967 | * Try to free up some pages from this zone through reclaim. |
1965 | */ | 1968 | */ |
1966 | static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) | 1969 | static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) |
1967 | { | 1970 | { |
1968 | /* Minimum pages needed in order to stay on node */ | 1971 | /* Minimum pages needed in order to stay on node */ |
1969 | const unsigned long nr_pages = 1 << order; | 1972 | const unsigned long nr_pages = 1 << order; |
1970 | struct task_struct *p = current; | 1973 | struct task_struct *p = current; |
1971 | struct reclaim_state reclaim_state; | 1974 | struct reclaim_state reclaim_state; |
1972 | int priority; | 1975 | int priority; |
1973 | unsigned long nr_reclaimed = 0; | 1976 | unsigned long nr_reclaimed = 0; |
1974 | struct scan_control sc = { | 1977 | struct scan_control sc = { |
1975 | .may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE), | 1978 | .may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE), |
1976 | .may_swap = !!(zone_reclaim_mode & RECLAIM_SWAP), | 1979 | .may_swap = !!(zone_reclaim_mode & RECLAIM_SWAP), |
1977 | .swap_cluster_max = max_t(unsigned long, nr_pages, | 1980 | .swap_cluster_max = max_t(unsigned long, nr_pages, |
1978 | SWAP_CLUSTER_MAX), | 1981 | SWAP_CLUSTER_MAX), |
1979 | .gfp_mask = gfp_mask, | 1982 | .gfp_mask = gfp_mask, |
1980 | .swappiness = vm_swappiness, | 1983 | .swappiness = vm_swappiness, |
1981 | .isolate_pages = isolate_pages_global, | 1984 | .isolate_pages = isolate_pages_global, |
1982 | }; | 1985 | }; |
1983 | unsigned long slab_reclaimable; | 1986 | unsigned long slab_reclaimable; |
1984 | 1987 | ||
1985 | disable_swap_token(); | 1988 | disable_swap_token(); |
1986 | cond_resched(); | 1989 | cond_resched(); |
1987 | /* | 1990 | /* |
1988 | * We need to be able to allocate from the reserves for RECLAIM_SWAP | 1991 | * We need to be able to allocate from the reserves for RECLAIM_SWAP |
1989 | * and we also need to be able to write out pages for RECLAIM_WRITE | 1992 | * and we also need to be able to write out pages for RECLAIM_WRITE |
1990 | * and RECLAIM_SWAP. | 1993 | * and RECLAIM_SWAP. |
1991 | */ | 1994 | */ |
1992 | p->flags |= PF_MEMALLOC | PF_SWAPWRITE; | 1995 | p->flags |= PF_MEMALLOC | PF_SWAPWRITE; |
1993 | reclaim_state.reclaimed_slab = 0; | 1996 | reclaim_state.reclaimed_slab = 0; |
1994 | p->reclaim_state = &reclaim_state; | 1997 | p->reclaim_state = &reclaim_state; |
1995 | 1998 | ||
1996 | if (zone_page_state(zone, NR_FILE_PAGES) - | 1999 | if (zone_page_state(zone, NR_FILE_PAGES) - |
1997 | zone_page_state(zone, NR_FILE_MAPPED) > | 2000 | zone_page_state(zone, NR_FILE_MAPPED) > |
1998 | zone->min_unmapped_pages) { | 2001 | zone->min_unmapped_pages) { |
1999 | /* | 2002 | /* |
2000 | * Free memory by calling shrink zone with increasing | 2003 | * Free memory by calling shrink zone with increasing |
2001 | * priorities until we have enough memory freed. | 2004 | * priorities until we have enough memory freed. |
2002 | */ | 2005 | */ |
2003 | priority = ZONE_RECLAIM_PRIORITY; | 2006 | priority = ZONE_RECLAIM_PRIORITY; |
2004 | do { | 2007 | do { |
2005 | note_zone_scanning_priority(zone, priority); | 2008 | note_zone_scanning_priority(zone, priority); |
2006 | nr_reclaimed += shrink_zone(priority, zone, &sc); | 2009 | nr_reclaimed += shrink_zone(priority, zone, &sc); |
2007 | priority--; | 2010 | priority--; |
2008 | } while (priority >= 0 && nr_reclaimed < nr_pages); | 2011 | } while (priority >= 0 && nr_reclaimed < nr_pages); |
2009 | } | 2012 | } |
2010 | 2013 | ||
2011 | slab_reclaimable = zone_page_state(zone, NR_SLAB_RECLAIMABLE); | 2014 | slab_reclaimable = zone_page_state(zone, NR_SLAB_RECLAIMABLE); |
2012 | if (slab_reclaimable > zone->min_slab_pages) { | 2015 | if (slab_reclaimable > zone->min_slab_pages) { |
2013 | /* | 2016 | /* |
2014 | * shrink_slab() does not currently allow us to determine how | 2017 | * shrink_slab() does not currently allow us to determine how |
2015 | * many pages were freed in this zone. So we take the current | 2018 | * many pages were freed in this zone. So we take the current |
2016 | * number of slab pages and shake the slab until it is reduced | 2019 | * number of slab pages and shake the slab until it is reduced |
2017 | * by the same nr_pages that we used for reclaiming unmapped | 2020 | * by the same nr_pages that we used for reclaiming unmapped |
2018 | * pages. | 2021 | * pages. |
2019 | * | 2022 | * |
2020 | * Note that shrink_slab will free memory on all zones and may | 2023 | * Note that shrink_slab will free memory on all zones and may |
2021 | * take a long time. | 2024 | * take a long time. |
2022 | */ | 2025 | */ |
2023 | while (shrink_slab(sc.nr_scanned, gfp_mask, order) && | 2026 | while (shrink_slab(sc.nr_scanned, gfp_mask, order) && |
2024 | zone_page_state(zone, NR_SLAB_RECLAIMABLE) > | 2027 | zone_page_state(zone, NR_SLAB_RECLAIMABLE) > |
2025 | slab_reclaimable - nr_pages) | 2028 | slab_reclaimable - nr_pages) |
2026 | ; | 2029 | ; |
2027 | 2030 | ||
2028 | /* | 2031 | /* |
2029 | * Update nr_reclaimed by the number of slab pages we | 2032 | * Update nr_reclaimed by the number of slab pages we |
2030 | * reclaimed from this zone. | 2033 | * reclaimed from this zone. |
2031 | */ | 2034 | */ |
2032 | nr_reclaimed += slab_reclaimable - | 2035 | nr_reclaimed += slab_reclaimable - |
2033 | zone_page_state(zone, NR_SLAB_RECLAIMABLE); | 2036 | zone_page_state(zone, NR_SLAB_RECLAIMABLE); |
2034 | } | 2037 | } |
2035 | 2038 | ||
2036 | p->reclaim_state = NULL; | 2039 | p->reclaim_state = NULL; |
2037 | current->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE); | 2040 | current->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE); |
2038 | return nr_reclaimed >= nr_pages; | 2041 | return nr_reclaimed >= nr_pages; |
2039 | } | 2042 | } |
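One detail worth calling out in __zone_reclaim() above: because shrink_slab() cannot report per-zone frees, the code snapshots NR_SLAB_RECLAIMABLE, shakes the slab until the counter has dropped by nr_pages or progress stops, and then credits the difference. The standalone sketch below reproduces only that accounting; cache_pages and shake_cache() are invented stand-ins, and the numbers are chosen so the target is smaller than the cache.

#include <stdio.h>

static unsigned long cache_pages = 1000;        /* stands in for NR_SLAB_RECLAIMABLE */

static unsigned long shake_cache(void)          /* stands in for shrink_slab() */
{
        unsigned long freed = cache_pages >= 64 ? 64 : cache_pages;

        cache_pages -= freed;
        return freed;                           /* 0 means no further progress */
}

int main(void)
{
        unsigned long nr_pages = 256;           /* reclaim target */
        unsigned long before = cache_pages;     /* snapshot the counter */
        unsigned long nr_reclaimed = 0;

        /* Keep shaking while progress is made and the counter has not
         * yet dropped by the full target. */
        while (shake_cache() && cache_pages > before - nr_pages)
                ;

        /* Credit only the delta this loop actually produced. */
        nr_reclaimed += before - cache_pages;
        printf("reclaimed %lu cache pages\n", nr_reclaimed);
        return 0;
}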
2040 | 2043 | ||
2041 | int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) | 2044 | int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) |
2042 | { | 2045 | { |
2043 | int node_id; | 2046 | int node_id; |
2044 | int ret; | 2047 | int ret; |
2045 | 2048 | ||
2046 | /* | 2049 | /* |
2047 | * Zone reclaim reclaims unmapped file backed pages and | 2050 | * Zone reclaim reclaims unmapped file backed pages and |
2048 | * slab pages if we are over the defined limits. | 2051 | * slab pages if we are over the defined limits. |
2049 | * | 2052 | * |
2050 | * A small portion of unmapped file backed pages is needed for | 2053 | * A small portion of unmapped file backed pages is needed for |
2051 | * file I/O otherwise pages read by file I/O will be immediately | 2054 | * file I/O otherwise pages read by file I/O will be immediately |
2052 | * thrown out if the zone is overallocated. So we do not reclaim | 2055 | * thrown out if the zone is overallocated. So we do not reclaim |
2053 | * if less than a specified percentage of the zone is used by | 2056 | * if less than a specified percentage of the zone is used by |
2054 | * unmapped file backed pages. | 2057 | * unmapped file backed pages. |
2055 | */ | 2058 | */ |
2056 | if (zone_page_state(zone, NR_FILE_PAGES) - | 2059 | if (zone_page_state(zone, NR_FILE_PAGES) - |
2057 | zone_page_state(zone, NR_FILE_MAPPED) <= zone->min_unmapped_pages | 2060 | zone_page_state(zone, NR_FILE_MAPPED) <= zone->min_unmapped_pages |
2058 | && zone_page_state(zone, NR_SLAB_RECLAIMABLE) | 2061 | && zone_page_state(zone, NR_SLAB_RECLAIMABLE) |
2059 | <= zone->min_slab_pages) | 2062 | <= zone->min_slab_pages) |
2060 | return 0; | 2063 | return 0; |
2061 | 2064 | ||
2062 | if (zone_is_all_unreclaimable(zone)) | 2065 | if (zone_is_all_unreclaimable(zone)) |
2063 | return 0; | 2066 | return 0; |
2064 | 2067 | ||
2065 | /* | 2068 | /* |
2066 | * Do not scan if the allocation should not be delayed. | 2069 | * Do not scan if the allocation should not be delayed. |
2067 | */ | 2070 | */ |
2068 | if (!(gfp_mask & __GFP_WAIT) || (current->flags & PF_MEMALLOC)) | 2071 | if (!(gfp_mask & __GFP_WAIT) || (current->flags & PF_MEMALLOC)) |
2069 | return 0; | 2072 | return 0; |
2070 | 2073 | ||
2071 | /* | 2074 | /* |
2072 | * Only run zone reclaim on the local zone or on zones that do not | 2075 | * Only run zone reclaim on the local zone or on zones that do not |
2073 | * have associated processors. This will favor the local processor | 2076 | * have associated processors. This will favor the local processor |
2074 | * over remote processors and spread off node memory allocations | 2077 | * over remote processors and spread off node memory allocations |
2075 | * as wide as possible. | 2078 | * as wide as possible. |
2076 | */ | 2079 | */ |
2077 | node_id = zone_to_nid(zone); | 2080 | node_id = zone_to_nid(zone); |
2078 | if (node_state(node_id, N_CPU) && node_id != numa_node_id()) | 2081 | if (node_state(node_id, N_CPU) && node_id != numa_node_id()) |
2079 | return 0; | 2082 | return 0; |
2080 | 2083 | ||
2081 | if (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED)) | 2084 | if (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED)) |
2082 | return 0; | 2085 | return 0; |
2083 | ret = __zone_reclaim(zone, gfp_mask, order); | 2086 | ret = __zone_reclaim(zone, gfp_mask, order); |
2084 | zone_clear_flag(zone, ZONE_RECLAIM_LOCKED); | 2087 | zone_clear_flag(zone, ZONE_RECLAIM_LOCKED); |
2085 | 2088 | ||
2086 | return ret; | 2089 | return ret; |
2087 | } | 2090 | } |
2088 | #endif | 2091 | #endif |
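zone_reclaim() ends with a test-and-set on ZONE_RECLAIM_LOCKED so that, after all the cheap early-exit checks, at most one caller actually reclaims a given zone at a time. The standalone sketch below shows that serialization pattern with C11 atomics rather than the kernel's zone flags; do_reclaim() is a made-up stand-in for __zone_reclaim().

#include <stdatomic.h>
#include <stdio.h>

static atomic_flag zone_reclaim_locked = ATOMIC_FLAG_INIT;

static int do_reclaim(void)
{
        return 1;                               /* pretend we freed enough pages */
}

static int zone_reclaim_once(void)
{
        int ret;

        if (atomic_flag_test_and_set(&zone_reclaim_locked))
                return 0;                       /* another caller is already reclaiming */
        ret = do_reclaim();
        atomic_flag_clear(&zone_reclaim_locked);
        return ret;
}

int main(void)
{
        printf("first call: %d, second call after unlock: %d\n",
               zone_reclaim_once(), zone_reclaim_once());
        return 0;
}

Returning 0 when the flag is already set mirrors the kernel's choice: a concurrent caller simply falls back to allocating elsewhere instead of piling onto the same zone.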
2089 | 2092 |